capOS Documentation
cap-os.dev documents the current capOS implementation: the implemented operating model, build and boot workflow, runnable demos, architecture, configuration surface, and security and verification boundaries.
capOS is a research operating system where kernel and userspace services are typed Cap’n Proto capabilities invoked through shared-memory rings. The manual focuses on behavior that exists or is directly reviewable in this repository; project plans, proposals, and research notes remain available as archives rather than driving the primary reading path.
The Basic Idea
capOS is an experiment in making an operating system easier to reason about. In familiar operating systems, a program’s power is spread across many mechanisms: system calls, file paths, sockets, process identity, permissions, environment variables, inherited handles, and service-specific protocols. That model is flexible, but it can be hard to answer simple questions: what can this program actually do, who gave it that power, and can that power be passed, revoked, recorded, or moved somewhere else without hidden side effects?
capOS tries a different tradeoff. A program can act only through explicit typed capabilities it already holds. The interface is the permission: instead of giving a broad handle plus a separate rights mask, capOS gives a narrower object with only the methods the caller should have. The same Cap’n Proto schema describes the kernel call, the service call, and the wire format used between processes.
If that approach works, it should make several things more natural: running small services, tools, and future AI agents with least authority, handing a resource from one program to another without accidentally duplicating it, auditing or replaying service traffic, and eventually moving services across persistence or network boundaries without inventing a second permission model. capOS is not a production OS or a Linux replacement; it is a prototype for testing whether those design choices hold together in real runnable code.
Start Here
For a printable current-system reference, use the PDF manual; planning archives and research notes remain on the website.
- What capOS Is describes the implemented system model and the main authority boundaries.
- Current Status lists what works today, what is partial, and what remains future work.
- Build, Boot, and Test gives the commands used to build the ISO, boot QEMU, and run host-side validation.
- Configuration explains operator overlays, host-user tag injection, the tools cache, and schema-aware data conversion.
- Repository Map maps the main subsystems to source files.
- Programming Languages describes current native Rust support and the status of Python, Go, Lua, C/C++, WASI, and POSIX adapters.
- ABI Evolution Policy defines the compatibility rules for schema, ring, bootstrap, and runtime ABI changes.
- First Chat Demo shows the smallest runnable resident-service chat proof and its current single-terminal limits.
- Aurelian Frontier (proof slice) shows the current runnable multi-process slice of the Aurelian Frontier game and its QEMU proof.
- Paperclips Terminal Demo shows a clean-room incremental terminal game running as an ordinary shell-launched process.
Site Map
- System Architecture is the design reference for current behavior: boot, process, capability, runtime, memory, scheduling, IPC, threading, and park behavior.
- Programming Languages summarizes implemented native Rust support and points language-specific future work back to owning proposals.
- Security and Verification is the reviewer path: trust boundaries, validation workflow, trusted inputs, panic inventory, and DMA design.
- Runnable Demos documents the proof paths that exercise the implemented service model.
- Reference and Project Archives keeps planning, proposal, research, and topic-index material available below the manual sections without making it part of the manual PDF.
What capOS Is
A research kernel that boots on x86_64 QEMU. The rest of this page is about why it looks the way it does — the specific design bets behind the code — not a feature inventory. For the feature-by-feature matrix, see Current Status.
What Makes capOS Different
capOS is a research vehicle for a few specific design bets. Each is unusual on its own; the combination is the point.
- Everything is a typed capability. System resources are accessed through
Cap’n Proto interfaces defined in
schema/capos.capnp. There is no ambient authority — no global path namespace, no open-by-name, no implicit inherit. A process can only invoke objects present in its local capability table. See Capability Model and the schema/repo map. - The interface IS the permission. Instead of a parallel READ/WRITE/EXEC
rights bitmask (Zircon, seL4), attenuation is a narrower capability: a
wrapper
CapObjectexposing fewer methods, or anEndpointclient facet that cannotRECV/RETURN. The kernel just dispatches; policy lives in interfaces. See Capability Model, IPC and Endpoints, and the prior-art notes on Zircon and seL4. - Identity metadata is not authority. In prose, a user is the human-facing actor, a principal is identity metadata, an account is planned durable local record state, and policy/resource profiles select bundles and quotas. Sessions receive capabilities; none of those labels become kernel subjects or bypass cap-table authority. See the local users backlog, User Identity and Policy, and Resource Accounting and Quotas.
- io_uring-style shared-memory ring for every call. Every process owns a
submission/completion queue page. Userspace writes SQEs with a normal
memory store; the kernel processes them through
cap_enter. New operations are SQE opcodes (CALL,RECV,RETURN,RELEASE,NOP), not new syscalls. The remaining syscall surface iscap_enterandexit; the accepted threading contract keeps current-thread exit as aThreadControlcapability operation. See Capability Ring, Userspace Runtime, and In-Process Threading. - Release is transport, not an application method. Dropping the last
owned handle in
capos-rtqueues one localCAP_OP_RELEASE; acquiring or dropping a runtime ring client flushes the queue, and long-running code can callRuntime::flush_releases()explicitly. Noclose()method on every interface, no mutable table self-reference during dispatch. See Userspace Runtime and Capability Ring. - Capability transfer is first-class. Copy and move descriptors ride
sideband on
CALL/RETURNSQEs. Move reserves the sender slot until the receiver accepts and preflight checks pass, then commits or rolls back atomically — no lost, duplicated, or half-inserted authority. See Authority Accounting and IPC and Endpoints. - Cap’n Proto wire format end-to-end. The same encoding describes the boot manifest, runtime method calls, and future persistence/remote transparency. The debug tap records fixed, bounded SQE/CQE metadata today; authorized payload capture, replay, audit, and migration remain future transport work. See Manifest and Service Startup, Error Handling, and Storage and Naming.
- Host-testable pure logic. Cap-table, frame-bitmap, ELF parser, frame
ledger, lazy buffers, small ABI constants, and the ring model live in
capos-lib,capos-abi, andcapos-config, and run undercargo test-lib, Miri, Loom, Kani, andproptestwithout any kernel scaffolding. Kernel glue stays thin. See Verification Workflow and Repository Map. - Schema-first boot.
system.cueis compiled to a Cap’n ProtoSystemManifestembedded as the single Limine boot module. The kernel validates only the kernel-owned boot boundary and launchesinitConfig.init;mkmanifestand init validate the service graph underinitConfig.servicesas structured data, not shell scripts or baked environment variables. See Boot Flow, Manifest and Service Startup, and Build, Boot, and Test.
Execution Model
Each process owns an address space, a local capability table, a mapped
capability-ring page, and a read-only CapSet page that enumerates its
bootstrap handles. The kernel enters Ring 3 with iretq and returns through
cap_enter or the timer. Ordinary capability calls progress only via
cap_enter; timer-side polling handles non-CALL ring work and call targets
that are explicitly safe for interrupt dispatch. Details in
Process Model,
Capability Ring,
In-Process Threading, and
Scheduling.
Boot Flow
The kernel receives exactly one Limine module — a Cap’n Proto
SystemManifest compiled from system.cue — validates the kernel-owned boot
boundary, loads only initConfig.init.binary, builds that process’s bootstrap
capability table and CapSet page from initConfig.init.caps, and starts the
scheduler. The default manifest now boots the standalone init ELF, and init
validates the service graph before spawning the foreground capos-shell, the
remote-session CapSet gateway, and the resident demo services. The shell mints an
anonymous UserSession when it starts and the user runs login or setup as
ordinary shell commands to upgrade to an operator session. Focused shell-led
manifests such as system-smoke.cue and system-shell.cue still boot
capos-shell directly as initConfig.init until the run-target/init-policy
cleanup migrates them. Full walkthrough in
Boot Flow and
Manifest and Service Startup.
Authority Boundaries
Authority is carried by cap-table hold edges with generation-tagged
CapIds. Ring 0 ↔ Ring 3, capability table ↔ kernel object, endpoint IPC,
copy/move transfer, manifest/boot-package, and process spawn are the
boundaries reviewers care about; each one fails closed at hostile input. See
Trust Boundaries for the boundary table and
Authority Accounting for the
transfer and quota invariants.
What capOS Is Not
A POSIX clone, a microkernel-shaped Linux replacement, or a production OS. It is a place to try the above choices and see which ones survive contact with real workloads. See Build, Boot, and Test to run it.
Current Status
This page describes current repository behavior, not the full long-term design.
Current Snapshot
capOS boots on x86_64 QEMU, starts a standalone init process from the default
manifest, and runs the native shell plus resident demo services through typed
capabilities. The current operator path starts as an anonymous shell session;
login prompts for username> and hidden password>, validates the selected
bootstrap account through SessionManager and CredentialStore, and upgrades
to a broker-issued operator bundle. Default password-authenticated local
operator sessions do not expire by wall-clock timestamp; they remain intended
to end through logout, terminal/connection/process-tree close, or
administrator revocation. Manifests can still set a non-default operator
lifetime for focused expiry proofs. setup can create a volatile local
operator credential and then follows the same login upgrade path.
The implemented baseline includes isolated processes, user-mode ELF loading, shared-memory capability rings, endpoint IPC, copy/move capability transfer, thread and park primitives, init-owned service spawning, local shell login, focused Telnet shell demo, resident chat/adventure services, and the Paperclips terminal demo. The current selected milestone is GCE Self-Hosted Web UI: the next visible path is serving the remote-session Web UI through the Phase C userspace network stack and proving private GCE reachability before any public endpoint. Installable System is the completed previous selected milestone for the bounded local/QEMU contract: data-region mount, config-overlay merge, generation/rollback machinery, integrated bootable disk, install, first-boot provision, update/rollback, and structural proposal/body wording reconcile have landed. Device Driver Foundation is also complete; its GCP-first provider rollup has live operator-access, selected NIC raw-frame, selected storage I/O, and gVNIC portability evidence. Durable multi-account credential storage, broader account policy, production SSH/WebShell ingress, public GCE ingress/TLS, AWS/Azure providers, broader storage variants, high-throughput NIC work, direct-remapping DMA, and persistence beyond the landed installable data-region and generation paths remain future work.
Recent Status Notes
Update 2026-06-11 19:21 UTC: the GCE Self-Hosted Web UI local readiness wave
is reconciled here. Since the local Web UI L4 proof, the following landed as
local QEMU/cloudboot or no-spend harness evidence only: legacy GCE virtio-net
Web UI serving (make run-cloud-gce-legacy-virtio-webui-serving proves a host
HTTP peer fetching the byte-verified UI bundle over the kernel-brokered legacy
virtio 0.9 runtime that backs the typed Nic cap), a browser-facing hardening
set proved on the L4 gate (single public-origin policy, IAP-aware SameSite
cookie policy, JSON content-type guard, security response headers with a strict
CSP, GFE-range-pinned X-Forwarded-Proto trust, the public /healthz
health-check contract, and in-guest login peer-gate/failure-backoff hardening),
and a no-spend provider-harness fixture set (the private-proof
--preflight-only mode, private and public proof-evidence validators, the
public ingress resource plan gate, the journal-driven teardown engine, and the
provider-command allowlist gate), each fixture gate driving recording stub
provider CLIs only, with no real provider invocation or mutation on any
path. None of this proves private GCE
reachability, public exposure, TLS custody, or production readiness; the live
private and public proofs stay on hold.
The selected GCE Self-Hosted Web UI evidence ladder is:
- Landed local serve-from-userspace proof:
cloud-prod-userspace-network-stack-smoltcp-local-proofproves a non-qemucloudboot application client using a userspace-servedTcpListenAuthorityfor one local hostfwd TCP request/response. This is local QEMU/cloudboot evidence, not provider reachability. - Landed local IPv4 configuration proof:
cloud-prod-network-stack-dhcp-ipv4-config-local-proofproves DHCP/IPv4 lease, default route, and ARP/neighbor state for the userspace smoltcp network-stack process under local QEMU/cloudboot. - Landed cloudboot Web UI authority inventory:
remote-session-webui-cloudboot-authority-inventoryrecords theremote-session-web-uirequired and forbidden grants, trusted listener/source metadata, browser-visible forbidden markers, and expected local L4 proof markers for the non-qemucloudboot path. - Landed local Web UI L4 proof and hardening set:
cloud-prod-remote-session-web-ui-l4-local-proofservesremote-session-web-uithrough the non-qemucloudboot L4 path with the full fixed-name bundle, browser login, backend-heldSystemInfo, logout/stale failure, manual viewer, and browser-boundary checks. The samemake run-cloud-prod-remote-session-web-ui-l4gate now also proves server-side session hardening, per-connection deadlines, in-guest login peer-gate and failure-backoff hardening, the single public-origin policy, the IAP-aware SameSite cookie policy, the JSON content-type guard, the security response headers and strict CSP, the GFE-range-pinned forwarded-scheme trust, and the public/healthzhealth-check contract. All of this is local QEMU/cloudboot evidence, not private provider reachability or public exposure. - Landed legacy GCE virtio-net serving proof:
cloud-gce-legacy-virtio-webui-serving-local-proofcloses the legacy-virtio serving gap locally:make run-cloud-gce-legacy-virtio-webui-servingproves a host HTTP peer fetching the byte-verified Web UI bundle over the kernel-brokered legacy virtio 0.9 runtime underdisable-modern=on. This closes only the local serving story for the GCE NIC shape; it is not live GCE reachability. - Landed no-spend provider-harness gates: the private-proof harness
--preflight-onlymode, the private and public proof-evidence validators, the public ingress resource plan gate, the journal-driven teardown engine, and the provider-command allowlist gate validate the future private/public proofs’ evidence, resource graph, teardown, and provider-command boundaries against recording stub provider CLIs; no real provider is invoked or mutated on any fixture path. These are local fixture gates only; they authorize no spend, exposure, or provider mutation. A matching public-harness no-spend preflight task is dispatchable future work, not landed. - Private GCE proof:
cloud-gce-private-self-hosted-webui-proofremains on hold on missing firewall IAM against GCE default-deny ingress (shared with the on-hold private ICMP proof) and on per-run billable authorization; the legacy-virtio serving gap was closed locally on 2026-06-11. It must keep the no-public-IP cloudboot posture and prove a private probe that crosses the live GCE NIC. - On-hold public ingress/TLS proof:
cloud-gce-public-self-hosted-webui-ingress-tlsstill requires the private proof first and explicit public-exposure approval. The local plan/teardown/evidence/allowlist gates above bound that future run but are not exposure authorization. The status above does not authorize public IPs, firewall widening, DNS, certificate issuance, TLS key custody, browser production readiness, or a release. The capOS-terminated TLS successor (cloud-tls-self-hosted-webui-terminated-endpointand the on-hold Let’s Encrypt direct-termination proof) is a separate later evidence class behind the provider-terminated first public proof.
Update 2026-06-08 10:21 UTC: local bounded ICMPv4 Echo Reply diagnostics have
landed for the Phase C userspace network stack. make run-cloud-prod-icmp-echo-reply first reruns the served
TcpListenAuthority local proof, then boots the ICMP manifest, acquires the
QEMU SLIRP DHCP lease 10.0.2.15/24, proves same-subnet ARP and ICMP Echo
Request / Echo Reply preservation for identifier 0x04d2, sequence 9, and a
23-byte payload, and rejects bad-checksum, invalid-code, truncated,
address-family, oversized-payload, and oversized-frame controls. This answers
bounded local ping for diagnostics only; it is not Web UI readiness and does
not open public ICMP or change GCE firewall posture.
Update 2026-06-08 08:24 UTC: Phase C slice 7c-ii(b) has a local
serve-from-userspace proof, and the legacy kernel socket-path grant is retired
for non-qemu production manifests. make run-cloud-prod-userspace-network-stack-smoltcp boots the non-qemu cloudboot
manifest, starts a userspace smoltcp network-stack service, grants an
application client only Console plus a served TcpListenAuthority, and
completes one hostfwd TCP request/response through served
TcpListener/TcpSocket caps while preserving host_physical_user_visible=0.
Non-qemu manifests that request kernel network_manager or
tcp_listen_authority now fail closed instead of reaching
kernel/src/virtio_stub.rs; remaining kernel socket grants are qemu-only
fixtures. DHCP/IPv4 configuration, Web UI L4, private GCE reachability, public
ingress/TLS, and kernel smoltcp/virtio-net cleanup remain separate tasks. The
current GCE Self-Hosted Web UI evidence ladder lives in the
2026-06-11 19:21 UTC update above.
Update 2026-06-07 18:20 UTC (through commit 12b8334a, committed
2026-06-07 18:19 UTC): Installable System is closed for the bounded local/QEMU
contract. The closeout reconciles the proposal, backlog, proposal index,
roadmap, and status page to the landed data-region, overlay, generation,
install, provision, and update/rollback proofs while preserving the RAM-only
Namespace caveat and leaving secure boot/signing, production release
authority, public ingress, broader provider support, full userspace smoltcp/L4
readiness, and full durable account policy as future work. The selected
milestone is now GCE Self-Hosted Web UI.
Update 2026-06-07 08:23 UTC (commit ef8d98c2): Device Driver Foundation
production-authority closeout is recorded by
ddf-production-authority-closeout. The closeout ties together the landed
provider-driver, interrupt, audit, and DMA-policy prerequisites and keeps
public ingress, AWS/Azure support, direct-remapping hardware, device-autonomous
MSI-X, high-throughput NIC, and userspace smoltcp/L4 readiness as future
follow-up work.
Update 2026-06-07 05:26 UTC: the GCP-first usable-instance provider rollup is
closed by cloud-usable-instance-provider-nic-storage. The rollup cites real
GCE serial-console operator access (1779868872-2424), live legacy virtio-net
raw-frame provider-nic-bound (1780412056-e1cb), live NVMe Persistent Disk
brokered READ (1780806087-bf69), and separate live gVNIC raw-frame /
typed-Nic portability evidence (1780794927-1aa9, 1780796615-decc). This is
not a public L4 ingress, SSH/WebShell, AWS/Azure, high-throughput NIC, broader
storage, direct-remapping DMA, or production cloud-image release claim.
Update 2026-05-23 16:51 UTC (commit c86374f8): make run-ddf-provider-consumer now has a stable userspace virtio-net provider
closeout proof line. The line ties together selected queue 1 TX
descriptor/avail/doorbell/used-ring/CQ ownership across the full QEMU TX queue
depth, bounded queue 0 RX synthetic-token CQ identity, selected TX/RX
MSI-X/LAPIC wait/ack/EOI, selected-route reset/reassignment, teardown,
stale-handle blocking, and explicit no-silent-provider-fallback boundaries.
This remains local bounded provider evidence over manager-owned bounce buffers:
live hardware RX used-ring ownership, full virtio-net ownership, direct
DMA/IOMMU, cloud NIC/storage readiness, and virtio block/storage drivers remain
future work.
Update 2026-05-23 13:36 UTC (commit e248d42b): make run-ddf-provider-consumer now exercises selected userspace virtio-net TX CQ
ownership across the full eight-entry TX queue depth used by the smoke. Eight
manager-owned bounce buffers can be live before the first completion, the ninth
allocation fails closed, wrong-order completion of descriptor 7 preserves
descriptor 0, CQ identity is delivered and acknowledged in order for
descriptors 0 through 7, release drains seven incomplete descriptors as
teardown-only, and provider TX release retires seven delivered but
unacknowledged CQ events. This remains bounded selected-queue evidence over
bounce buffers: live hardware RX used-ring ownership, direct DMA/IOMMU, full
userspace virtio-net ownership, and cloud NIC/storage readiness remain open.
Update 2026-05-12 16:40 UTC: make run-ddf-provider-consumer now exercises
the selected userspace virtio-net TX path through bounded queue 1
descriptor/avail publication, exactly one selected notify doorbell, and a
runtime-visible tx_interrupt completion event tied to the used-ring handoff.
The selected submit path validates the live DMABuffer record, scrubs the
bounce page, consumes the live no-write notify_mmio policy, publishes the
stored descriptor/avail entry, and rings the notify doorbell only after those
gates pass. DMABuffer.completeDescriptor then observes the real TX used-ring
entry for the stored software descriptor generation, clears the manager
in-flight record, and delivers a bounded
selected-tx-used-ring-completeDescriptor event to a live tx_interrupt.wait
for the same route. This is still selected-route proof coverage rather than
full userspace virtio-net ownership: arbitrary doorbells, production NIC or
storage migration, cloud readiness, hardware IRQ ownership, hardware
acknowledgement/mask/unmask, direct DMA, IOMMU programming, broader CQ
ownership, and grantable device ownership remain open.
Update 2026-05-11 14:39 UTC, commit f04a14f4: make run-ddf-provider-consumer now turns the selected userspace virtio-net TX
doorbell gate into an explicit staged claimed-notify-offset admission proof.
The selected queue 1 provider entry reports accepted notify-offset policy,
blocked wrong-queue policy, blocked wrong-offset policy, and no_doorbell=true
after descriptor authority validation and submit scrub. The same smoke submits
queue 0 first and proves it remains neutral rather than selected/backend
doorbell-capable. This remains bounded manager-owned bounce-buffer evidence
only: no virtio-net notify BAR handle is granted, no notify register is
written, no real virtio-net descriptor ring is mutated, and production
userspace NIC or cloud readiness is not claimed.
Update 2026-05-11 12:01 UTC: make run-ddf-provider-consumer now extends the
bounded provider-visible submit effect with a selected provider-owned queue
entry. Accepted DMABuffer.submitDescriptor still validates descriptor
authority and scrubs the manager-owned bounce page first, then writes queue
magic, queue id, tail, descriptor id, submitted length, and flags before the
submit marker. The focused smoke maps the buffer after completion and proves
the queue entry and marker remain visible outside the completed byte range; it
also rejects submits shorter than the full 24-byte provider-effect footprint
with zero in-flight accounting and no provider mutation. This is bounded
bounce-buffer evidence only: no hardware
descriptor ring or CQ is published, no direct DMA or IOMMU/remapping is
enabled, host physical/IOVA values stay hidden, no MMIO doorbell is written,
and provider-driver IRQ consumption plus cloud NIC/storage readiness remain
future work.
Update 2026-05-11 11:22 UTC: make run-ddf-provider-consumer now proves a
bounded descriptor-ring-equivalent provider side effect after
DMABuffer.submitDescriptor authority validation. The accepted submit path
scrubs the manager-owned bounce page, writes a provider-visible shadow
descriptor entry with magic, queue, descriptor id, submitted length, and
flags, and then writes the existing submit marker before the in-flight record
is committed. The same process maps the buffer after
DMABuffer.completeDescriptor and proves the shadow descriptor and marker
remain visible outside the completed byte range. A follow-up boundary check
rejects submits shorter than the 24-byte provider-effect footprint as
dmabuffer-provider-effect-too-short, preserving zero in-flight accounting and
blocking side effects. This is still bounded bounce-buffer evidence: no
hardware descriptor ring or CQ is published, no direct DMA or IOMMU/remapping
is enabled, host physical/IOVA values stay hidden, arbitrary MMIO doorbells
remain blocked, and provider-driver IRQ consumption plus cloud NIC/storage
readiness remain future work.
Update 2026-05-11 10:46 UTC: make run-ddf-provider-consumer now proves the
first bounded provider-visible DMA side effect in the four-cap provider
consumer. On accepted DMABuffer.submitDescriptor, the manager-owned bounce
page is scrubbed, then a submit marker is written only after descriptor
authority validation succeeds. The same process maps the buffer after
DMABuffer.completeDescriptor and proves the completion pattern is limited to
the completed byte range while the submit marker outside that range remains
visible. This is still bounded bounce-buffer evidence: no descriptor ring or
CQ is published, no direct DMA or IOMMU/remapping is enabled, host
physical/IOVA values stay hidden, arbitrary MMIO doorbells remain blocked, and
provider-driver IRQ consumption plus cloud NIC/storage readiness remain future
work.
Update 2026-05-11 10:06 UTC: commit c52064c0 extends the same
make run-ddf-provider-consumer four-cap provider-consumer smoke beyond
bounded DMABuffer submit/complete accounting. The service now also calls
brokered DeviceMmio.read32, the existing claimed-register
DeviceMmio.write32 path, and brokered readback before DeviceMmio.unmap.
The interrupt half proves one async Interrupt.wait completes as delivered
after route unmask, a second async wait stays pending for a kernel turn, and
Interrupt.mask completes that second waiter as cancelled. This remains
bounded provider-authority composition evidence only: DMA is still the
manager-owned bounce-buffer path, MMIO writes remain limited to the claimed
register policy with no arbitrary doorbell, IRQ behavior is bounded
route-generation-checked waiter delivery/cancellation, and production
NIC/storage migration, IOMMU/remapping, descriptor-ring mutation,
completion-queue publication, provider-driver interrupt consumption, and
cloud readiness remain future work.
Update 2026-05-11 09:25 UTC: make run-ddf-provider-consumer now extends the
same four-cap provider-consumer smoke across the bounded
DMABuffer submit/complete descriptor-accounting path. After allocating and unmapping one
manager-owned bounce buffer, the service calls typed
DMABuffer.submitDescriptor and DMABuffer.completeDescriptor, asserts
manager-inflight-recorded then manager-inflight-completed, and checks
DMAPool.info reports live_inflight=1 after submit and live_inflight=0
after completion before freeing the buffer. This remains bounded
provider-authority composition evidence only: no descriptor ring is mutated,
no CQ is published, direct DMA stays blocked, host physical/IOVA exposure stays
hidden, arbitrary MMIO writes and doorbells remain blocked, and production
NIC/block migration remains future work.
Update 2026-05-11 03:09 UTC: commit 9c0a5183 carries the
manager-owned fixed bounce-buffer DMAPool budget ledger into
DMAPool.allocateBuffer. The manifest-granted three-slot pool now attaches a
device-manager-owned budget policy with three live buffers/pages, 12288 bytes,
four queues, eight descriptors per queue, one in-flight descriptor per live
slot, zero MMIO mappings/bytes, and zero interrupt holds. With all three fixed
slots live, a fourth valid 4096-byte allocation returns no result cap and
reports result=dmapool-budget-exceeded, reason=over-buffer-budget,
sideEffect=side-effect-blocked, and bufferPresent=false before slot
selection, frame allocation, generation allocation, cap minting, or manager
ledger mutation. Imported live virtio-net proof records continue to use the
kernel-owned device_dma:virtio-net budget policy. This remains the bounded
manager-owned bounce-buffer path: direct DMA stays blocked, host physical
addresses and IOVAs stay hidden, descriptor rings and completion queues are
not mutated, and IOMMU/remapping plus production driver consumption remain
future work.
Update 2026-05-11 00:20 UTC: make run-ddf-provider-consumer now boots one
focused service that receives console, DMAPool, DeviceMmio, and
Interrupt in the same CapSet. The smoke uses existing typed runtime clients
and unchanged ABI surfaces: it validates DMAPool.info, allocates one
manager-owned bounce-buffer DMABuffer, maps and unmaps it through the
existing userspace bounce-buffer path, frees it with scrub-before-frame-free
evidence, maps and unmaps a boot-preseeded DeviceMmio BAR page read-only,
and exercises bounded Interrupt.wait, acknowledge, unmask, and mask
on the manager-attached route. The harness asserts the four-cap service spawn,
one manager-grant-source acquire/release for each authority family, and the
stable provider-consumer proof line with
authorities=dmapool,device_mmio,interrupt, no direct DMA, no host
physical/IOVA exposure, no arbitrary MMIO write or doorbell, no real interrupt
delivery requirement, and no production NIC/block migration. This is bounded
composition evidence only; it does not add schema/generated/runtime changes,
IOMMU programming, provider-driver interrupt consumption, or production
storage/network driver migration.
Update 2026-05-10 20:06 UTC: manifest-granted
DeviceMmio.write32(offset, value) now performs a bounded kernel-side
volatile MMIO write after validating the active manager-attached handle,
owner/state, region and policy binding, pure
DeviceMmioOperation::Write authority, a dword-aligned in-BAR range, and the
single PCI MSI-X metadata-derived provider claim, including BDF, BAR, BAR
base, offset, and value. The effect uses only
the boot-preseeded kernel MMIO mapping cache already used by
DeviceMmio.read32; it does not install a post-userspace kernel mapping, and
the userspace BAR VMA remains read-only. The
focused smoke uses the claimed virtio-rng MSI-X entry-0 vector-control mask
dword, reports side_effect=mmio-write-performed and
register_write=performed, then reads the same value back through brokered
read32 and the read-only userspace VMA. It also attempts an unclaimed
message-address dword write and reads back the original value unchanged.
Invalid range and unclaimed-register paths remain typed side-effect-blocked
results, while stale or released handles fail closed before any write and do
not return a write32 result payload. This does not add writable userspace
MMIO, arbitrary register writes, doorbells, host physical/IOVA exposure, IOMMU
programming, or a production provider-driver consumer.
Update 2026-05-10 19:29 UTC: DMABuffer.completeDescriptor now produces a
bounded userspace-visible completion effect on the manager-owned bounce-buffer
page. On the existing valid matching manager-inflight-completed path, after
active owner/epoch/slot validation and submitted-length checks pass, the
manager writes a deterministic byte pattern into the first completionLength
bytes of that slot’s bounce page before clearing the in-flight record. The
focused DMAPool smoke maps the same slot after completion and proves byte 0
and the last completed byte match the pattern while the next byte remains
unchanged. Invalid, stale, no-inflight, mismatched, length-exceeded,
mapped-live, and after-free paths keep their fail-closed labels and do not
write. This is bounce-buffer completion data only: no direct DMA, descriptor
ring mutation, CQ publication, host physical/IOVA exposure, IOMMU programming,
or production driver consumer is added.
Update 2026-05-10 16:47 UTC: the manifest-granted Interrupt.wait path now
has a fixed-table deferred waiter object for the current manager-attached
route. A masked wait still fails closed synchronously as
stale-pending-irq-masked / route-masked / side-effect-blocked, but an
unmasked wait now returns pending. The focused interrupt smoke submits that
wait nonblocking, drives the kernel once, observes it remains pending, calls
Interrupt.mask, then finishes the original wait as
interrupt-waiter-cancelled / route-masked /
waiter-completed-no-irq with wake_blocked=false, matching source/route
generations, unchanged delivery counts, and no real IRQ delivery. This is
no-IRQ cancellation waiter behavior only; hardware acknowledgement, MSI/MSI-X
programming, IRQ-delivery userspace waiters, and production interrupt dispatch
remain future work.
Update 2026-05-10 15:33 UTC: the manifest-granted DeviceMmio.map path now
maps a boot-preseeded MMIO BAR page into the caller’s userspace address space
as read-only, user-accessible, no-execute, no-cache PTEs. Accepted requests
return userspace-mmio-bar-mapped,
boot-preseeded-read-only-bar-page, user-vma-mapped, and a nonzero
page-aligned userspace address; writable, executable, unknown-protection,
zero-size, unaligned, out-of-BAR, overflow, and duplicate active map requests
return typed no-side-effect results. The focused smoke reads the same QEMU BAR
value through the returned userspace address and brokered DeviceMmio.read32,
then exercises explicit DeviceMmio.unmap, a no-op second unmap, remap after
unmap, and stale unmap failure after cap release. Release, drop,
driver-crash, and reset-disable cleanup revoke any borrowed user VMA before the
manager record is detached. This does not add writable MMIO, doorbells,
volatile register writes, host physical/IOVA exposure, post-userspace kernel
MMIO mappings, IOMMU programming, or a production provider-driver consumer.
Update 2026-05-10 14:12 UTC: DMABuffer.submitDescriptor /
DMABuffer.completeDescriptor now keep in-flight descriptor identity on each
live DMABuffer slot instead of one pool-global descriptor. The focused
DMAPool grant smoke proves slot 0 and slot 1 can both be in flight,
duplicate submit on the same slot still fails closed, mismatched completion
preserves both live slot descriptors, completing slot 0 decrements aggregate
DMAPool.info live_inflight from 2 to 1, explicit freeBuffer of the
remaining in-flight slot fails closed, and cap release of an in-flight slot
drains only that slot while preserving another slot’s in-flight accounting.
This remains bounded manager accounting: no descriptor ring is mutated, no CQ
entry is published, no direct DMA is attempted, no IOVA or host physical
address is exposed, and no IOMMU programming or production driver consumer is
added.
Update 2026-05-10 13:45 UTC: commit 3bbeb3d4 makes DMABuffer.unmap
explicitly remove the manager-owned bounce-buffer userspace VMA for the calling
process without freeing or scrubbing the bounce page, detaching the
DMABuffer record, changing DMAPool.info live buffer/page/in-flight
accounting, or touching real DMA state. The typed result reports
userspace-bounce-buffer-unmapped / single-page-bounce-buffer /
user-vma-unmapped when a live mapping is removed, and
dmabuffer-mapping-absent / no-user-mapping with no side effect on a second
unmap. The focused smoke maps slot 0 read-only, rejects a writable remap while
the read-only mapping remains live, unmaps it, proves the second unmap is a
typed no-op, remaps the same VMA writable, and verifies stale unmap after
freeBuffer fails closed like info, map, submitDescriptor, and
completeDescriptor. While VMA teardown is in progress, the cap records an
in-progress mapping state so concurrent map/free/release paths fail closed
instead of observing an absent mapping before the page-table unmap and TLB wait
complete.
Update 2026-05-10 12:49 UTC: commit 28e16431 extends the manifest-granted
DMAPool bounce-buffer allocator from two live result caps to three fixed
manager-owned slots. The focused smoke now proves slot 0, slot 1, and slot 2
can be live at the same time, DMAPool.info reports
live_buffers=3 live_pages=3 live_bytes=12288, a fourth allocation fails
closed as dmapool-already-attached / active-buffer-attached, and freeing
slot 0 while slots 1 and 2 remain live leaves two
committed/resident/unswappable pages. The same smoke reallocates slot 0 with a
fresh generation while slots 1 and 2 remain live, verifies the old cap stays
revoked, explicitly frees the reused slot 0 and slot 2 buffers, and releases
slot 1 while one bounded descriptor submission remains in flight so
parent-first DMAPool release completes only after the last live result
buffer detaches and drains manager-owned accounting. This is a fixed
three-slot bounce-buffer allocator; it still does not expose direct DMA, IOVA
or host physical addresses, descriptor-ring mutation, CQ publication, IOMMU
programming, hostile isolation coverage, or a production driver consumer.
Update 2026-05-10 11:44 UTC: commit 75beeeb8 extends the
manifest-granted DMAPool bounce-buffer allocator from one live result cap to
two fixed manager-owned slots. The focused smoke now proves slot 0 and slot 1
can be live at the same time, DMAPool.info reports
live_buffers=2 live_pages=2 live_bytes=8192, a third allocation fails closed
as dmapool-already-attached / active-buffer-attached, and freeing slot 0
while slot 1 remains live leaves one committed/resident/unswappable page. The
same smoke reallocates slot 0 with a fresh generation while slot 1 remains
live, verifies the old cap stays revoked, and releases the parent DMAPool
before the remaining buffers so the staged pool detach completes only after
the final DMABuffer release drains bounded in-flight accounting. This is a
fixed two-slot bounce-buffer allocator; it still does not expose direct DMA,
IOVA or host physical addresses, descriptor-ring mutation, CQ publication,
IOMMU programming, hostile isolation coverage, or a production driver
consumer.
Update 2026-05-10 10:56 UTC: commit 9659763e adds typed
DeviceMmio.write32(offset, value) admission on the manifest-granted
DeviceMmio cap. The kernel validates the active manager-attached handle,
owner/state, region and policy binding, DeviceMmioOperation::Write
authority, and 32-bit aligned in-BAR range before returning
admission-check-only, real-mmio-write-not-programmed, and
side-effect-blocked with register_write=blocked. The focused smoke
asserts accepted-shaped admission, unaligned/out-of-BAR/overflow
mmio-write32-range-invalid denials, stale-after-release failure, and two
sequential grant-cycle runs. This is an admission proof only; no volatile
register write, userspace BAR mapping, doorbell, host physical exposure,
IOMMU programming, or production driver consumer is added.
Update 2026-05-10 10:20 UTC: commit 3777a50d gives the
manifest-granted proof-buffer DMABuffer descriptor path an explicit
single in-flight descriptor identity. After one valid submitDescriptor,
the focused grant smoke now proves a duplicate submit for the same
queue/descriptor returns dmabuffer-descriptor-already-inflight with
side-effect-blocked, and a valid-shaped completion for a different live
descriptor returns dmabuffer-inflight-descriptor-mismatch with
side-effect-blocked. Both refusals preserve DMAPool.info
live_inflight=1; the matching completion still restores it to 0. This
is still bounded manager accounting only, not descriptor-ring mutation, CQ
publication, direct DMA, IOVA export, or a production driver consumer.
Update 2026-05-10 05:03 UTC: unsupported-protection DeviceMmio.map
requests now stay on the typed admission result path. The focused grant smoke
decodes range_result=mmio-map-prot-invalid,
range_reason=unsupported-map-prot, range_side_effect=side-effect-blocked,
addr=0, and unchanged manager identity fields instead of treating executable
or missing-read protections as a capability exception. This remains
admission-only evidence; no real BAR mapping, register access, doorbell write,
or host physical address exposure is added.
Update 2026-05-10 09:36 UTC: a harness-hardening follow-up, commit
7dfa1d65, strengthens make run-dmapool-grant around that bounded
userspace bounce-buffer map. The current focused smoke maps the first slot
generation read-only, proves the zeroed page is readable, and asserts a
same-cap writable remap attempt fails while that read-only mapping remains
live. It then writes and reads a marker through the second slot generation’s
read-write mapping while preserving the existing typed partial-range,
protection, and free/release cleanup assertions. This is a stronger proof of
the existing mapping permission contract, not new direct DMA, IOVA, host
physical, descriptor-ring, CQ, IOMMU, hostile-isolation, or production-driver
authority.
Update 2026-05-10 08:53 UTC: the manifest-granted DMABuffer.map path now
maps the single manager-owned bounce-buffer page into the caller’s userspace
VMA. Accepted readable full-page requests return
userspace-bounce-buffer-mapped, single-page-bounce-buffer, and
user-vma-mapped with a nonzero page-aligned userspace address while still
reporting real_dma_mapping=not-programmed, direct_dma=blocked, and
host_physical_user_visible=false. The feature slice proved the mapped page
could be reached from userspace, kept typed partial-range/protection denials
at addr=0, and proved DMABuffer.freeBuffer / cap release revoke the user
mapping before the bounce page is scrubbed and freed. This is userspace access
to the kernel-managed bounce buffer only; it does not expose IOVA, host
physical addresses, direct DMA, descriptor-ring mutation, CQ publication,
IOMMU programming, or a production driver consumer.
Update 2026-05-10 06:37 UTC: the manifest-granted DMAPool.allocateBuffer
and DMABuffer.freeBuffer path now uses production bounce-buffer labels for
the single-page userspace allocation/free authority. A valid 4096-byte
request still mints exactly one same-session DMABuffer result cap, but the
cap surfaces now report userspace_dmapool=manager-issued-bounce-buffer,
allocation=single-bounce-buffer-page,
record_pool=userspace-bounce-buffer-live, and
free_buffer=bounce-buffer-page; zero-live records report
zero-live-dmapool-bounce-buffer, and oversized requests fail as
size-exceeds-bounce-buffer. The backing frame remains ledger-owned,
resident, unswappable, scrubbed before frame free, and hidden from userspace.
That slice advanced allocation/free authority; the later DMABuffer.map
slice above adds userspace bounce-buffer VMA access while descriptor methods
still only update bounded manager accounting, and no IOVA, host physical
address, CQ publication, descriptor-ring mutation, IOMMU programming, or
production driver consumer is added.
Update 2026-05-10 04:40 UTC: invalid-size DMAPool.allocateBuffer requests
now use the same typed no-result-cap rejection shape as duplicate-active
requests. Zero-size and over-bounce-buffer calls return
result=dmapool-allocation-request-invalid, the exact request reason,
side_effect=side-effect-blocked, buffer_present=false, no result cap, and
no page mutation instead of relying on a capability exception string.
Update 2026-05-10 04:27 UTC: the manifest-granted Interrupt.mask and
Interrupt.unmask methods perform bounded route-state control over the
manager-attached dispatch slot. unmask changes claimed-masked to
driver-unmasked, mask changes it back to claimed-masked, both preserve
delivery counts, and release masks retained manager-grant routes before
detaching. The 2026-05-10 16:47 UTC status above supersedes the original
post-unmask wait placeholder with deferred no-IRQ cancellation behavior.
Update 2026-05-10 03:23 UTC: DMAPool.allocateBuffer now reports the
duplicate-active bounded proof-buffer rejection as typed result data instead
of requiring userspace to infer the label from a capability exception string.
When the first proof DMABuffer is still live, the second valid-size
allocation returns no result cap and reports result=dmapool-already-attached,
reason=active-buffer-attached, side_effect=side-effect-blocked, and
buffer_present=false; the smoke then proves DMAPool.info still reports one
live 4096-byte proof frame. This remains bounded proof-buffer guard evidence,
not real multi-buffer DMA allocation or production userspace DMA authority.
Update 2026-05-10 02:52 UTC: the focused DMAPool grant smoke now proves
active duplicate allocation is refused while the single proof buffer is still
attached. The smoke calls DMAPool.allocateBuffer a second time after the
first result DMABuffer becomes live, requires the failure path, then re-reads
DMAPool.info to prove the record still reports exactly one live 4096-byte
proof frame. This is a bounded proof-buffer guard only; it does not add real
multi-buffer allocation, DMA mappings, IOVA/physical exposure, or production
driver authority.
Update 2026-05-10 02:21 UTC: DMAPool.info now reports the live manager
record accounting for the bounded proof-buffer path. The manifest-granted
DMAPool starts as zero-live-dmapool-proof, moves to
synthetic-live-dmapool-proof with one live 4096-byte page while the
manager-attached DMABuffer result cap is active, and returns to zero-live
after typed DMABuffer.freeBuffer scrubs/releases the proof frame. The
focused grant smoke asserts the live and after-free accounting lines through
the typed runtime client. This remains bounded proof-buffer accounting only;
it does not implement real device-visible DMA mappings, IOVA/physical address
exposure, production descriptor side effects, or production page lifecycle.
Update 2026-05-09 23:52 UTC: the manifest-granted Interrupt skeleton now
also has bounded Interrupt.mask and Interrupt.unmask admission methods.
They validate the current manager-attached route through the existing
Mask/Unmask authority paths and return typed no-side-effect labels while
proving route state and delivery counts stay unchanged. This does not
implement real route mask/unmask mutation, hardware acknowledgement, blocking
userspace waiters, MSI/MSI-X table programming, or real interrupt delivery.
Update 2026-05-09 23:21 UTC: the manifest-granted Interrupt skeleton now
also has a bounded Interrupt.acknowledge admission method. It validates the
current manager-attached route through the existing Acknowledge authority
path and returns typed no-side-effect labels without acknowledging hardware,
waking waiters, or changing delivery counts. This does not implement real
interrupt acknowledgement, blocking userspace waiters, real mask/unmask route
mutation, or real interrupt delivery.
Update 2026-05-09 19:18 UTC: the manifest-granted Interrupt skeleton now
has a bounded Interrupt.wait admission method. It validates the current
manager-attached route, delegates to the shared pending-IRQ token validator,
and returns typed masked-route labels without waking a waiter or advancing
delivery counts. This does not implement blocking userspace waiters,
hardware acknowledgement, real route mask/unmask mutation, or real interrupt
delivery.
Update 2026-04-28 22:02 UTC: normal shell client @... grants reject explicit
badge N selector syntax and preserve delegated client endpoint identity when
the selector is omitted. Low-level and hostile-path tests still carry explicit
selector fixtures.
Update 2026-04-29 05:59 UTC: the focused chat manifest now routes the kernel
singleton chat_endpoint through init to the resident chat server, and the
focused chat shell no longer receives a manifest-forwarded chat service
export. Its normal chat authority comes from the broker-issued operator shell
bundle, matching the default and remote shell paths while the resident bot
keeps its manifest service grant.
Update 2026-04-29 07:35 UTC: Session-Bound Invocation Context core gates are landed. Implemented pieces include the process-session invariant, endpoint caller-session metadata, stale normal endpoint rejection, transfer scopes, field-granular disclosure gating, session expiry for broker-issued shell bundle caps, guest bundle narrowing, chat session-keyed membership, and Aurelian player state keyed by live endpoint caller-session metadata.
Update 2026-05-01 08:47 UTC: default password-authenticated local operator
sessions now mint with no wall-clock expiration. Short-expiry operator proofs
remain available by setting a non-default sessionLifetimes.operatorMs in the
manifest.
Update 2026-04-29 09:44 UTC: later Gate 4 cleanup put terminal output behind
live caller-session dispatch, bound shell-serviced stdio bridge waits to opaque
live caller-session metadata, removed remaining badge-facing service-common
handler APIs from normal chat paths, and widened non-adventure endpoint
caller-session opaque references to 128 bits while preserving the
scoped_ref ABI field as the low half.
Update 2026-04-29 10:20 UTC: non-adventure endpoint caller-session references
now use an entropy-backed boot secret and HMAC-SHA256 over a non-reused
endpoint service-scope id plus kernel session id. The ABI layout is unchanged,
but scoped_ref is no longer value-compatible with the old unkeyed hash.
References rotate on reboot and endpoint object replacement.
Update 2026-04-29 21:40 UTC: Gate 4 of the Session-Bound Invocation Context
milestone is implemented and verified on mainline. Commit faeff80 at
2026-04-29 21:39 UTC records the final closeout: normal chat, adventure,
terminal, and stdio paths no longer expose caller-selected receiver identity,
focused make run-adventure passed with session-bound Adventure/chat service
grants, and focused make docs passed after docs PDF render hardening. The
paper/status alignment now records the session-bound shared-service evidence
as landed. Remaining work in this area is future stable service-audit identity
across upgrades, not additional shared-service migration.
Implemented
Visible Milestone Proofs
- First Packet: commit
b56a5c1at2026-04-24 15:37 UTC. - First HTTP: commit
a4f1722at2026-04-24 16:47 UTC. - The Unprivileged Stranger: commit
d4016abat2026-04-22 16:35 UTC. - Native Cap Shell: commit
f554e88at2026-04-23 08:41 UTC. - Boot to Shell: commit
e5adafbat2026-04-23 13:39 UTC. - The Revocable Read: commit
7f19af2at2026-04-23 16:15 UTC. - First Chat: commit
2cd85a8at2026-04-24 00:13 UTC. - Local MUD: commit
add7f9bat2026-04-24 01:40 UTC. - Verified Core: commit
d43b691at2026-04-23 22:09 UTC. - Ring as Black Box: commit
da5f5e9at2026-04-24 03:13 UTC. - First AP Scheduler: commit
d88bca7at2026-04-25 11:31 UTC. - Telnet Shell Demo: commit
2834bfcat2026-04-25 20:25 UTC. Demo scope: plaintext, loopback-only research demo proving theTerminalSession/SessionManager/AuthorityBroker/RestrictedShellLauncherboundary over a real TCP socket; not a shippable Telnet service. Production remote shell tracks the SSH Shell Gateway in SSH Shell Gateway. - Multi-Process SMP Concurrency: commit
3fb89923at2026-04-30 09:45 UTC. - Default Run Telnet Wiring Retired: commit
367117beat2026-05-01 16:54 UTCremoves default host-local Telnet gateway forwarding and the default manifest service. Currentmake runstarts the foreground shell, chat service, and remote-session CapSet gateway, and forwards only the remote CapSet endpoint. The plaintext demo was later retired with qemu-only kernel TCP listener removal;make run-telnetnow exits before QEMU with a retirement diagnostic. Earlier commit7a155f4at2026-04-26 21:02 UTCmoves Telnet IAC filtering into the kernel socket terminal (best-effort silent swallow for the BSD/netkit clients we test against; no WONT/DONT replies) so a normal Telnet client lands at the shell prompt without a userspace pre-handoff recv, and refactors the gateway to loop accept/handoff/launch_shell/wait per connection so repeated host Telnet connections succeed. - Service Object Routing/Lifecycle: commit
a4655f0at2026-04-28 14:10 UTC;make run-service-object-routingproves trusted service-object minting, receiver-cookie dispatch, payload-spoof rejection, copy/move IPC transfer, nested spawn delegation, generation-checked receiver cookies, close/revoke rejection, and stale-cookie rejection after record reuse. This is now historical low-level coverage: the implemented Session-Bound Invocation Context baseline gives normal workload processes one immutable session context, and endpoint subject disclosure is private by default. - Session Context Invariant: commit
3edee90at2026-04-28 16:26 UTCaddsmake run-session-context, proving every spawned process has one immutable session context, raw child spawns inherit the caller context, a copiedUserSessioncap cannot relabel invocation context, and a broker-issued launcher can select a validated child context while mismatched profile requests fail closed. Commit3469c27at2026-04-28 16:54 UTCextends the proof so expired guest-session bundle refreshes fail closed at the broker. Commit687511aat2026-04-28 17:43 UTCadds privacy-preserving endpoint caller-session metadata and stale normal endpoint rejection: endpoint servers receive only a service-scoped opaque caller-session reference, epoch, and live/stale flags by default, spoofeduser/session/rolepayload labels do not affect the delivered invocation context, and calls after the process session expires fail before transfer preparation or enqueue. Commitf0cb74bat2026-04-28 18:38 UTCadds session-aware cap transfer scopes: same-session-only caps cannot cross into another session, explicitly shareable caps may cross and then invoke under the receiver session, and service-regrant-only caps require a trusted fixed-session broker/launcher path. Commit0f92d77at2026-04-28 19:33 UTCadds explicit endpoint subject disclosure gating: request without scope and scope without request expose no subject fields, request plus matching scope exposes only allowed fields, and broader requests are narrowed. Commitdc7ece4at2026-04-28 20:06 UTCmigrates the chat demo to session-keyed membership: chat member state is keyed by the endpoint caller-session reference, the focused chat manifest no longer assigns static chat badges, andmake run-chatproves operator-session chat clients plus rejected delegated endpoint relabeling. A follow-up review fix keeps join handles as request data and uses service-assigned visible member labels. 2026-04-28 20:48 UTCnarrows guest shell bundles: guest sessions require an explicit manifest guest seed, guest bundles receive no default chat/adventure service endpoint caps, and guest launcher policy comes from resource-profilelauncherProfilerather than the full manifest binary list.
Boot and Kernel Baseline
- Limine boots the x86_64 kernel in QEMU.
- The kernel initializes dual UART output, GDT, IDT, LAPIC, syscall MSRs, memory management, page tables, heap allocation, and the global capability registry. The legacy PIC/PIT path remains as a fallback when LAPIC timer setup or PIT-based calibration is unavailable.
- User page-table map, unmap, and protect operations are routed through a TLB shootdown helper keyed by address-space CPU residency. Remote targets get pending full-TLB flush generations plus vector-49 IPIs, and the sender waits for observed target completion after ring dispatch releases address-space, cap-table, and scratch locks. Deferred queue slots are reserved before page-table mutation, and drains flush the current CPU before waiting. Delayed maskable interrupt delivery is covered by syscall-entry and flush-before-user-return hooks. Scheduler CR3 handoff marks the current CPU resident, including AP cpu=1 during the first AP scheduler-owner proof, so remote shootdown targets become active when an address space has run on more than one CPU.
- AP cpu=1 can own scheduler/user execution under
-smp 2: APs register theirPerCpurecords, program LAPIC timers from the BSP calibration, update AP TSS.RSP0 during context switches, and enter the scheduler from the AP idle loop when AP timer setup succeeds. This proof keeps one scheduler owner; when AP cpu=1 is online with a programmed timer, the BSP stays in kernel idle so the process-wide capability ring is not executed concurrently. - The Multi-Process SMP Concurrency milestone is complete at commit
3fb89923(2026-04-30 09:45 UTC). The scheduler tree includes a narrow reschedule-IPI wake path for halted scheduler-owner loops, andmake run-smp-process-scalebuildscapos-smp-process-scale.isofromsystem-smp-process-scale.cue, runs repeated-smp 1,-smp 2, and best-effort 4-vCPU QEMU cases, parses compact verified timing lines, stores raw serial logs undertarget/smp-process-scale/<timestamp>/, and enforces the 1.6x median speedup threshold when KVM-backed evidence is available. The accepted run intarget/smp-process-scale/cycle-balanced-default/recorded1.608x1-to-2 speedup. A latercapos-benchnested-QEMU/KVM rerun on GCEn2-highcpu-8at commit0d89a91b(2026-04-30 11:09 UTC) pinned QEMU to host CPUs0,1,2,3and recorded1.873xcapOS 1-to-2 speedup; the matching Linux guest baseline under the same CPU pinning recorded1.934x. The same run recorded capOSsmp4=1111scaled cycles, or1.475xfrom the 1-vCPU baseline but slower than the 2-vCPU median, while Linux recorded3.774x1-to-4 speedup; capOS therefore still claims only the 1-to-2 milestone gate. The closeout also reran ordinaryrun-smokeandrun-spawnunder-smp 2, with logs intarget/smp2-smokes/, covering the default manifest, ring, thread lifecycle, park cleanup, generic child waits, and process exit. - The kernel creates its own page tables with per-section permissions and keeps the higher-half direct map for physical memory access.
- SMEP/SMAP are enabled when the QEMU CPU advertises support.
Code: kernel/src/main.rs, kernel/src/arch/x86_64/, kernel/src/mem/.
Validation: cargo build --features qemu, make run-smoke.
Process and Userspace Runtime
- Processes have isolated address spaces, one or more internal Thread records with per-thread kernel stacks and saved CPU context, CapSet bootstrap pages, capability rings, and local capability tables.
- ELF loading supports static no_std userspace binaries and TLS setup.
capos-rtowns the userspace entry path, allocator initialization, ring-client access, typed clients, result-cap parsing, and owned-handle release.capos-rtis the only source owner for the userspace_start, panic, global allocator, raw syscall, andcapos_rt_mainhandoff surfaces; a source check guards this split.targets/x86_64-unknown-capos.jsondefines the capOS userspace target for bootedinit,demos,shell, andcapos-rtruntime builds; the kernel default remainsx86_64-unknown-none.- The 7.1.0 in-process threading contract defines the split between
process-owned address-space/capability state and thread-owned execution
state, plus thread/kernel-stack quotas and generation-checked waiter
identity. 7.2.0 moved saved context, kernel stack, FS base, and block state
into
Threadrecords; 7.2.1 schedules and wakes generation-checkedThreadRefvalues; 7.2.2 adds process-localThreadSpawnerandThreadHandlecaps plusThreadControl.exitThreadfor create, join, detach, self-join rejection, exit-code observation, and last-thread process exit; and 7.2.3 adds private ParkSpace wait/wake with timeout, wake, and reserved waiter completion semantics. SharedParkSpace park-words remain future work.
Code: kernel/src/spawn.rs, kernel/src/process.rs, capos-rt/src/,
init/src/main.rs, demos/, shell/, targets/x86_64-unknown-capos.json,
tools/check-userspace-runtime-surface.sh.
Design: In-Process Threading, Park Authority.
Validation: tools/check-userspace-runtime-surface.sh,
make capos-rt-check, make init-capos-build, make demos-capos-build,
make shell-capos-build, make capos-rt-capos-build, make run-smoke,
make run-spawn.
Programming Language Support
- Native capOS Rust is the only implemented booted Rust language path. It uses
#![no_std],alloc,capos-rt, static ELF binaries, and thetargets/x86_64-unknown-capos.jsoncustom target. - Native C boots through the libcapos C-substrate (Phase 0;
make run-c-helloexercises Console + Timer + EntropySource + a 4 KiB anonymous VM roundtrip) and through the POSIX adapter (Phase P1.2 Phase B; the historicalmake run-posix-dns-smokeresolvedexample.comover the qemu-only kernelUdpSocketcap via QEMU slirp DNS at 10.0.2.3:53, but that target is retired after kernel socket-owner removal;make run-posix-pipe-smokeandmake run-posix-spawn-smokeexercise pipe, fork-for-exec, directposix_spawn, minimal file actions, read, and waitpid overProcessSpawner/Pipe; the Console-backed stdio proof landed at commitaa6a56d7(2026-05-13 11:03 UTC) andmake run-posix-stdio-smokeexerciseswrite(1, ...)andwrite(2, ...)over a granted Console while provingread(0, ...)stays closed without a stdin grant; the file/directory fd closeout landed at commitf97d9833(2026-05-23 06:23 UTC) andmake run-posix-fileexercisesopen(),write(),lseek(),read(),opendir(),readdir(), andclosedir()over a granted rootDirectory;make run-posix-printfexercises the focused printf/string subset: formatted output, string/mem, numeric conversion, and ctype helpers;make run-posix-signal-timeexercises Timer-backedtime,nanosleep, andsleepplus the documented fail-closed signal-delivery stubs). Both bypass WASI – they are static ELF binaries linked againstlibcapos.aand, for POSIX smokes,libcapos_posix.a. POSIXposix_spawn()accepts argv/envp for source compatibility but does not deliver them until LaunchParameters / environment support lands. Broader C/libcapos surface and full POSIX adapter scope remain future design. - Sandboxed
wasm32-wasiis the first booted WASI-hosted language path. Phase W.5 (filesystem;capos-wasm/src/wasi/fs.rs) closed and is exercised bymake run-wasi-fs: thewasm-hostinstalls the manifest-granted rootDirectorycap as a preopened fd, the WASI payload writes and reads back a file throughpath_open/fd_write/fd_close/ re-open /fd_read, and the preopen sandbox refuses absolute paths and parent-escape..segments. The WASI host adapter closed Phase W.4 at commitb0f6939f(2026-05-07 20:09 UTC); Phase W.3 closed at commitca41ecc1(2026-05-07 18:29 UTC; the surrounding W.3 narrative stamps from2026-05-07 18:25 UTCpredate the feat commit by a few minutes); Phase W.2 closed at commit7bfcb1d8(2026-05-07 10:53 UTC): thewasm-hostuserspace binary (capos-wasm/ standalone crate over vendored wasmi 1.0.9) hosts WebAssembly modules whosewasi_snapshot_preview1imports are backed by typed capOS capabilities (Console + Timer + BootPackage, the per-instance argv text grant from W.3, the 2026-05-13 bounded environment text grant throughinitConfig.init.wasiEnv, and the optional W.4EntropySourcecap looked up from the per-instance CapSet under the well-known namerandom).make run-wasi-hello-rust,make run-wasi-hello-c,make run-wasi-cli-args, andmake run-wasi-envare the regression smokes;make run-wasi-randomis the W.4 granted gate (the payload reads N=64 bytes throughrandom_getand prints[wasi-random] entropy_bytes=64 entropy_bound_ok=true) andmake run-wasi-random-ungrantedis the matching refusal gate (the same payload observesERRNO_NOSYS = 52from the closed-fail branch when the manifest withholds the grant). A 2026-05-13 authority-free compatibility slice addsmake run-wasi-stdio-fd, whose direct-import payload provesclock_res_get(MONOTONIC),sched_yield,fd_fdstat_get(1/2), andfd_seek(1/2)no longer returnERRNO_NOSYS;make run-wasi-envproves one granted environment value reaches a WASI payload throughenviron_get/environ_sizes_get;make run-wasi-preview1-refusalsremains the storage/socket fail-closed gate forpath_open,fd_read,sock_send, andsock_recv. Wall-clock support stays deferred until capOS has a typedWallClock/RealTimeClockcap;clock_time_get(CLOCKID_REALTIME)keeps returningERRNO_NOSYSuntil that cap lands. - Rust
std, C++, Go, Python, JavaScript/TypeScript, and full POSIX shell/utilities are not implemented as supported capOS runtime paths. - Lua has a Phase 0 in-tree capability-aware Lua-subset interpreter under
demos/lua-smoke/(gated bymake run-lua-smoke); it validates the long-term capability-userdata host API design but is NOT a PUC Lua dialect-compatible runner. Dialect compatibility waits on the future C/libcapos PUC port. - The planned compatibility story is split by adapter type rather than one generic “compatibility layer”: native runtime adapters for languages such as Rust and Go, capability-native bindings over Cap’n Proto interfaces, POSIX compatibility adapters over scoped file/socket/process caps, and WASI host adapters backed by capabilities.
Design: Programming Languages, Userspace Runtime, Userspace Binaries, Go Runtime, Lua Scripting.
Validation: current native Rust validation uses
tools/check-userspace-runtime-surface.sh, custom-target userspace builds, and
the runtime QEMU smokes listed above. Native C/POSIX validation is through the
focused make run-c-* and make run-posix-* smokes named in this section,
including make run-posix-file for the File/Directory fd surface. WASI
filesystem validation uses make run-wasi-fs (Phase W.5 preopened-directory
round-trip and sandbox proof).
Capability Ring and IPC
- The shared ring ABI supports CALL, RECV, RETURN, RELEASE, CANCEL, NOP, and compact ParkSpace PARK/UNPARK transport operations.
cap_enterprocesses submissions and can block until completions arrive or a timeout expires.- Endpoints route ring-native IPC between processes.
- Direct IPC handoff lets a blocked receiver run before unrelated round-robin work after a matching CALL arrives.
- Transport errors and application exceptions are surfaced through CQEs and typed runtime client errors.
- Ordinary capability implementation errors, revoked ordinary/endpoint use,
live endpoint target errors after endpoint identification, and endpoint
RETURN application failures use serialized
CapExceptionpayloads when a caller result buffer can safely receive one. No-payload application failures reportCAP_ERR_APPLICATION_EXCEPTION_TRUNCATED; malformed transport metadata and unsafe result-buffer paths remain transport errors. - Endpoint RETURN can propagate a serialized
CapExceptionfrom a userspace endpoint server to the original cross-process caller. debug_tapbuilds export metadata-onlyringtap:records for observed SQEs and posted CQEs on the QEMU/debug UART. The format is fixed, bounded, and deliberately recordspayload_len = 0until a separate payload-capture authority lands.tools/ringtap-viewer/parsesringtap:logs into SQE/CQE summaries and can decode authorized Cap’n Proto payloads forCapException,TerminalSession.readLineparams, andProcessHandle.waitresults when future tap output includespayload_schemaandpayload_hexfields.make run-ringtap-failing-callboots the default shell smoke withdebug_tap, drives the knowntyped-callmethod-99 launcher failure, runs the viewer over the captured kernel log, and leaves offline inspection logs intarget/ringtap-failing-call-*.log.
Code: capos-config/src/ring.rs, kernel/src/cap/ring.rs,
kernel/src/cap/endpoint.rs, kernel/src/debug_tap.rs,
capos-rt/src/ring.rs, capos-rt/src/client.rs,
tools/ringtap-failing-call-smoke.sh, tools/ringtap-viewer/.
Validation: cargo test-ring-loom, make run-smoke, make run-spawn,
make run-smoke CARGO_FLAGS='--features debug_tap',
cd tools/ringtap-viewer && cargo test, make run-ringtap-failing-call.
Capabilities
Implemented kernel capabilities include:
- Console for debug UART output.
- TerminalSession for the separate session UART with line input/output,
bounded
readLine, visible/hidden echo, structured cancellation, and a single move-only foreground holder. - BootPackage for read-only, chunked boot manifest reads from init.
- FrameAllocator for typed
MemoryObjectframe ownership grants. - MemoryObject for owned physical frame ranges, caller-local map/unmap/protect, and final backing release after cap/mapping teardown.
- Endpoint for IPC rendezvous.
- VirtualMemory for anonymous user page map, unmap, and protect operations.
- Timer for monotonic tick/time reads and bounded sleep completions through the capability ring.
- ThreadControl for runtime-owned FS-base get/set and current-thread
exitThreadon the current thread. - ThreadSpawner and ThreadHandle for process-local in-process thread creation, one-shot join, exit-code observation, detach-on-release, and retained-status cleanup.
- ParkSpace for process-local private park wait/wake on 32-bit userspace words, with per-thread blocking and reserved waiter CQE credits.
- ProcessSpawner and ProcessHandle for init-driven child process creation and wait semantics.
- Retired NetworkManager, TcpListener, and TcpSocket qemu-only kernel socket capabilities. Their entry points now fail closed; the active TCP/UDP socket authority shape is the Phase C userspace network-stack path.
MemoryObject holders and anonymous VirtualMemory mappings charge the same
per-process ResourceLedger::frame_grant_pages quota. Mapping a held
MemoryObject records borrowed address-space pages and reserves mapping quota
until unmap so backing frames cannot stay pinned after the cap charge is
released.
Code: kernel/src/cap/console.rs, kernel/src/cap/terminal_session.rs,
kernel/src/cap/boot_package.rs, kernel/src/cap/frame_alloc.rs,
kernel/src/cap/endpoint.rs, kernel/src/cap/virtual_memory.rs,
kernel/src/cap/timer.rs, kernel/src/cap/thread_control.rs,
kernel/src/cap/thread_handle.rs, kernel/src/cap/process_spawner.rs,
kernel/src/cap/network.rs.
Validation: make run-smoke, make run-memoryobject-shared, make run-spawn,
make run-shell, make run-terminal, make run-net, cargo test-lib.
Capability Transfer and Release
- IPC CALL and RETURN support sideband transfer descriptors.
- Copy and move transfer are implemented.
- Move transfer reserves the sender slot until destination insertion and commit.
- Transfer result caps carry interface ids to userspace.
CAP_OP_RELEASEremoves local capability-table slots. Runtime owned-handle drop queues one local release, andRuntime::flush_releases()forces queued releases when code cannot wait for the next ring-client acquisition/drop.
Code: kernel/src/cap/transfer.rs, kernel/src/cap/ring.rs,
capos-lib/src/cap_table.rs, capos-rt/src/ring.rs.
Validation: cargo test-lib, make run-smoke.
Manifest Tooling and Smokes
tools/mkmanifestturnssystem.cueinto a Cap’n Proto boot manifest.- The build uses repo-pinned Cap’n Proto and CUE tool paths through the
Makefile; direct
mkmanifestinvocation also rejects missing, unpinned, or version-mismatched CUE compilers.mkmanifest cue-to-capnpextends the same pinned-tool policy to general CUE-authored data messages: it exports CUE as JSON, validatesCAPOS_CAPNP, and delegates arbitrary specified schema-rooted struct serialization tocapnp convert json:binary. - Default scripted QEMU smoke still uses the focused shell-led
system-smoke.cuepath: anonymous session on boot,loginprompting for username before hidden password entry, generic failed-auth output on a wrong password, successful operator login, broker upgrade to the operator bundle, child terminal isolation, stale-handle release, single-capos-shellinit boot, and clean halt. The default operator-facingsystem.cuepath is init-owned and is exercised bymake run. system.cueis now the default init-owned manifest. The kernel starts only the firstinitservice, and init startscapos-shell, the remote-session CapSet gateway, and the default demo services from the manifest service graph. The shell receives terminal/creds/sessions/audit/broker caps and mints its own anonymous session.system-shell.cueis the focused anonymous-shell proof (no verifier), which exercises the shell in its anonymous bundle and asserts that the anonymous launcher rejects spawns because its allowlist is empty.system-chat.cueis the focused First Chat prototype proof. It starts a residentChatendpoint service on the kernel singletonchat_endpoint, a resident bot participant, and the shell;make run-chatdrivesrun "chat-client"with explicitStdIOplus the broker-issuedchatendpoint grant, sends one line, and checks that the bot reply is printed by the foreground client.system-adventure.cueis the focused adventure prototype proof. It keeps adventure out of shell builtins and drivesrun "adventure-client"through explicitStdIO,adventure, andchatendpoint grants. See the Aurelian Frontier (proof slice) page for the current mission, commands, and transcript coverage.system-paperclips.cueis the focused clean-room Paperclips-style terminal demo proof. As of commit532207c1(2026-04-30 20:54 UTC), it boots Paperclips server services plus a terminal client. The server owns generated content, game state, regular timer cadence, unlock checks, game-rule mutation, and proof-command gating; the terminal client receives explicitStdIOplus aPaperclipsGameendpoint and renders server-modehelpfrom the server’s structured command specs. Commite9ae4e97(2026-04-30 22:02 UTC) adds structured plain-status snapshots, so server-mode plainstatusis rendered from the server’s structuredPaperclipsStatusSnapshot. Commit32462e9f(2026-04-30 22:32 UTC) adds the structured project-list follow-up: server-provided project entries for terminal-rendered plainprojects, whileproject <id>remains a raw text request that mutates server-owned game state.make run-paperclipsfirst proves that normal server authority rejectsrun <ms>fast-forward plus rejection of a forgedproof_accelerator: @timergrant, then relaunches against the proof server endpoint with the focused manifest’s explicitproof_acceleratorcap for transcript acceleration. The accelerated proof drives one-at-a-time manual production, locked-purchase and insufficient-funds refusal output, bulk-manual rejection, high-price zero-demand sale refusal, no-wire manual production refusal, explicit sales, immediate repeat-sale cooldown refusal, repeatable marketing, autoclipper unlock, real-time automation, generated Cap’n Proto content loading, first project completion, scaled business-phase production, thedesign-searchandforecast-engineproject chain,survey-drones, the visible== autonomous phase ==transition, then representative autonomous drone/factory scaling with local-matter conversion and additional clip production,mesh-coordination,seed-probes, the visible== cosmic phase ==transition, one bounded probe interval with cosmic matter conversion, probe replication, and additional production, then asserts a compactstatus --jsonmachine-readable status line and verifiesfinal-conversionremains locked before clean process exit through the native shell. Active schema, content, rules, and smoke sources use clean-room Strategy internals, and host tests reject explicit zero-count purchases without mutating state. Host tests cover the one-real-time-hour non-completion property under a generous normal-play creativity upper bound.demos/service-common/holds the shared caller-session endpoint loop and chat actor bootstrap/polling helpers used by the chat/adventure resident services, chat bot, and adventure NPC processes. New shared endpoint loop code usesEndpointUserData; the old badge-named user-data alias remains only for compatibility while peer branches migrate. Shared event queues remain deferred until another service has queue needs matching chat history/inbox behavior.system-spawn.cueremains the focused ProcessSpawner smoke for endpoint, IPC, VirtualMemory, Timer, ThreadControl, FrameAllocator cleanup, and hostile spawn inputs.make run-spawnasserts that the kernel boot-launches only the standaloneinit, that init validates BootPackage metadata, and that the init-owned manifest executor spawns and waits for every focused child service, including thetimer-smokemonotonic now/sleep proof,timer-floodper-process Timer sleep quota proof,runtime-fs-baseruntime-owned FS-base proof,single-thread-runtimeVirtualMemory plus Timer runtime checkpoint, andthread-lifecyclein-process thread/park proof.
Code: tools/mkmanifest/, system.cue, system-chat.cue,
system-adventure.cue, system-paperclips.cue, system-spawn.cue, demos/,
capos-rt/.
Validation: cargo test-mkmanifest, make generated-code-check,
make run-smoke, make run-chat, make run-adventure,
make run-paperclips, make run-spawn.
Partially Implemented
Login Boot and Init-Owned Spawn
Default make run now uses the init-owned default manifest. The kernel
validates the kernel-owned boot boundary, boot-launches standalone init, and
leaves the service graph plus login/session/broker flow in userspace. Init
starts the foreground capos-shell service, resident demo services, and the
host-local remote-session CapSet gateway; make run forwards host-local TCP
to guest port 2327 for the remote CapSet path only. The foreground shell mints
its own anonymous UserSession on boot; login and setup
commands drive CredentialStore/SessionManager/AuthorityBroker to upgrade
the session in place. Local password login is username-aware on the ordinary
foreground shell path, while durable multi-account credential storage remains
future work.
The plaintext Telnet gateway was only a focused
make run-telnet / system-telnet.cue research demo. That target is retired
after qemu-only kernel TCP listener removal, and the gateway demo, its
manifest, and the kernel SocketTerminalSession shim are removed; use the
in-guest login smokes for current shell coverage and rebuild any
socket-backed terminal proof on the Phase C userspace network stack before
using it as validation.
The focused init-owned spawn path remains under make run-spawn. There the
kernel boot-launches init with Console, BootPackage, and ProcessSpawner.
Parent endpoint facets used for later service-sourced imports are returned by
ProcessSpawner during child spawn, not granted at boot. init
performs metadata-only manifest validation, resolves kernel and service cap
sources, spawns children through ProcessSpawner, records exports, waits for
children, and reports failures through Console output. The QEMU target now
asserts the single-init boot markers, the three-cap init bundle, BootPackage
validation, child exit records, manifest child waits, spawn-loop completion,
and clean halt.
Measurement startup now follows the same boundary. make run-measure uses a
focused system-measure.cue manifest where the kernel boots standalone init
with Console, BootPackage, and ProcessSpawner, and init spawns ring-nop with
Console, FrameAllocator, the measurement-only NullCap, and the measurement-only
ParkBench cap through ProcessSpawner grants. It also spawns
thread-lifecycle with ThreadControl, ThreadSpawner, ParkSpace, and a
measurement marker cap. The demos print compact versus generic park-shaped
failed-wait/empty-wake cycle averages plus real ParkSpace blocked/resume
cycle averages before the measure-feature kernel prints segmented dispatch
counts, total cycles, and averages for SQE processing, validation, cap lookup,
capnp decode, method body dispatch, CQE posting, and waiter wake/check. Kernel
bootstrap now loads only
initConfig.init and validates only the kernel-owned manifest boundary;
mkmanifest and init own initConfig.services graph validation for focused
BootPackage executor manifests.
SSH Shell Gateway
The SSH Shell Gateway proof targets are implemented, covering the authority prerequisites and fixture authentication path that precede an encrypted SSH transport. Bounded QEMU smokes exist for:
- Host-key fixture signing (
make run-ssh-host-key): a development-only non-productionSshHostKeycap returns public metadata, signs bounded fixture exchange hashes for QEMU proof, fails wrong-algorithm requests closed, and does not leak the private host-key seed. - Authorized-key lookup (
make run-ssh-authorized-key): a manifest-seededAuthorizedKeyStorecap accepts configuredssh-ed25519public keys mapped to seed-account principals, denies unknown, disabled, and unsupported-algorithm keys, and does not leak private key material. - Public-key session minting (
make run-ssh-public-key-session,make run-ssh-public-key-auth):SessionManager.sshPublicKeyrechecks configured key records, verifies a boundedssh-ed25519signature over fixture authentication bytes, mints apublicKeyUserSessiononly after the signature succeeds, and logs stable audit reason codes for each denial path without leaking principal or profile metadata.UserSession.auditContextfails closed after logout through the sameensure_session_liveguard asinfo(). - Unsupported feature policy (
make run-ssh-feature-policy): acapos-config::ssh_policysurface classifies password auth, exec requests, SFTP, direct-tcpip, agent/X11 forwarding, env import, and multiple session or shell channels into stable audit reason codes; all denied paths produceevent=session result=denied reason=policyaudit records. - Restricted shell launcher (
make run-restricted-shell-launcher): a manifest-declaredRestrictedShellLaunchercap launches onlycapos-shell, injects supplied terminal/session caps plus child-local stdio, rejects session/profile mismatch and kernel-sourced or dangerous pass-through grant attempts, and strips hidden process-supervision result caps. - Bounded terminal-host proof (retired):
make run-ssh-gateway-terminal-hostwired scopedTcpListenAuthoritylisten, authorized-key lookup, public-keyUserSessionminting, broker profile matching, socket-to-TerminalSessionconversion, and restricted shell launch over a host-local plain TCP connection. It sat on the qemu-only kernel socket owner and the kernelSocketTerminalSession, both of which are retired with the userspace network-stack migration; the smoke now exits with a retirement diagnostic and a future terminal host must target the userspace network stack.
Encrypted SSH packet transport, OpenSSH-compatible key exchange and channel
handling, full SSH userauth transcript validation, channel binding,
TerminalSessionFromByteStream terminal-factory wiring, a terminal host over
the userspace network stack, and a production OpenSSH harness remain open.
The landed proofs use development/fixture key material; they are not a
production SSH service and are not safe for non-loopback deployment.
Design: SSH Shell Gateway.
Code: kernel/src/cap/ssh_host_key.rs, kernel/src/cap/authorized_key_store.rs,
kernel/src/cap/restricted_launcher.rs, capos-config/src/ (ssh_policy),
demos/ssh-*/, tools/qemu-ssh-*-smoke.sh.
Validation: make run-ssh-host-key, make run-ssh-authorized-key,
make run-ssh-public-key-session, make run-ssh-public-key-auth,
make run-ssh-feature-policy.
Hardware and Networking
The hardware bring-up path has bounded ACPI RSDP/RSDT/XSDT, MADT, MCFG, DMAR,
and IVRS diagnostics plus reusable PCI config-space access through legacy I/O
ports and Q35 PCIe ECAM, and the x86 path programs masked MADT-backed I/O APIC
routes for legacy IRQs while honoring source overrides. IOMMU reporting is
policy-only: malformed DMAR/IVRS structures fail closed, DMAR DRHD include-all
or single-hop PCI endpoint device-scope metadata can mark retained DMA-capable
PCI functions as IOMMU-attached/covered; bridge and multi-hop scopes remain
diagnostic-only until PCI topology traversal exists, and include-all fallback
fails closed when retained DMAR coverage metadata is capped. Direct DMA remains
blocked with zero trusted domains, and every retained DMA-capable prototype
function requires bounce buffering. The current staged domain-policy proof
also reports that future claimed DMA-capable devices use a
device-manager-owned per-device domain or trusted sharing group, exported
device addresses are IOVA-only, host
physical addresses are not user-visible, remapping tables are not programmed,
and production userspace hardware authority is still blocked. That
blocked-direct-DMA admission decision now runs through the host-tested
capos-lib::device_authority helper used for device-authority validation, so
the PCI proof line and diagnostics mirrors share the same fail-closed labels
for absent, malformed, unsupported, or retained-capped remapping metadata.
Active device-manager DMAPool policy records also carry a software
remapping-domain ledger staging record. The QEMU lifecycle/imported-live
proofs bind it to the active record and matching handle with
diagnostics-only static ACPI/PCI coverage, remapping_domain_owner=device-manager,
remapping_domain_ready=false, remapping_tables=not-programmed,
iova_export=disabled-future-only, direct_dma=blocked, and
host_physical_user_visible=0; no remapping tables, direct-DMA trusted
domains, host physical addresses, or IOVAs are exposed.
Bounded
manifest grants now exist for DMAPool, DeviceMmio, and Interrupt:
DeviceMmio exposes bounded .info, read-only userspace .map /
.unmap over boot-preseeded BAR pages, brokered read-only .read32 backed by
the same boot-preseeded 64-page kernel mapping cache, and bounded brokered
claimed-register .write32; Interrupt exposes bounded .info plus
admission-only .wait, .acknowledge, .mask, and .unmask, and
DMAPool reports conservative .info status and can mint eight fixed
manager-attached bounce-buffer DMABuffer result caps via
request-shaped allocateBuffer. Valid one-page bounce-buffer requests report
requested bytes, allocated bytes, page count, and request labels; zero-size
and over-bounce-buffer requests fail closed as
dmapool-allocation-request-invalid before result-cap or page mutation.
The same DMAPool.info result now exposes the attached manager record’s
owner/pool labels, live buffer/page/byte counts, in-flight submissions, and
committed/resident/unswappable/scrub-before-release flags. The bounded
manifest smoke proves zero-live accounting before allocation, one, two, three,
and eight live 4096-byte bounce pages while result DMABuffers are active, a
ninth-allocation full-pool rejection, restoration to the existing four-buffer
descriptor working set after freeing slots 4 through 7, three-live accounting
after freeing one descriptor-test slot, slot-0 reuse while the other three
descriptor-test slots remain live, and zero-live again after all typed
DMABuffer releases complete.
That DMABuffer now supports typed .freeBuffer, single-page userspace
bounce-buffer .map and .unmap, and
bounded manager-accounted .submitDescriptor / .completeDescriptor. The
.map path validates the live bounce-buffer epoch, accepts readable full-page
requests, maps the manager-owned bounce page into the caller’s userspace
address space, returns userspace-bounce-buffer-mapped,
single-page-bounce-buffer, user-vma-mapped, and a nonzero userspace
address, rejects zero-size, partial-page/out-of-range, and
executable/unknown protections with typed range/protection labels, and still reports
real_dma_mapping=not-programmed, direct DMA blocked, and no host physical
address exposure. The .unmap path validates the live buffer record first,
removes only that cap-owned borrowed VMA for the caller process, reports a
typed no-op when no mapping is present, and leaves page lifetime plus pool and
descriptor accounting unchanged. The descriptor paths carry
request labels plus the bounded proof counts (queue_count=4,
descriptor_count=8, buffer_bytes=4096): valid submits return
manager-inflight-recorded and raise the attached DMAPool.info
live_inflight count to 1, valid completions return
manager-inflight-completed and restore it to 0, and valid completions with
no outstanding submission return dmabuffer-no-inflight-submission with
side-effect-blocked. The manager record also tracks the single live
descriptor identity: duplicate submits for that live queue/descriptor return
dmabuffer-descriptor-already-inflight, and valid-shaped completions for a
different descriptor return dmabuffer-inflight-descriptor-mismatch; both
paths leave live_inflight=1 until the matching completion arrives.
Out-of-range queues/descriptors, zero submit lengths, submit lengths beyond
the bounce buffer, and completion lengths beyond the bounce buffer still fail
closed as dmabuffer-descriptor-request-invalid without mutating the counter.
The default manager-accounting descriptor path also preflights result
serialization before mutating accounting, and cap-table release drains bounded
in-flight accounting before detaching so a removed userspace cap cannot strand
the bounce buffer. The selected provider-TX exception for
make run-ddf-provider-consumer is narrower and runtime-visible: queue 1
submits may publish the selected eight-entry TX queue depth, descriptors 0..7,
into the existing kernel-owned virtio-net TX ring after the same DMABuffer
authority, bounce-scrub, and live notify_mmio policy gates; that selected path
then rings exactly one notify doorbell per accepted provider descriptor and
lets DMABuffer.completeDescriptor consume the stored software descriptor
generation from the real TX used ring before clearing each manager in-flight
record. Live tx_interrupt.wait calls over that selected route can observe
the ordered bounded completion events, and provider tx_interrupt release
proves bounded teardown by draining seven incomplete descriptor handoffs or
retiring seven delivered-but-unacked completion events with no pending provider
waiters. Wrong queue, stale
DMABuffer, stale notify policy, inflight publication, duplicate completion,
and stale tx_interrupt issue paths still fail closed before their guarded
side effects.
DMABuffer.freeBuffer, cap release, driver-crash, reset-disable, and drop
cleanup still revoke any remaining user mapping before scrub/free. None of these methods
program direct DMA, publish arbitrary CQ entries, transfer full virtio-net
ownership, or expose host physical addresses.
Allocations beyond the eight fixed bounce-buffer slots, DMA map/submit/complete
side effects outside the selected provider-TX proof, writable userspace BAR
mappings, arbitrary MMIO writes and doorbells, unbrokered register access,
blocking IRQ wait beyond the bounded selected-route completion waiter, real
hardware acknowledgement, hardware IRQ ownership, hardware mask/unmask,
hardware MSI/MSI-X programming, and general IRQ delivery remain blocked;
parent-first
DMAPool release
defers until all live DMABuffer slots release, or successful DMABuffer
driver-crash/reset-disable cleanup frees the remaining bounce pages and
completes the staged zero-live pool detach. make run-hardware-grant-cycle
proves
sequential DeviceMmio/Interrupt skeleton grants can release and reacquire
fresh DeviceMmio mapping generations while the Interrupt grant retains its
source generation and refreshes only the route generation, and its read-only
HardwareAuditLog.snapshot check decodes those two-cycle audit records through
the current volatile unsigned audit surface. make run-hardware-audit-interrupt-waiter also decodes recent boot-time
DmaBuffer, DmaPool, and Interrupt driver-crash / reset-disable
lifecycle records through that typed volatile snapshot path, and its cursor
snapshot requests from the first older retained DeviceMmio lifecycle
sequence to decode rows outside the default latest 16-record tail. make run-hardware-audit also proves below-oldest cursor clamping and past-end empty
cursor metadata on the overflowed volatile ring, and the QEMU-only local-ring
proof now checks those same cursor edges without mutating live audit records.
Unsafe retained
metadata fails closed, and prototype devices remain kernel-owned
bounce-buffer-only. The device-manager DMAPool
attachment path now stores
that explicit bounce-buffer policy on the attached pool record and the QEMU
lifecycle/imported-live proofs read it through the active manager record and
matching DmaPoolHandle; no new cross-manager lock was added, so the existing
PCI_DEVICE_MANAGER before DEVICE_INTERRUPT_ROUTES order remains unchanged.
PCI memory-BAR subregions are validated and mapped through a shared kernel
helper before in-kernel drivers use device MMIO, and PCI capability walking
reports non-programming MSI/MSI-X metadata for the QEMU virtio-net function.
make run-pci-nvme now applies the same metadata-only PCI path to a QEMU NVMe
controller: class/subclass/programming-interface, memory BAR, capability, and
MSI-X metadata are visible, while userspace device authority, DMAPool,
DeviceMmio, Interrupt, controller init, admin queues, I/O queues, MMIO
doorbells, and direct DMA remain not started or blocked.
make run-diagnostics now boots a
feature-gated COM1 early-boot diagnostics prompt before capability,
scheduler, timer, manifest, or userspace startup, with bounded commands for
status, CPU, memory, ACPI, PCI, IRQ, timers, devices, logs, reboot placeholder,
and halt. The ACPI and PCI diagnostics commands now also print bounded
MADT/MCFG/DMAR/IVRS record details, PCI function/config-header summaries, BAR
summaries, capability counts, MSI/MSI-X summaries, and bounded PCI
DMA-attachment policy counters/details when present; devices reports PCI
totals plus network/storage/display/bridge class counts and mirrors the
current DMA-domain policy without owner identity: direct DMA is blocked,
trusted-domain and ready-domain counts are zero, remapping tables are not
programmed, future exported device addresses are IOVA-only, userspace device
authority is not started, and prototype devices remain kernel-owned
bounce-buffer-only. The current runtime-state diagnostics slice also attaches
QEMU virtio-net plus the second virtio-rng proof device before the prompt and
reports virtio_net=ready,
bounded RX/TX virtqueue ring state, MSI-X route/vector/counter state, live
buffer state, the kernel-owned DMA owner/pool ledger, and device interrupt
route aggregates plus per-route delivery counters. Future driver extensions,
production teardown/lifecycle diagnostics, IOMMU remapping table programming,
production DMAPool, and userspace driver authority remain planned. The QEMU
virtio-net path has a
make run-net boot target, modern virtio PCI transport discovery for the
common, notify, ISR, and device-specific MMIO regions, feature negotiation, and
RX/TX split-virtqueue initialization, a TX descriptor completion proof, minimal
Ethernet ARP resolution, and ICMP echo validation against the QEMU user-mode
gateway. It no longer wraps the virtio driver in a kernel smoltcp interface or
performs a kernel TCP HTTP GET; the remaining run-net evidence is a
lower-layer QEMU fixture, and TCP/UDP socket proof lives under the Phase C
userspace network-stack gates.
QEMU currently exposes a transitional 1af4:1000 virtio-net function with
modern vendor capabilities; capOS accepts that shape only through the modern
capability layout and now selects a usable MSI-X capability for config/RX/TX
table entries, records kernel-owned MSI-X sources for config/RX/TX in the
device interrupt dispatch table, programs those entries through the typed PCI
MSI-X table helper using a bounded first-fit LAPIC device MSI vector pool,
lets the in-kernel virtio-net owner claim and unmask only its routes, assigns
the virtio common/config and queue MSI-X vector fields, and keeps descriptor,
ARP, and ICMP fixture evidence in make run-net after the kernel L4 owner is
retired. Device-autonomous virtio-net MSI-X delivery is covered by the dedicated
userspace-provider gates. A masked lifecycle probe on the unused virtio-net
MSI-X table entry proves claimed-route
reassignment, stale old-route rejection, old-vector unregistered delivery,
reassigned-vector masked delivery, unsupported-vector delivery, and release
before the live routes are registered. The QEMU virtio-rng second-device
metadata path also exercises the device-manager ownership model, the claimed
MSI-X route handoff, and a bounded teardown-trigger contract for cap release,
process exit, driver crash, reset/disable escalation, interrupt waiter, future
DeviceMmio, and future DMAPool trigger labels through the same
claim/transfer/revoke/release transaction. It also proves a bounded
manager-owned DeviceMmio record lifecycle: active-owner attach, stale and
owner-mismatch rejection, duplicate attach rejection, active RingDoorbell
validation through capos-lib::device_authority, binding to the first decoded
PCI memory BAR region from the tested PciDevice,
region_source=pci-decoded-memory-bar, region_bound_to_manager=true,
bar_present=true, bar_memory=true, bar_base, bar_length,
fail-closed wrong-BDF, wrong-BAR, and zero-length region metadata as
devicemmio-region-invalid, no invalid mapping created, negative side-effect
blocking,
stale-after-revoke rejection as devicemmio-stale-handle with
stale-owner-generation and side-effect-blocked, the
RevokingHandles -> MmioRevoked transition blocking while attached, bounded
detach, and no userspace handle, real BAR mapping, or doorbell write.
For the bounded userspace DeviceMmio.map path, the shared pure
capos-lib::device_authority validator now accepts only page-aligned in-BAR
read-only requests and denies writable, executable, unknown-protection,
zero-size, unaligned, offset-size overflow, and out-of-BAR requests before the
kernel maps anything. Accepted map requests install a caller-owned borrowed
read-only userspace VMA over boot-preseeded BAR pages only, and
DeviceMmio.unmap removes that borrowed VMA after checking the active manager
record and caller address space. Duplicate map, second unmap, stale release,
drop, driver-crash, and reset-disable paths all preserve fail-closed VMA
revocation and no-side-effect labels.
For the brokered userspace DeviceMmio.read32 path, the same authority layer
validates active handle/state/policy and in-BAR dword-aligned offsets before a
single kernel-side volatile read from the boot-preseeded cache. The cap call
path never installs new kernel mappings, returns typed
mmio-read32-range-invalid denials for unaligned, overflowing, and out-of-BAR
offsets, reports register_read=performed only for accepted reads, and fails
closed after cap release.
For the brokered userspace DeviceMmio.write32 path, the kernel runs the same
manager-attached identity, policy, and range checks before one volatile dword
write through the boot-preseeded kernel MMIO mapping cache, and it accepts
only the single PCI MSI-X metadata-derived provider-scoped masked
vector-control claim. The focused proof uses that idempotent dword on the virtio-rng
BAR, confirms it through both read32 and the read-only userspace VMA, and
proves an unclaimed message-address dword write leaves the original value
unchanged. Unclaimed, unaligned, overflowing, and out-of-BAR calls report typed
blocked labels before any write; stale or released handles fail closed before a
write and do not return a write32 result payload.
The same path proves a bounded zero-live DMAPool record lifecycle:
active-owner attach, stale and owner-mismatch rejection, duplicate attach
rejection, generation invalidation on revocation, DmaMappingsRemoved blocking
while attached, teardown detach, and terminal release. It now also issues and
records a bounded manager-attached DMA buffer handle under that attached pool,
validates active SubmitDescriptor through the pure DMA-buffer validator, and
records stale-after-revoke (stale-owner-generation), freed-buffer (freed),
and reused-slot (stale-slot-generation) rejection with
side-effect-blocked; the attached buffer record now also blocks zero-live
pool teardown as dmapool-buffer-attached, rejects a stale same-slot
proof-scoped FreeBuffer as dmabuffer-stale-handle with
stale-slot-generation and side-effect-blocked while preserving the
manager-owned buffer record, then validates an active FreeBuffer and
manager-owned buffer-record detach as ok, after which the existing
DMAPool detach succeeds. This still exposes no userspace handle and attempts
no real DMA. The pending IRQ token path now also delegates the
source/route-generation, masked, unregistered, invalid-owner, and malformed
identity decision to a host-tested capos-lib::device_authority validator
after snapshotting the live dispatch slot, while preserving the existing
stale-pending-irq-* QEMU labels. This remains validator/adapter evidence,
not production userspace interrupt waiter authority. The virtio-net smoke now also
derives an imported live-accounting
DMAPool record from the authoritative kernel-owned DMA ledger, records live
buffer/page count, live bytes, in-flight submissions,
committed/resident/unswappable flags, and scrub-before-release policy, and
proves both teardown detach and DmaMappingsRemoved fail closed while that
ledger is live. The live proof now consumes the device_dma teardown-evidence
API, records the expected authoritative-ledger-live block with matching
imported live accounting, and defers completion because this slice has no
authoritative zero-live/scrubbed evidence for the live virtio-net ledger and
does not attempt real DMA teardown, scrub, DmaMappingsRemoved, terminal
Dead, or release for the live virtio-net record. A separate scratch-ledger
proof reaches authoritative-ledger-zero-live only after both quiesce and
scrub markers are set, without touching live virtio-net DMA. Another scratch
proof covers stale DMA page handles by generation-tagging same-phys reuse and
rejecting stale, wrong-queue, wrong-label, and duplicate-free attempts without
mutating the active ledger. Userspace DMAPool/DeviceMmio/Interrupt
authority, real lifecycle hook plumbing, real page quiesce/scrub/release
cleanup, and broader driver interrupt dispatch remain planned.
The kernel negotiates VIRTIO_F_VERSION_1 plus MAC when safe and
VIRTIO_NET_F_MRG_RXBUF for QEMU’s merged-buffer virtio-net header, maps the
virtio MMIO regions after kernel paging is live, allocates
kernel-owned DMA pages for RX/TX descriptor, available, used rings, RX packet
buffers, and one-shot TX buffers, submits a descriptor proof frame, sends an
ARP request from 10.0.2.15 to 10.0.2.2, and observes the ARP reply in
make run-net. Those current DMA pages now pass through a bounded
kernel-owned device_dma pool ledger that proves live pool bytes, page counts,
page-rounded MMIO mapping bytes, config/RX/TX interrupt holds, RX/TX ring
depths, and RX/TX descriptor submission/completion accounting while no
userspace DMA/MMIO/interrupt handles are exposed. The net smoke also proves the
current kernel-owned budget/OOM policy with a scratch ledger: page and byte
allocation over budget, overlarge queue depth, duplicate and over-budget MMIO
holds, MMIO byte over budget, duplicate and over-budget interrupt holds, and
descriptor submission beyond queue depth all fail closed while the live
virtio-net ledger still validates normally, and the live device-manager record
proof is derived from that same ledger without zeroing the copied record as a
stand-in for real cleanup. The device-manager DMAPool record now also
carries that budget profile, and the lifecycle/imported-live proofs read it
through the active manager record plus matching DmaPoolHandle before
checking zero-live accounting or live aggregate in-flight accounting against
derived total budgets. The same scratch-proof pattern now covers zero-live
teardown evidence and stale DMA page handles. Production userspace DMAPool,
DeviceMmio, and Interrupt handles, production userspace DMA-buffer handles,
real page cleanup/reuse, real DeviceMmio mapping objects, cache
attributes/write policy enforcement, hostile stale-MMIO/DMA smokes, S.11.2
hostile smokes, and real doorbell writes remain unavailable; the current
malformed-region and manager-attached buffer proofs are only bounded
fail-closed metadata evidence in the manager proof path. The same smoke
sends
an IPv4 ICMP echo request to 10.0.2.2,
validates the echo reply identifier, sequence, payload, IPv4/ICMP checksums,
and addresses, and prints an icmp echo ok proof line. The former kernel
smoltcp TCP HTTP smoke, scheduler-polled smoltcp runtime, Phase B
NetworkManager/TcpListener/TcpSocket qemu-only cap objects, socket-backed
Telnet terminal handoff, and POSIX DNS UdpSocket smoke are retired. The
kernel no longer depends on smoltcp; qemu-only kernel TCP/UDP socket entry
points fail closed; and the corresponding Make targets exit before QEMU with
retirement diagnostics. Phase C (Networking
Part 3) moves TCP/IP behavior into a userspace network stack process and keeps
the kernel production surface focused on DMAPool/DeviceMmio/Interrupt
device capabilities.
The local serve-from-userspace proof now boots a non-qemu cloudboot manifest
where a userspace smoltcp service grants an application client a
TcpListenAuthority and serves TcpListener/TcpSocket caps for one hostfwd
TCP round trip. A later local DHCP/IPv4 proof now lands the first
lease/default-route/ARP configuration evidence on that userspace stack. Local
bounded ICMPv4 Echo Reply diagnostics are also proved through a local cloudboot
manifest, but remain diagnostic-only and outside the Web UI readiness ladder.
For the selected GCE Self-Hosted Web UI milestone, the evidence order is local
served TcpListenAuthority, local DHCP/IPv4, local Web UI L4, private GCE
reachability, then the separately authorized public ingress/TLS proof. The
legacy kernel socket owner no longer accepts non-qemu production manifest
grants; qemu-only fixtures keep their explicit kernel socket sources until the
broader Phase C exit cleanup removes that path.
Code: kernel/src/acpi.rs, kernel/src/diagnostics.rs,
kernel/src/pci.rs, kernel/src/device_interrupt.rs,
kernel/src/device_manager/, kernel/src/device_dma.rs,
kernel/src/virtio.rs, kernel/src/cap/network.rs,
kernel/src/cap/ring.rs, kernel/src/sched.rs, kernel/src/mem/paging.rs,
kernel/src/arch/x86_64/pci_config.rs, Makefile,
tools/qemu-diagnostics-smoke.sh, tools/qemu-iommu-acpi-smoke.sh,
tools/qemu-net-smoke.sh, tools/qemu-net-harness.sh.
Validation: make run-diagnostics, make run-iommu-acpi, make run-net,
make qemu-net-harness.
Security and Verification Track
The repo has Miri, proptest, fuzz, Loom, Kani, generated-code, dependency
policy, trusted-build-input, panic-surface, and DMA-isolation work. CI now runs
a bounded Kani gate for capos-lib bitmap, cap-table stale-handle, transfer
preflight, transfer rollback split between source-visible rollback and
destination-ledger restoration, and frame-grant accounting invariants. The
heavier prepare-copy to provisional-destination seam proof passed in the
high-memory make kani-lib-full Cloud Build gate, but coverage is not complete
for every trust boundary.
References: Trusted Build Inputs, Panic Surface Inventory, DMA Isolation, and Security and Verification Proposal.
Future Work
Future architecture includes service restart policy, capability-scoped system monitoring, notification objects, promise pipelining, service-facing SharedBuffer APIs on top of the MemoryObject substrate, scheduling-context donation, session quotas, SMP, storage and naming, userspace networking, cloud boot support, user identity, policy enforcement, multi-front-end terminal hosts, richer native command surfaces, and broader language/runtime support.
Design references:
- Service Architecture
- Storage and Naming
- Networking
- SMP
- Userspace Binaries
- Shell
- Boot to Shell
- System Monitoring
- User Identity and Policy
Changelog
A curated record of capOS’s shipped milestones: the significant, externally visible capabilities the system has demonstrated.
Each entry documents one landed milestone with the evidence that backs it – a shipped feature with measured behavior, a security finding closed with its fix and verification commands, a scaling proof with its data, or a benchmark with its host caveats – named, dated to the commit it landed at, and reproducible.
2026-06-09
Remote-session Web UI server-side session hardening – Review C high closed
- The capOS-served Web UI (
remote-session-web-ui) no longer derives itscapos_remote_sessioncookie from the accept counter. It now mints an opaque, high-entropy server-side session id – a one-way SHA-256 (domain-separated, base64url) over the kernel-CSPRNG backendSessionInfo.session_id– and a per-session double-submit CSRF token from the same seed under a distinct label. The raw backend id never crosses to the browser (the digest is one-way). Landed at91743ed4. - Server-side enforcement added before request dispatch: token rotation on
login/re-login, cookie expiry + fail-closed rejection on logout and on a
replayed rotated-out id, idle (30 min) and absolute (12 h) lifetime bounds via
absolute monotonic deadlines,
Host(DNS-rebinding) andOriginvalidation, and a requiredX-CSRF-Tokendouble-submit on state-changing requests. The session cookie isSecurewhenX-Forwarded-Proto: httpsreports HTTPS ingress; the plaintext loopback proof stays explicitly non-Secure. This matches the committed operator-bundle/host-bridge CSRF contract; no schema/kernel/ABI change. - Evidence:
make run-cloud-prod-remote-session-web-ui-l4(local QEMU/cloudboot) now drives stale-token, CSRF (missing/mismatch),Origin(missing/cross-site),Host, idle/absolute expiry, cookie-attribute, and login/re-login rotation denial gates, each failing closed before any backend-held capability call (report.jsonsessionHardening: all gates true,tokenLen43). Local proof only; not private GCE reachability, public ingress, or TLS.
2026-06-04
Userspace TCP over the capability NIC – TcpListener/TcpSocket round trip
- A userspace process now completes a full
TcpListener/TcpSocketround trip oversmoltcpdriven entirely through capabilities: frames cross theNiccapability, the kernel-owned keep-armed sustained-receive RX pool (Nic.receivePoll @4) feeds smoltcp’s RX token across the multi-frame TCP handshake, and no host-physical or device-usable address is exposed to userspace. This is the first userspace TCP (connection-oriented, multi-frame) path in capOS, landed at002c5927(Phase C slice 7c-iii). - It rests on the sustained-receive ABI that lifted the prior single-frame
Nic.receiveblocker (slice 7d,Nic.receivePoll @4, kernel-owned bounce pool with per-recycle scrub + slot-generation bump, no per-frame device reset). Evidence: therun-cloud-prod-network-stack-smoltcp-tcp-listener-roundtripQEMU proof exercises listen/accept/echo over the userspace stack. - Remaining for the full Phase C userspace L4 stack: the
cap/network.rsproduction-contract relocation (parent taskcloud-prod-userspace-network-stack-smoltcp-local-proof), after which the TLS-client handshake, self-hosted Web-UI L4, and IPv6-TCP tracks unblock.
2026-06-02
Real-GCE virtio-net NIC bind – the GCE Polling Path track closes
- The billable real-GCE proof passed: a real
e2-smallinstance (europe-west3-a, imagecapos-test-1780412056-e1cb, source commit1fb65683) booted the production non-qemucloud kernel from the legacy datapath manifest, the kernel-brokered legacy polled path bound the live GCE virtio 0.9 NIC (00:04.0,1af4:1000), and the run passed thetools/cloudboot/run-test.sh --require-provider-nic-proofgate. Run1780412056-e1cb;teardown_status=complete, no leaked sandbox resources. - This is the first real-hardware attestation of the legacy bind. Every stage
ran end to end on GCE: candidate select over PIO BAR0 (
iobase=0xc040), I/O + bus-master enable (command=0x0107), real GCE device MAC read (src_mac=42:01:0a:c8:00:12),NET_F_MACnegotiation (device_features=0x204399a7), full 4096-entry vring materialization (rx_queue_size=4096 tx_queue_size=4096 rx_vring_pages=28 tx_vring_pages=28– the ~110 KiB/28-page contiguousframe::alloc_contiguousper queue that QEMU cannot emulate, since QEMU caps queue size at 1024), a broadcast DHCP DISCOVER TX, and a real device->host RX DMA within the TSC-governed wall-clock budget (rx_used_len=532 ethertype=0x0800IPv4,rx_clock_usable=true,rx_iters=1). Marker:cloudboot-evidence: provider-nic-bound 0000.00.04.0-vendor.1af4-dev.1000-iobar.0-iobase.c040-usedidx.1-usedid.0-usedlen.532-ethertype.0800-txusedidx.1-srcmac.42010ac80012fromcap::provider_nic_bind_proof::report_real_completion_legacy. - The bind reached the device only after three distinct real-hardware premise
conflicts were closed by prior local slices, each found by a bounded billable
run: modern-only candidate select vs the legacy device (5b), the QEMU-SLIRP-only
RX stimulus vs GCE anti-spoofing (5c, real-MAC DHCP DISCOVER + accept-any
wall-clock RX), and the device’s 4096-entry queue exceeding the prior 1024 bound
(5d,
MAX_LEGACY_QUEUE_SIZEraised to the spec max 32768). - Honest scope: this is a kernel-brokered, polling-only data-path attestation
(
userspace_driver_authority=kernel-brokered-legacy-polled,interrupt_model=polled-no-msix,device_autonomous_raise=not-claimed,direct_dma=blocked,host_physical_user_visible=0). It is not a claim of userspace-driver authority, device-autonomous MSI-X delivery, an L4 socket round-trip (raw-frame reachability per the slice-5 Option-A decision; L4 is networking-proposal Phase C), or cloud storage readiness. It retires thecloud-gcp-virtio-net-nic-driverblocker. - Reproduce: build the cloudboot image with
make capos-cloudboot-image MANIFEST_SOURCE=system-cloud-provider-virtio-net-legacy-datapath.cue, confirmmake run-cloud-provider-nic-bound-legacy(andmake run-cloud-provider-nic-bound-legacy-large-queue) green on the build commit, thentools/cloudboot/run-test.sh --require-provider-nic-proof(BILLABLE; operator-authorized 2026-05-27, commit2aaeaa53).
2026-05-30
Device Driver Foundation – production bind-stack qemu-gate dissolution
- Umbrella
cloud-prod-ddf-bindstack-qemu-gate-dissolutionclosed at commitfdc8eb66. The production (non-qemu) cloud kernel’s device-authority surface is now always-built code fronted by fail-closed runtime capability probes, graduated off the overloadedqemugate and the per-proofcloud_*_prooffeature modules it previously hid behind, whileiommu.rsstays gated and brokered bounce-buffer-only DMA (no host-physical/IOVA export) is preserved. Landed as six reviewed slices:29a76850– RX MSI-XInterrupt.waitwaiter-wakeup determinism: the provider-consumer flake was a synthetic-dispatch ordering race, fixed by gating injection on the owner being parked incap_enter(sched::thread_blocked_on_cap_enter); 28/28make run-ddf-provider-consumer(baseline ~18% flake).ef2548b3– grant-source de-specialization: the prod{dmapool,devicemmio,interrupt}_grant_sourcestatics stage an arbitrary enumerated function through onestage_with_classentry point taking aProdGrantClassdescriptor (cap::prod_grant_source_class), bit-identical.b7d30ec3– MSI-X program/attach/arm/unmask + kernel-injected-dispatch wait graduated into always-builtcap::interrupt_programmed/device_interrupt::wait_kernel_injected_dispatch.82c2ed53/b2168e05/ad6da6ce– thedevice_managerbackend port: always-builtProductionDeviceTabledevice-record/handle backend, per-record bounce-buffer DMA-pool backend, and interrupt-route backend (parentcloud-prod-device-manager-backend-port).fdc8eb66– split test-harness affordances off theqemufeature.- Reproduction:
make run-cloud-devicemmio-grant,make run-cloud-dmapool-grant,make run-cloud-interrupt-grant,make run-cloud-provider-cap-waiter,make run-cloud-provider-nvme-readonly-bind,make run-ddf-provider-consumer,make run-net. Remaining DDF work – userspace virtio-net RX/multiqueue and NVMe I/O-queue provider readiness, plus live cloud bind – is tracked by the separate provider-parent tasks, not this umbrella.
2026-05-23
Device Driver Foundation – userspace virtio-net provider closeout
- Commit
c86374f8(2026-05-23 16:51 UTC) closes the first local bounded userspace virtio-net provider-driver proof for Task 6. The provider-consumer smoke now asserts one stable closeout line tying together selected queue1TX descriptor/avail/doorbell/used-ring/CQ ownership across the full QEMU TX queue depth, bounded queue0RX synthetic-token CQ identity, selected TX/RX MSI-X/LAPIC wait/ack/EOI, selected-route mask/unmask/reset/reassignment, teardown, stale-handle blocking, and no silent provider fallback. Reproduction:make run-ddf-provider-consumer,make run-net. This remains bounded local QEMU provider evidence over manager-owned bounce buffers; live hardware RX used-ring ownership, full virtio-net ownership, direct DMA/IOMMU, cloud NIC/storage readiness, and virtio block/storage drivers remain separate work.
Device Driver Foundation – provider TX full-depth CQ ownership
- Commit
e248d42b(2026-05-23 13:36 UTC) extends selected userspace virtio-net TX CQ ownership from the prior four-outstanding window to the full eight-entry TX queue depth used by QEMU. The smoke now proves eight live manager-owned bounce buffers, descriptor/avail publication and notify doorbells for descriptors0through7, wrong-order completion fail-closed at descriptor7, in-order CQ identity delivery/ack for all eight descriptors, ninth allocation rejection without pool expansion, teardown-only drain for seven incomplete descriptors, and release retirement of seven delivered but unacknowledged CQ events. Reproduction:make run-ddf-provider-consumer,make run-net.
Device Driver Foundation – provider RX wait/ack dispatch-token proof
- The provider-consumer smoke now promotes the provider RX interrupt grant’s
wait/ack path beyond the blocked skeleton for one selected RX dispatch token.
rx_interrupt.waitcan pend, stay unpromoted by generic route delivery-count advancement, and wake only after a selected RX MSI-X/LAPIC dispatch validates the live RX issue, selected RX source, source generation, route generation, virtio-net owner, and driver-unmasked route state. The pairedrx_interrupt.acknowledgeaccounts exactly one bounded RX hardware-dispatch ack for the delivered zero-CQ RX event; pre-event, masked-route, duplicate, and stale-after-release wait/ack attempts remain fail-closed. Reproduction:make run-ddf-provider-consumer. RX descriptor publication, RX CQ identity, real hardware IRQ acknowledgement/deferred EOI, direct DMA/IOMMU, full virtio-net ownership, cloud NIC/storage readiness, and production driver readiness remain open.
POSIX Adapter – File/Directory fd closeout
- Commit
f97d9833(2026-05-23 06:23 UTC): theposix-file-directory-client-capos-rttask closes the v0 File/Directory fd surface on top of the existing Storage Phase 3 RAM-backedDirectorycap.libcapos-posixnow implementslseek()over the per-fd file position andreaddir()as a lazyDirectory.listsnapshot, while preserving the existing pipe, UDP, Console, and TerminalSession fd paths. The newmake run-posix-fileproof boots a live C process that creates a file throughopen(), writes, seeks, reads, lists the root directory withopendir()/readdir(), closes both handles, and asserts relative paths still fail closed. Remaining P1.4 dash-port work is the printf/string subset, signal/time stubs, identity stubs, dash vendoring/patching, multi-TU C build, andrun-posix-shell-smoke.
Device Driver Foundation – provider TX release retires three unacked CQ events
- The provider-consumer smoke now extends the selected TX release-retirement
path to three delivered but unacknowledged bounded provider TX CQ events in
one live issue. The smoke completes descriptors
0,1, and2, consumes all three throughtx_interrupt.wait, skips all acknowledgements, proves the stale-bound in-flight descriptor remains in fixedDMABufferslot3, and asserts providertx_interruptrelease retires three pending provider completion acks without hardware acknowledgement. The claim remains bounded CQ teardown evidence only; deferred EOI, hardware acknowledgement, hardware IRQ ownership, direct DMA/IOMMU, full CQ ownership, full userspace virtio-net ownership, cloud NIC/storage readiness, and production driver readiness remain open. Reproduction:make run-ddf-provider-consumer.
Device Driver Foundation – provider TX release retires unacked CQ event
- Commit
11eeab2e: the provider-consumer smoke now proves release-time retirement for a delivered but unacknowledged bounded provider TX CQ event. The smoke drives a selected TX completion throughDMABuffer.completeDescriptorandtx_interrupt.wait, deliberately skipstx_interrupt.acknowledge, releases the providertx_interruptcap, and asserts the release proof records one pending provider completion ack retired from the ledger. The stale post-releaseacknowledgepath remains revoked; the completed buffer can still be freed normally; and deferred EOI, hardware acknowledgement, hardware IRQ ownership, direct DMA/IOMMU, full CQ ownership, full userspace virtio-net ownership, cloud NIC/storage readiness, and production driver readiness remain open. Reproduction:make run-ddf-provider-consumer.
Device Driver Foundation – provider RX descriptor boundary proof
- Commit
2bd5add5: the provider-consumer smoke now records an explicit provider RX queue0descriptor boundary while the live RX interrupt issue is active. The proof keeps the existingDMABuffer.submitDescriptor(queue=0)/completeDescriptor(queue=0)path as neutral bounce-buffer accounting and asserts that RX ring publication, provider CQ publication, provider IRQ delivery, hardware acknowledgement, and direct DMA remain blocked while kernel RX cohabitation is unresolved. Reproduction:make run-ddf-provider-consumer. Honest caveat: RX descriptor publication, RX CQ identity, RX waiter delivery, direct DMA/IOMMU, full virtio-net ownership, cloud NIC/storage readiness, and production driver readiness remain open.
2026-05-17
Kernel – scheduler/IPC recoverable-panic surface closed
- The scheduler and IPC hot-path
.expect()/.unwrap()sites that could panic on stale run-queue or thread-metadata invariants are now hardened across the seven hot-path functions.block_current_on_cap_enterlogs and returnsfalse(yieldingu64::MAXat the syscall boundary);next_start_contextlogs and returnsNoneso the caller’s retry loop (kernel_idle_entry,start_current_cpu,start_ap) selects another thread; andschedule(),exit_current(),exit_current_thread, andcapos_block_current_syscall()drop the scheduler lock andcrate::hcf()on the dispatch/exit/block paths that have no caller-side recovery, matching the canonical last-process-exited halt.retain_endpoint_queue(kernel/src/cap/endpoint.rs) breaks with a diagnostic kprintln on a queue-length mismatch instead of panicking onpop_front(). ThePhysFrame::from_start_address(cr3_phys)panics on the exit and syscall-entry paths are intentionally retained: corrupted CR3 is genuine memory-state corruption, not transient queue inconsistency. An explorer audit confirmed no recoverable panic surface remains (2026-05-17 00:41 UTC).
Kernel – resource quota fields fully wired
- All three sub-items from the prior “partially wired” resource-quota
finding are now enforced (closed 2026-05-16 19:19 UTC). The per-process
carrier (
capos_config::ResourceProfileonProcess::resource_profile) lands the profile at spawn fromSessionMetadata::profileviaRamAccountStore.ringScratchLimitBytessizes the per-process input/output/reply scratch buffers and rejects oversize CALLs withCAP_ERR_INVALID_REQUEST;replyScratchLimitBytesclamps the exception reply-scratch buffer to the profile ceiling (closed 2026-05-16 20:52 UTC), fixing the #175 asymmetry that produced spuriousCAP_ERR_APPLICATION_EXCEPTION_TRUNCATEDon small-ring processes;endpointQueueLimitandinFlightCallLimitcarry the owner profile’s values intoEndpoint::try_new, clamped by the kernel ceilingsMAX_QUEUED_CALLS=32/MAX_IN_FLIGHT_CALLS=32. Scope caveat: the endpoint-scoped bounds are per-endpoint relative to the owner profile, not a strict per-process counter across all endpoints the owner holds. ResourceProfileRecord@13is tombstoned asretired13. Reproduction:make run-ring-scratch-limit,make run-reply-scratch-limit,make run-endpoint-queue-limit,make run-in-flight-call-limit.
Device Driver Foundation – Hardware-audit userspace service (durable-audit Step 2b)
- Commits
037256ce(initial slice) and the remediation follow-up on the sameddf-audit-userspace-servicebranch: durable-audit Step 2b of 4 deliversdemos/hardware-audit-service, which pollsHardwareAuditLog.drainwith the cursor protocol from Step 2a, accumulates records in memory, and serves a typedHardwareAuditReader.snapshotover a kernel-allocated Endpoint retagged for the consumer. The reader cursor is fail-closed: anexpectedSequenceoutside{0, current drain cursor, any retained record's sequence}– or a malformed-capnp param payload – is rejected with a typedInvalidArgumentexception, mirroring the kernel-sideHardwareAuditLog.drainrejection so a stale or forged cursor cannot silently skip or repeat records. The schema docstring forHardwareAuditReadernow states this contract explicitly. Reproduction:make run-ddf-audit-service-smokeproves boot-record accumulation, a service-handoff snapshot, a release-triggered follow-up snapshot,signatureStatus = "unsigned", and the negative cursor-mismatch rejection (matched on both the service-sidesnapshot-rejected exception_type=invalid-argumentand the consumer-sidecursor-mismatch rejected ok exception_type=invalid-argumentmarkers). Steps 3 (segment signing with key management routed through the cryptography proposal) and 4 (durable Store-backed persistence with a defined rotation contract) remain open follow-ons.
2026-05-16
POSIX Adapter – P1.4 Slices 3 and 4: functional file I/O end-to-end
- P1.4 Slice 3 (FdBacking File/Directory/Terminal variants and the
make run-posix-file-backing-smokeproof of Terminal routing) landed atae58f936(closing merge4c70a03d). P1.4 Slice 4 (absolute-path resolver, functionalopen()/opendir()over the bootstrap-granted rootDirectorycap, per-fd file position tracked acrossread()/write(), and themake run-posix-open-smokeproof of create+write+close, then re-open for read, plus relative-path rejection) landed at94b29177(closing mergede4235f9).closedir()releases the local slot only. This is the first non-shell POSIX subsystem to reach functional parity:open(path, flags)–read/write–closeworks end-to-end throughlibcapos-posixon top of the Storage Phase 3 RAM-backedDirectoryauthority. Reproduction:make run-posix-file-backing-smoke(Slice 3),make run-posix-open-smoke(Slice 4), and the existingmake run-posix-stdio-smokeregression remains green. Honest caveat: the remaining v0 dash port work is stdio adoption (Slice 5), env vector (Slice 6), printf/string subset (Slice 7), signal/time/identity stubs (Slices 8-10), and dash vendoring + smoke (Slices 11-13).
Scheduler – Phase F remote-CPU nohz activation via reschedule IPI
- Commit
8c1601ac: the Phase F auto-nohz preflight no longer requires the lease’s target CPU to be the current CPU for thenamedRing = nonecompute-lease shape. When the single-CPUallowedCpuMasktargets a different scheduler CPU, the kernel parks a bounded remote-activation request in the target CPU’s per-CPU slot and sends a reschedule-style IPI; the target CPU drains the request from its IPI handler (timer-handler backstop) and re-runs the full disqualification check locally undertry_lockbefore arming its own one-shot deadline – remote activation is never trusted blind. Reproduction:make run-scheduler-cpu-isolation-lease.
Device Driver Foundation – virtio-net provider RX bootstrap-grant skeleton
- Commit
b710d4fd: the provider RX path now mirrors the provider TX bootstrap-grant authority at the skeleton level over the selected virtio-net RX MSI-X route. Addsvalidate_provider_rx_interrupt_routeas the receive-queue counterpart of the TX route validator (same admission shape, same active/resetting state gate, sameinterrupt_owner_for_device_ownermapping; only thePciMsixInterruptRole::RxQueuerole tag differs) plus matchingcap::interrupt_grant_sourceinit/build/release entry points. Skeleton only: live RX DMA, completion delivery, and hostile-smoke coverage remain open.
Device Driver Foundation – provider RX selected-route MSI-X control
- Commits
5ea850c3,1d2be684, and9f3f8a8c: the provider RXrx_interruptcap validates the live RX issue and selected virtio-net RX route before bounded mask/unmask of the selected RX MSI-X table vector-control bit and route state. The provider-consumer smoke asserts vector-control readback, delivery-count preservation, stale methods after release, and release-while-masked cleanup back todriver-unmasked; cleanup failure leaves the live issue uncleared so future RX cap issuance stays blocked on uncertain route state. RX wait/ack, descriptors, provider CQ identity, hardware acknowledgement, deferred EOI, full RX ownership, direct DMA/IOMMU, cloud readiness, and production userspace driver readiness remain open.
SSH session – explicit UserSession.logout failure-path proof
- Commit
9e7328e6:test(ssh-public-key-session)proves the explicitUserSession.logoutfailure path, closing part of the open REVIEW_FINDINGS Low item.
Storage Phase 2 closed in docs
- Storage & Naming Phase 2 (schema BlockDevice/File/Directory
interfaces) was marked done in
docs/tasks/done/2026/(commit0551941c) after Phase 3 slices 1–3 shipped on 2026-05-14. A separate task-state reconciliation sweep (ad280bf2, closing merge189d4af2) realigneddocs/tasks/directory state withdocs/tasks/README.mdground truth.
Production Provenance Milestone
- Landed across
1feee12b..6f775925(2026-05-16). All GitHub Actions steps pinned to immutable commit SHAs, Rust nightly pinned to an exact date, and OVMF/qemu-system-x86/xorriso pinned to exact apt versions in the QEMU smoke apt-install. Each CI run publishes a build-provenance artifact and PRs run an advisory cross-run compare against the base-branch artifact. Reproduction:make build-provenanceproduces the provenance artifact locally; CI artifacts appear asbuild-provenance-<sha>per qemu-smoke run. Full pin inventory and bump procedure:docs/trusted-build-inputs.md. Honest caveats: the PR compare step is advisory-only (not PR-blocking); URL-based download-and-verify for OVMF and other pre-built tool binaries (Option B) remains future hardening.
2026-05-14
Device Driver Foundation – IOMMU VT-d remapping closed across A1/A2/B/C
- The QEMU Intel IOMMU remapping milestone closed across four reviewed slices
in a single day: A1 active-programmed legacy-mode table (
3a60a401), A2 hardware-DMA translation proof through virtio-rng with an observed VT-d fault on an unmapped IOVA (dfedf574), B register-based context-cache / IOTLB invalidation with an ordered scrub-after-invalidate revocation cycle (24eb587e), and C two-phase revocation with hostile stale-handle / stale-completion smokes (closing merge274ff63f, follow-up873eef56). Reproduction:make run-iommu-remappingassertstable program proof,invalidation proof,hostile stale-handle proof, andhostile stale-completion proofall asproof_result=ok(QEMU 8.2.2). Honest caveat: QEMU-only evidence (hostile_hardware_isolation=not-claimed); the live virtio-net DMA path still uses bounce buffers, and production userspace-driver IOMMU authority remains open.
Device Driver Foundation – virtio-net provider four-outstanding TX window
- The userspace virtio-net provider TX path reached a four-outstanding
completion-queue window with full provider descriptor/avail publication,
one notify doorbell per descriptor, real IRQ-dispatch-backed
tx_interruptwait/ack/mask/unmask, per-event provider CQ identity, and a four-slot bounce-buffer pool. The provider notify doorbell write moved off the broker into the provider’s own scoped notify-MMIO cap (ef979d17), and the harness now proves a generalwrite32verb fails closed on that same notify cap (95a65e99). Reproduction:make run-ddf-provider-consumer,make run-net.
Scheduler – per-CPU CPL0 kernel idle thread closed
- The user-mode idle process was replaced with a per-CPU CPL0 kernel idle
thread across Increments 1a..2e (final merge
2bba8d11). All four idle dispatch sites route through the CPL0 idle context:schedule()timer path,capos_block_current_syscall(block path),exit_currentandexit_current_thread(two exit paths). Each scheduler CPU slot owns a dedicated CPL0 idle kernel stack andCpuContext; the synthetic idleProcessrecord is retained only so the idleThreadRefresolves through the scheduler’s ThreadRef-centric bookkeeping. A TLB-flush drain inkernel_timer_interrupt_handlercloses the CPL0 idle-residency gap. Reproduction:make run-scheduler-cpu-isolation-leaseassertsidle_path=cooperative-cpl0for the boot/AP loop andidle_path=cpl0-dispatch-{timer,block,exit}for the dispatch sites.
Scheduler – per-CPU nohz tick suppression and SQPOLL-driven activation
- Phase F got its first real nohz increments: per-CPU periodic-tick
suppression for the single-runnable window (commit
9e269e31) and SQPOLL-driven auto-nohz activation for ring-coupled leases (commit8edd2314), built on per-CPU CPL0 idle-thread context infrastructure (commit5ac6b08f). Generic full-nohz and timeout-based auto-revoke remain future work.
Storage – Phase 2 schema + Phase 3 slices 1–3
- Phase 2 added schema-only
BlockDevice/File/Directory/DirEntryinterfaces (commit4c0d940c). Phase 3 then delivered three kernelCapObjectslices the same day: a RAM-backedFilecap with read/write/stat/truncate/sync/close and 64 KiB per-call inline-payload bound (slice 1,d06dff6b); a RAM-backedDirectorycap with subtree bring-up + grant path + QEMU smoke (slice 2,b11ec9e4); and a content-addressed RAM-backedStore+ name->hashNamespacepair (slice 3,804a3f41). Reproduction:make run-file-server-smoke,make run-directory-server-smoke,make run-store-namespace-smoke. This unblocks POSIX adapter Phase P1.4 (dash port) and WASI host adapter Phase W.5 (filesystem surface), both previously blocked on the cap shape.
Remote session – self-served web UI is the default boot
- Commit
5594c9efwires the capOS-served remote-session web UI into the default operator boot (cue/defaults/defaults.cue, scoped loopback listener cap,make runhost-port line). The Rust backend owns connection and session state; the browser receives view models and redacted transcript rows. Reproduction:make run, or the focusedmake run-default-web-ui.
2026-05-13
Device Driver Foundation – provider TX three-outstanding + MSI-X mask + ack
- Provider TX CQ window expanded from two to three outstanding descriptors
(
d6458381,6ee96d29), descriptor-issue-bound completions (f894ee1a,bc3280be), MSI-X mask/unmask control with atomic rollback on failure (b5af7335,3c8dc627,ca9beb73,be506075), real IRQ-dispatch wake of provider TX waiters (203461da), and hardware dispatch acknowledgement accounting (caaa388c,bc11c1aa). Provider CQ teardown gained an end-to-end quiesce/retire proof (d8760182,dff94930,235ed1e8). Reproduction:make run-ddf-provider-consumer.
Scheduler – SQPOLL producer-wake progress
- Scheduler added the bounded SQPOLL producer-wake increment
(commit
0dbb5542) with preserved wake-result state across stop (commitdbf8e0ff) on the Phase F prerequisite path that the 2026-05-14 SQPOLL-driven auto-nohz activation builds on.
Kernel – endpoint / park-waiter rollback hardening
- IPC rollback paths hardened: endpoint pending-recv rollback
(
e46b52dc), preserved endpoint recv rollback capacity (db454b59), kept endpoint recovery live across revocation (a1ccbda1), preserved park wake status during retry (d37edf54), and drained private park waiters after unmap-retry (1e0ce242). Bounded recovery surface; no new authority.
Language adapters – POSIX stdio, WASI env, libcapos C pipe + entropy
- POSIX adapter routed stdio writes to the Console cap (
aa6a56d7,d442a3b7) and exposedposix_spawnfile actions (b8fb3131). WASI added bounded per-instance environment grants (5f5028e7,987e7814), promoted stdio-compatibility Preview 1 imports (1a79037b), and gained an unauthorized-import refusal harness (c803565f,756b5ba8,1b53acd8). Native libcapos shipped a C pipe smoke (b6c2d4bb) and anEntropySource.fillwrapper (b1f7a3c1,6b3b7425).
Remote session – Paperclips Path B + Tauri scaffolding
- Paperclips launch wiring Path B (worker + gateway + bridge) landed
(
701522b9), with host chat RPC facade over DTO (7cf4cf2c) and DTO schema for Path A (e159adb8). A remote-session Tauri wrapper scaffold (5691ec2a) plus capability policy hardening and preflight (eff47eb3,b41ba656,ff58cacf) prepare the desktop wrapper path.
Build hygiene – build-provenance compare harness
- New build-provenance comparison
maketarget (07722584,a34cb441,b3ed23cf,779f3ce0) records runner identity in provenance (00272130) so two runners can be compared offline.
Docs sweep – proposal cross-link + last_reviewed refresh
- Substantial documentation refresh across ~40 proposals and architecture
docs: cross-links sharpened,
last_reviewedstamps refreshed, and proposals updated against shipped state (POSIX P1.1/P1.2/P1.3, Scheduler Phase D/E close and Phase F SQPOLL, error-handling against current ExceptionType + CapException, networking proposal against transitional kernel state, userspace-binaries Parts 4/5, scheduler Phase F.5 cross-link). The 29-page research index landed (18fbaf35). No behavior change.
2026-05-12
Scheduler – Phase F nohz scaffolding + CpuIsolationLease
- Phase F infrastructure landed across the day: nohz telemetry
(
aeb0f4d2), the clockevent/deadline substrate (268b44c2),CpuIsolationLeasescaffold (e9ab9e46,97f958a7), bounded SQPOLL ring mode (6dcbb69a,07578bec,cdbb45be), and housekeeping/ deferred-work placement (c7580873). These are the prerequisite chain the 05-14 SQPOLL-driven auto-nohz activation depends on.
Device Driver Foundation – provider TX completion delivery + IRQ regrant
- Provider TX gained completion-event delivery (
334818af), hardened interrupt delivery (dfebd411), serialized wait posting against IRQ race (bbe2fea7), and reset-disable lifetime hardening (50c6c8cd). Provider TX interrupt teardown event proof (9205af3e,d8760182), interrupt reset reassignment proof (8aee7d42), and a disabled Intel QEMU IOMMU scaffold with MMIO status diagnostics (02688941,41314553,277dbd26) prepared the 05-14 IOMMU Slice A1.
Userspace runtime – libcapos fail-closed on C-runtime threading
- libcapos now fails closed on C-runtime threading attempts (
57ad4bfa), closing the remediation entry tracked indocs/proposals/libcapos-c-substrate-proposal.md.
Remote session – local UI bridge hardening
- Local UI bridge was hardened (
cab6f791) with refreshed evidence (21a6a16a); login auth denial is now treated as a login failure rather than a transport error (58d198f3,b8486eac).
2026-05-11
Device Driver Foundation – provider/consumer split + DmaPool manager-authoritative
- The selected virtio-net userspace provider path proof landed
(
9ca39ff8,e6fb4c91): provider-consumer authority smoke (3d59cb8b), provider shadow descriptor side effect (db5c4995,c91b1477), selected provider metadata gated to the TX queue, MMIO/IRQ smoke (c52064c0,d1c6cece), and DMA accounting extension (40a77ae0). DmaPool lifecycle became manager-authoritative (96e1107e): blocks mapped DMABuffer manager-free (149ef53e), keeps DMABuffer unmap live in the manager (d6b7a292), validates cap identity before stale side effects (4f404f44), and propagates budget checks through release paths (877ed956,01c078f5,45903e5e). The provider notify-MMIO no-write cap grant landed (78c627a9,d177ffa1,b8f7b83f) with stale admission proof (0ec23fe6,a5a722de,fefd5267,3ed4418f) and submit-path carry (54b1499f).
Scheduler – one SQ consumer + session-logout hooks + context bind/revoke
- One SQ consumer owner is now enforced (
c427f3d9,b503b640) – the Phase F prerequisite that ring-coupled nohz depends on – with matching scheduling-context ring/SQ owner proof stabilizations (e5ec448c,811a4976). Scheduling contexts can donate over endpoints (d2eab605,6343ec84) with block/settle around donation (93706ecb,d202ca31,599f4649), bind/revoke generation (3b6e1bb5), identity-aliasing fix (fe4e340b), notification cells (c0e4470c), and exit-cleanup budget preservation (f30a9bb3,8b00c480,3725c87e,eaeb1071). Session-logout cleanup hook marks bound contexts stale (594f1353), propagates shell exit to session logout (0d9b3d90,0e9c8dd1), proves logout stale contexts fail closed (59dca4e8), donated logout skip policy (49de54f2), and blocks stale-snapshot budget refresh (9d1cd80b).
Device Driver Foundation – device-manager kernel refactor
- The device-manager kernel module was split into focused submodules
(
99c37592,734383f9,af539f6c,9c0a5183,98dddb72,bfdb78a0), reducing the proof-stack footprint and exposing shared authority admission helpers ahead of Slice A1’s per-domain remapping work. Process-exit teardown smokes for DMA/MMIO and Interrupt landed (b452f18e,746c1742), and a stage for the DDF IOMMU remapping-domain ledger went in (636edfb2).
Remote session – observable self-served UI proof
- Self-served remote-session web UI gained an observable proof
(
e4ab7b41,0eb68aa8,971d8ce8,65fe4bf7,28db3277,505b553c,f0254b02), closing the read-side gate before its 05-14 default-boot promotion.
2026-05-10
Scheduler Phase D closed
-
2026-05-10 19:39 UTC, mainline commit
77caafc0(closeout1a08ec23): Phase D is closed. Accepted WFQ slice coversSchedulingPolicyCapweight/latency-class authority, per-thread weighted vruntime and per-enqueuevirtual_finish_ns, per-CPU WFQ run queues, bounded steal/migration invariants, fairness/interactive/ weight-change smokes, and the controlled Task 6 thread-scale gate.Phase D thread-scale benchmark (five runs, KVM, physical-core logical CPUs
0,1,2,3, blocking parent join, 262,144 blocks / 16 MiB,work_rounds=64):Comparison capOS Phase D WFQ Linux pthread capOS gate 1->2 work 1.809x1.996x>= 1.6x1->2 total 1.774x1.995x>= 1.6x1->4 work 3.088x3.974xdiagnostic target >= 2.5x1->4 total 2.700x3.850xdiagnostic baseline > 1.538xThe 1->2 work/total rows passed the harness-enforced gate; the 1->4 rows were manually accepted from recorded diagnostics. Raw artifacts:
target/thread-scale/20260510T193200Z/andtarget/linux-thread-scale/20260510T194600Z/. Reproduction intools/qemu-thread-scale-harness.sh(viamake run-thread-scale) and the matchingmake run-linux-thread-scale-baseline.Bottleneck analysis. Linux pthread scaled near-linearly on the same physical CPU set, so the workload shape is sound and the remaining 1->4 gap is a capOS scheduler/runtime cost. The dominant contributors visible in measure-mode and post-thread-scale review are:
- Global
Schedulerlock contention. Per-CPU WFQ run queues exist, but several scheduler decisions (cross-CPU wake targeting, direct-target stale cleanup, queue reservation accounting) still funnel through oneSchedulermutex. Total-time scaling regresses faster than work-window scaling because exit/join/block/schedule paths spend disproportionate time inside that critical section. - Process-wide capability ring under one SQ consumer. A multi-thread process has one ring endpoint owned by one SQ consumer at a time. Completions, waker resolution, and direct IPC all serialize there, even when scheduler dispatch is per-CPU.
- Temporary four-owner scheduler-CPU assumption. The selected scheduler topology is currently hardcoded to four owner CPUs; the boot-time CPU set is not yet discovered, so workloads larger than four cores cannot be admitted at all.
- Periodic-tick service tax. Non-isolated CPUs still pay the periodic timer-tick cost on every scheduler tick, even when no thread is ready to run; nohz suppression today only fires inside the narrow single-runnable window and the SQPOLL-coupled lease.
Planned architecture changes that should improve SMP / threading scalability. The roadmap response is concrete:
- Scheduler Phase F.5: Full-SMP 16/32-core scalability
(
docs/backlog/scheduler-evolution.md, cross-linked fromdocs/proposals/smp-proposal.md). Replaces the four-owner assumption with dynamic CPU topology discovery, adds the x2APIC backend needed for higher APIC ids, shrinks scheduler shared-state serialization so local pick/requeue can avoid the global lock, and adds topology-aware placement plus an observable migration policy. A 1/2/4/8/16/32-worker hardware benchmark suite against a matching native-Linux baseline is the gate. - Ring v2: per-thread capability ring ownership
(
docs/proposals/ring-v2-smp-proposal.md). Completions route byThreadRef -> RingEndpoint, removing the single-SQ-consumer bottleneck and unblocking concurrent scheduler-owned work on more than one CPU per process. Needs TLB shootdown + cross-CPU cleanup review. - Generic full-nohz + generic SQPOLL nohz
(
docs/architecture/scheduling.mdPhase F follow-ons). Extends the bounded SQPOLL-driven activation (2026-05-14, commit8edd2314) to arbitrary rings and threads, retiring the periodic-tick service tax on non-isolated CPUs and enabling realtime islands. - EEVDF policy evaluation (deferred behind Phase F). Tracked as a follow-on dispatcher policy, not a Phase D blocker.
SchedulingContextover endpoints (Phase E, see 2026-05-14 entry above). Already-landed; lets a server inherit a caller’s reservation, reducing IPC scheduling round-trips for the same workload class.
- Global
Device Driver Foundation – DDF authority surface broadened
- DDF capability surface broadened in one day:
DeviceMmio.mapreturns a read-only userspace BAR VMA with theunmapcleanup paths exercised across release/drop/driver-crash/ reset-disable;DeviceMmio.write32exposes typed admission;DMAPool.allocateBufferreaches a three-slot bounce-buffer pool with manager-owned generation;DMABufferaccounting tracks per-slot in-flight descriptor identity withlive_inflightaggregation; andInterrupt.mask/unmaskperform bounded manager-mediated route-state control. Typed denials for invalid protections (mmio-map-prot-invalid) and invalid allocate-buffer requests (dmapool-allocation-request-invalid) now route through the no-result-cap admission path. Reproduction:make run-devicemmio-grant,make run-dmapool-grant,make run-interrupt-grant,make run-hardware-grant-cycle. This remains bounded manager accounting and admission proof: no direct DMA, doorbells, IOVA exposure, IOMMU programming, or production driver consumer.
2026-05-09
Device Driver Foundation – Interrupt/DMABuffer admission + audit hardening
- The
Interruptskeleton gained typed admission acrossacknowledge,wait,mask, andunmask(admission-check-only,*-not-attempted,side-effect-blocked), with the pending-IRQ token validator factored intocapos-lib::device_authority.DMABuffer.submitDescriptor/.completeDescriptorfollow the same admission pattern with per-slot in-flight accounting and typedfreeBuffer. The manifest-grantedInterruptretains its claimed MSI-X source across cap releases for sequential grant-source reuse (commit681e48ac). HardwareAuditLog.snapshotexposes the volatile snapshot contract as typed result metadata (bounded-volatile-ring-drop-oldest,volatile-only,unsigned,production-admission-policy-not-implemented) plus typed truncation labels (commite4cea6ff), astartSequencecursor for the retained ring, and cursor edge-case metadata for the below-oldest / past-end cases. Hardware audit smokes assert grant-source acquire/ release identity forDeviceMmio,Interrupt,DmaPool, andDmaBuffer; cap-audit assertion sets are now exact-count anchored so boot-timevirtio-rngproof records cannot satisfy the grant audit checks.- Reproduction:
make run-interrupt-grant,make run-devicemmio-grant,make run-dmapool-grant,make run-hardware-audit,make run-hardware-grant-cycle. Bounded manager accounting and read-side audit only; production userspace-driver authority remains open.
Device Driver Foundation – DMAPool parent-first release ordering
- Commit
29b4dde5:make run-dmapool-grantreleases the parentDMAPoolbefore the resultDMABuffer. Parent release stages a pending detach while the proof buffer is still attached; theDMABufferrelease then frees the proof page and completes the staged zero-live pool detach as the singleDmaPoolcap-op-releaseaudit.DMABufferdriver-crash and reset-disablerun-netproofs also complete any pending parent release instead of orphaning it.
2026-05-08
Device Driver Foundation – DMAPool grant + hardware audit cap
KernelCapSource::DmaPoolnow grants the bounded single-proof-buffer allocation path (commitf95c6cf8):capos-rtexposes typedDmaPoolClient::allocate_buffer_waitandDmaBufferClient::info_wait, andmake run-dmapool-grantproves the manager-attached buffer lifecycle plus the matchingdmapool/dmabufferaudit records. An earlier info-only grant variant routed throughkernel/src/cap/dmapool_grant_source.rsand reused the virtio-rngManagerGrantSourcedevice handle.KernelCapSource::HardwareAuditLogexposes a read-onlyHardwareAuditLog.snapshotcap backed by a bounded volatile drop-oldest ring inkernel/src/cap/hardware_audit.rs, so userspace observes hardware-cap audit records without parsing COM1 text. Reproduction:make run-hardware-audit.Interruptwaiter teardown trigger now routes through the stale-safe detach helper used by cap release / driver-crash / reset-disable cleanup (commitaeef8b41).make run-netasserts theinterrupt waiter hook proofline and an exact-onecap-audit: cap=interrupt event=interrupt-waitercount.- Hostile-smoke gate hardened:
tools/qemu-net-smoke.shanchors the remaining proof-line assertions in the S.11.2 hostile-smoke gate with exact-count guards + anchored suffix assertions.
Cloud boot – GCP imported-image serial boot recorded
- Cloudboot run
1778230874-715a(2026-05-08 09:06 UTC) against source3951e275:make cloudboot-testbuilt the 10 GiB GCE-compatible disk tarball, uploaded it to the staging bucket, created a temporary GCE image +e2-smallinstance, and observed thecapos kernel startingserial landmark on poll attempt 2. Serial evidence shows SeaBIOS booting from Google Persistent Disk virtio-scsi (10240 MiB), 2 vCPU / 2 GiB RAM discovery, Google RSDT/MADT tables, fail-closed IOMMU policy (no MCFG/DMAR/IVRS), masked I/O APIC routing, AP online, manifest load, init start, and shell spawn. The harness copied artifacts totarget/cloudboot-evidence/run-1778230874-715a/before deleting the temporary instance, image, and staged tarball.
2026-05-07
WASI Host Adapter Phase W.2 – C and Rust hello-wasi smokes closed
- Commit
7bfcb1d8: WASI host adapter Phase W.2 closed. Both Rust (wasm32-wasip1) and C (wasm32-wasi)hello, wasipayloads run inside the wasmi interpreter under thewasm-hostcapOS process and print through the host’s granted Console cap via the Preview 1fd_write(1, ...)surface. Closed in four sub-slices: (1) thewasm-hostuserspace binary +system-wasm-host.cue+make run-wasm-hostempty-module instantiation, carrying a one-time userspace ABI bump for wasmi’s ~3 MiB BSS; (2) the Preview 1 stdout-only import resolver incapos-wasm/src/wasi/preview1.rs(46 imports,args_get/environ_getempty,clock_time_getbacked byTimer,proc_exitviacapos_rt::syscall::exit,fd_write4 KiB iov-total + 1 KiB per-call ceiling throughConsole; everything else returnsERRNO_NOSYS = 52); (3) the Rustdemos/wasi-hello-rust/crate withsystem-wasi-hello-rust.cue+ the manifest-supplied payload reader incapos-wasm/src/payload.rs; (4) the Cdemos/wasi-hello-c/smoke built directly against system clang-18 + wasi-libc (nolibcapos/POSIX work needed – the wasm-host payload-load path from sub-slice 3 carries the C.wasmpayload unchanged). Reproduction:make run-wasi-hello-rust,make run-wasi-hello-c, andmake run-wasm-hostfor the empty-module regression. Phase W.3 (per-instance CapSet + LaunchParameters) is the next selectable phase.
2026-05-03
System Configuration Slice 3 closed
- 2026-05-03 21:54 UTC, commit
a50f610d: the System Configuration and Operator Extensibility track’s Slice 3 closed. Every owned focused-proof manifest in the inventory declares its own CUE package and importscapos.local/cue/defaults; the manifest decoder rejects unknown document-root fields with typedError::UnknownField(pinned bysystem_manifest_rejects_unknown_root_fieldandsystem_manifest_accepts_only_known_root_fields); and the operator overlay worked example covers every defaults-package extension hook (MOTD, console password verifier, additional authorized SSH keys, additional seed accounts, additional resource profiles, additional binaries, additional services), verified by a 1808-bytemanifest.bindelta when the worked-example overlay is dropped at repo root andmake manifestis rerun in package mode. Reproduction:cargo test-config(348 tests),make manifest,make run, and the per-manifestmake run-*targets named in the Slice-3 inventory table. Residual successor scope:system-measure.cuemigration is owned bydocs/backlog/scheduler-evolution.md;system-paperclips.cueandsystem-adventure.cueare demo-owned.
2026-05-02
Thread-Scale Honest Scaling Proof
-
2026-05-02 21:38 UTC, against
maincommit374f8556: the formal capOS+Linux thread-scale evidence pair was collected on the benchmark VM as the gate before Phase D. Both runs pinned to physical-core logical CPUs0,1,2,3on a 4-core/8-threadn2-highcpu-8host with KVM, five runs per case, same repaired benchmark shape (blocking parent join, 262,144 blocks / 16 MiB,work_rounds=64).Comparison capOS Linux pthread capOS gate 1->2 work 1.883x1.988x>= 1.6x1->2 total 1.787x1.987x>= 1.6x1->4 work 1.566x3.963x>= 1.6x(diagnostic)1->4 total 1.538x3.858x>= 1.6x(diagnostic)The 1->2 gates passed against the then-current single-global-queue scheduler. The 1->4 rows are the bottleneck-attribution diagnostic that justified Phase D’s fair-share enqueue policy: Linux scaled near-linearly on the same physical CPU set, so the workload shape was sound and the gap was a capOS scheduler bottleneck. Phase D later reduced the gap (see 2026-05-10 entry above for the post-Phase D result and the Bottleneck analysis + planned-architecture-changes block).
Raw artifacts:
target/thread-scale/20260502T213544Z/andtarget/linux-thread-scale/20260502T213445Z/. Reproduction:make run-thread-scale(withCAPOS_THREAD_SCALE_RUNS=5etc.) andmake run-linux-thread-scale-baseline. Host: internal benchmark VM in single GCP zone,n2-highcpu-8, nested virtualization, kernelLinux 6.17.0-1012-gcp x86_64, CPUIntel(R) Xeon(R) CPU @ 2.80GHz,qemu-system-x86_64 8.2.2,rustc 1.97.0-nightly (c935696dd 2026-04-29).
Measure Mode Repair
- 2026-05-02 20:23 UTC, commit
08c54075:make run-measureis green again. Two cumulative regressions: thethread-lifecyclemeasure-mode binary started requiringvm(VirtualMemory) andframes(FrameAllocator) caps when the park unmap/reuse smoke landed (a7af0e37,765c6c26) butsystem-measure.cuewas never updated; therun_park_process_exit_cleanuppath calledcapos_rt::syscall::exit(0)to terminate the entire process, but that syscall became per-thread in214c8e11, so the parent thread exited while the parked child kept the process alive. Repair: add the missing cap entries, bump the smoke assertion from5 capsto7 caps, retire the broken park-exit path, and route measure-mode exits through the sameexit_last_threadThreadControl flow as the spawn smoke. Closeout:make run-measureexits 0 in 32s with the fullmeasure: ...segment/scheduler/timer/lock attribution intact; all other validation gates passed.
2026-05-01
In-Process Threading Scalability
- 2026-05-01 14:58 UTC, commit
136b72de: the In-Process Threading Scalability milestone reached accepted controlled evidence only after the benchmark shape was repaired (the old 1 MiB / spinning-parent shape failed to scale even on Linux pthread at four workers). Harness defaults are now blocking parent join, 262,144 blocks (16 MiB),work_rounds=64. Controlled native-Linux evidence on a physical CPU set validated the repaired shape (1->21.991xwork /1.990xtotal; 1->43.958x/3.834x). Controlled capOS evidence on the same CPU set passed both enforced 1->2 gates with1.828x/1.687xwork/total. Unsuppressed 1->4 diagnostic recorded3.029x/2.386x; switch-log-suppressed3.272x/2.303x, showing serial scheduler switch logging materially distorts four-worker work timing. Four-core capOS scaling was not declared a closed claim – guest-measure evidence showed remaining globalSchedulerlock contention plus exit/join/block/schedule overhead in total time. This was the diagnostic stepping-stone that motivated Phase D’s WFQ run queues and the Phase F.5 architecture work listed in the 2026-05-10 entry. - Same branch tightened caller-aware child publication for the repaired blocking-parent benchmark: publication avoids the caller only when another active ready scheduler CPU has a strictly lower non-idle dispatch load; equal-load ties keep an active-ready caller CPU instead of falling through to CPU0.
Diagnostics and Scheduler Support
- 2026-05-01 07:28 UTC, commit
d8d9dab1: benchmark attribution added guest-measure phase counters, host-summary work/total speedup gates, guest PC sampling, benchmark-only userspace symbol maps, resolveduser-pc-symbols.logreports, the Linux pthread baseline, larger workload / Amdahl controls, logging-suppression A/B support, and first-slice shared-kernel lock counters for frame-allocator and ring-dispatch paths. - 2026-05-01 05:24 UTC, commit
a88e7906: scheduler support landed as incremental slices, not milestone closeout – bounded per-scheduler-CPU runnable queues, queue reservation accounting, bounded idle-to-runnable wake targeting, wake/reschedule attribution, stale runnable / direct target cleanup proofs, aSchedulerDispatchsubstate separating dispatch ownership from shared thread metadata, and per-thread runtime/virtual-runtime accounting.
2026-04-30
Multi-Process SMP Concurrency
- 2026-04-30 09:45 UTC, commit
3fb89923: Multi-Process SMP Concurrency closed. Worker elapsed reporting uses scaled user-mode cycle counts; prime-counting ranges remain contiguous while balancing upper-range cost. Accepted KVM-backed run intarget/smp-process-scale/cycle-balanced-default/: medianssmp1=1693,smp2=1053,smp4=2314, or1.608x1-to-2 speedup. Ordinaryrun-smokeandrun-spawnunder-smp 2passed.
2026-04-28 / 2026-04-29
Session-Bound Invocation Context core gates
- 2026-04-29 08:40 UTC: Session-Bound Invocation Context landed its core gates: process-session invariant, default endpoint caller-session metadata, stale normal endpoint rejection, transfer scopes, field-granular disclosure gating, session expiry for broker-issued shell bundle caps, guest bundle narrowing, chat membership keyed by opaque caller-session references, Aurelian player state keyed by live endpoint caller-session metadata, and terminal output liveness checks. Terminal/stdio bridge completion and final service-scoped reference derivation/rotation remained open.
2026-04-25
SMP Phase C – multi-CPU scheduling proof
- 2026-04-25 11:47 UTC: SMP Phase C AP scheduler-owner proof closed.
AP cpu=1 can run scheduler-owned user contexts while the BSP stays in
kernel idle behind a one-way scheduler-owner latch (review-fix commit
d88bca7). Per-CPUKernelGsBase+swapgs, PIT-calibrated xAPIC LAPIC timer/IPI, resident-mask TLB shootdown (vector 49), and split scheduler current-thread tracking landed across the day on thesmp-phase-c-*branches (swapgs,lapic-ipi,tlb-shootdown,scheduler-ownership). Per-CPU run queues, reschedule IPIs, concurrent scheduler-owned work on more than one CPU, and per-thread rings (Ring v2) remained Phase C follow-ups.
Telnet Shell Demo
- 2026-04-25 20:25 UTC, reviewed merge
2834bfc: Telnet Shell Demo closed. Addstelnet-gateway,system-telnet.cue,make run-telnet,make qemu-telnet-harness, proving QEMU host-local forwarding from127.0.0.1:2323to guest port 23, password login,caps,session, and clean exit through a socket-backedTerminalSession. The child shell transcript proves no rawNetworkManager,ProcessSpawner, TCP, or unknown capability interfaces. Scoped gateway authority follow-up remains open.
SMP Phase B – APs running
- 2026-04-25 06:59 UTC: SMP Phase B closed. AP startup uses Limine
MpRequest/MpInfo::bootstrap, stable AP records, AP-owned kernel/IST stacks, AP-local GDT/TSS state, capOS kernel PML4 handoff, AP-owned kernel RSP handoff, shared IDT,KernelGsBase, syscall MSRs, SMEP/SMAP state, and a parked interrupt-disabledhltloop.
SMP Phase A and user-buffer protection
- 2026-04-25 05:36 UTC: SMP Phase A closed. The BSP has a concrete
PerCpufor syscall-stack state and current-thread mirroring; kernel-entry stack updates flow through one per-CPU hook. - 2026-04-25 04:00 UTC:
workplan/user-buffer-validation-protectionclosed the private process-buffervalidate_user_bufferTOCTOU finding.AddressSpacenow owns validation plus HHDM-backed user copy/read helpers under the process address-space mutex. - 2026-04-25 03:36 UTC: final review of
workplan/futexspace-private-wait-wakefixed a park-ownership bug where a sibling thread could drain a park SQE and park the wrongThreadRef. Fix requiresCAP_SQE_THREAD_OWNEDplus the owning thread id forCAP_OP_PARK.
2026-04-24
- 2026-04-24 22:41 UTC: in-process threading design freeze closed. Thread/process ownership and park authority contracts frozen; review findings fixed before merge.
- 2026-04-24 20:53 UTC: runtime prerequisites for threading and Go
closed, with follow-up rejecting writable-executable user mappings in
anonymous
VirtualMemory/MemoryObjectpaths and QEMU smoke coverage. - 2026-04-24 16:45 UTC: kernel networking smoke closed – QEMU virtio-net path proves modern transport discovery, virtqueue setup, descriptor completion, ARP, ICMP echo, smoltcp handoff, static IPv4, and host-backed TCP HTTP GET.
- 2026-04-24 13:11 UTC: custom userspace target closed. Userspace
artifacts build through
targets/x86_64-unknown-capos.json; kernel stays onx86_64-unknown-none. - 2026-04-24 11:25 UTC: boot-manifest parser scope tightened by
KernelBootstrapManifest, which decodes only kernel-owned fields and avoids materializing the init-owned service graph. Boot package boundary cleanup followed (10:53 UTC). - 2026-04-24 03:06 UTC: Ring-as-Black-Box closed. QEMU debug-tap builds
export bounded metadata-only ring records;
tools/ringtap-viewer/renders correlated SQE/CQE evidence offline. - 2026-04-24 02:16 UTC: shared service harness extraction closed for non-speculative duplicated demo-service pieces.
- 2026-04-24 00:34 UTC: dependency policy gate restored by allowing
BSD-3-Clausefor the current Argon2 closure with rationale indocs/trusted-build-inputs.md.
2026-04-23
- 2026-04-23 22:05 UTC: Verified Core closed.
make kani-libruns the bounded local/GitHub Kani model-checking gate;make kani-lib-fulladds the high-memory transfer model-checking gate. These are bounded model checks (small input sizes such as <=8 frames and 63 ELF bytes), proving the harnessed invariants within those bounds rather than for all inputs. The companion Loom check (cargo test-ring-loom) exercises a bounded concurrency model of the ring protocol, not the shippedkernel/src/cap/ring.rs. - 2026-04-23 21:30 EEST: boot-to-shell milestone closed. Default
make runreaches setup/login, volatile credential creation, password-authenticated session minting, broker-issued shell bundles, redacted auth/session audit records, and an interactive native shell REPL over serial.capos-shellis init on shell-led manifests; anonymous shell starts on boot;login/setupmint authenticated operator bundles. - 2026-04-23 16:34 UTC: split UART shell session closed.
make runpresents login/native shell on terminal UART while kernel/debug output goes totarget/qemu-console.log. The revocable read milestone closed in the same window –make run-revocable-readproves a parent can revoke a child-localBootPackagegrant throughCapabilityManager.
2026-04-20 To 2026-04-22
- 2026-04-22 23:50 UTC: AP-independent review remediation closed issues around endpoint owner cleanup, ProcessSpawner badge attenuation, ProcessSpawner heap-OOM paths, queued release semantics, pinned CUE toolchain enforcement, stale authority docs, spawn hardening stability, NMI IST coverage, MemoryObject replacement for raw frame grants, generated capnp ownership, and manifest validation modularization.
- 2026-04-21 22:21 EEST: VirtualMemory quota review finding resolved with per-address-space ownership tracking, holder quota, bounded auto-placement probes, owned-range checks, and QEMU coverage.
- 2026-04-21 18:46 EEST: smoke demos moved into a nested
demos/userspace workspace;system.cuepackages each demo as a distinct release-built binary/service. - 2026-04-21 16:56 UTC: cross-process
CAP_OP_CALL/RECV/RETURNflow closed. Allocation-free synchronous ring dispatch (12:28 UTC), SMEP/SMAP,cap_enterblocking waits with timeout, Endpoint, and RECV/RETURN routing (01:15 UTC) closed in the same window. - 2026-04-20 23:05 EEST: Phase 0 and Phase 1 cleanup closed – dead-code
cleanup, ELF validation hardening, deterministic error paths,
corrupted ring recovery policy,
capos-libsplit, host tests.
2026-04-05
- 2026-04-05 16:39 EEST: Stage 4 and Stage 5 direction took shape.
Capability invocation moved from direct calls toward the
shared-memory ring and
cap_enter; preemptive scheduling was documented after the PIT/context-switch scheduler landed; stalecap_callproposal text was replaced with the ring-based model. - 2026-04-05 17:08 EEST: planning surface matured. The roadmap moved
out of
README.md, review findings split into their own log, and the project added userspace-binaries, SMP, Go runtime, cloud deployment, storage/naming, service architecture, GPU, error-handling, and persistence planning. - 2026-04-05 10:35 EEST: design grounding expanded through prior-art research on seL4, Zircon, Plan 9/Inferno, EROS/CapROS/Coyotos, Genode, and LLVM target customization. That research fed the interface-as-permission decision.
- 2026-04-05 02:17 EEST: manifest/config and init-side planning advanced with a no-std manifest config loader, hardening tests, and early init-side manifest parsing demos.
2026-04-04
- 2026-04-04 21:02 EEST: capOS bootstrapped as a Limine-loaded Rust kernel with serial output, then gained the first Cap’n Proto capability invocation path and a staged implementation roadmap.
- 2026-04-04 23:12 EEST: Stage 1 through Stage 3 landed in rapid succession – virtual memory with kernel remapping and isolated process address spaces; Ring 3 user-space transition through GDT/TSS/syscall setup; and process abstraction with ELF loading, per-process address spaces/cap tables, static init, and QEMU auto-exit proof.
- 2026-04-04 23:57 EEST: the first major design proposals appeared, including userspace TCP/IP networking and capability-based service architecture. Networking was split into its own proposal after review.
Build, Boot, and Test
The commands below are the current local workflow for the x86_64 QEMU target.
The root Cargo configuration defaults to x86_64-unknown-none, so host tests
must use the repo aliases instead of bare cargo test.
Prerequisites
Expected host tools:
- Rust nightly from
rust-toolchain.toml makeqemu-system-x86_64xorrisocurl,sha256sum, and standard build tools for pinned tool downloads- Go, used by the Makefile to install the pinned CUE compiler when needed
- A Telnet client for the optional focused loopback shell demo
- Chromium, Chromium Browser, or Google Chrome for the optional remote-session CapSet browser UI automation
- Optional policy and proof tools for extended checks:
cargo-deny,cargo-audit,cargo-fuzz,cargo-miri, andcargo-kani
The Makefile pins and verifies:
- Limine at the commit recorded in
Makefile - Cap’n Proto compiler version
1.2.0 - CUE version
0.16.0
Pinned repo-selected tools are installed under CAPOS_TOOLS_ROOT, which
defaults to the per-user $HOME/.capos-tools cache. Override
CAPOS_TOOLS_ROOT=/path/to/cache when a host needs a different cache
location.
Build the ISO
Use the default target when you need the current bootable capOS image.
make
This builds:
- the kernel with the default bare-metal target;
- the standalone
inituserspace binary used by focused spawn proofs; - release-built demo service binaries under
demos/; - the
capos-rtuserspace binaries, including the shell proof; manifest.binfromsystem.cue;capos.isowith Limine boot files.
Relevant files: Makefile, limine.conf, system.cue, tools/mkmanifest/.
Compare Build Provenance
Use make build-provenance to write the local build record at
target/build-provenance.txt. To compare two retained records locally:
make build-provenance-compare \
BASE_PROVENANCE=path/to/base-build-provenance.txt \
CANDIDATE_PROVENANCE=path/to/candidate-build-provenance.txt
The comparison ignores the generated timestamp and allowed local path-root
movement under worktree target/ directories or .capos-tools/ caches. It
fails for material provenance drift, including source commit changes, manifest
or artifact hash changes, embedded binary hash changes, OVMF identity/hash
changes, Rust compiler date/commit changes, host-tool version or package
identity changes, and operating-system identity changes.
For PR base-vs-head CI comparison, use the environment policy:
make build-provenance-compare \
BUILD_PROVENANCE_COMPARE_POLICY=ci-environment \
BASE_PROVENANCE=path/to/base-build-provenance.txt \
CANDIDATE_PROVENANCE=path/to/candidate-build-provenance.txt
That mode allows expected source and artifact hash changes while still failing for runner, GitHub-hosted image, Rust, selected-tool, package-identity, OVMF selection, and OVMF hash drift. It is a PR environment gate, not a production reproducibility claim.
Local synthetic comparison checks may create scratch records under
target/provenance-fixtures/ or Python bytecode caches under tools/. Clean
those scratch artifacts with:
make build-provenance-compare-clean
Boot QEMU
Use the default run targets to boot either the operator-facing system or the scripted login-path smoke.
make run
make run-smoke
make run is the operator-facing boot path. It builds the ISO with the qemu
feature, boots QEMU with the interactive terminal UART on stdio, attaches
virtio-net with host-local remote CapSet forwarding, and writes the separate
kernel/debug UART log to target/qemu-console.log. The run output prints the
actual forwarded port as remote CapSet: tcp 127.0.0.1 <port> -> guest :2327.
The plaintext loopback Telnet research demo was a Phase B fixture, not part of
the default operator path. make run-telnet and make run-telnet-vm are now
retired because they depended on the removed qemu-only kernel TCP listener. Use
the SSH gateway smokes for current remote-shell coverage and rebuild any
socket-backed terminal proof on the Phase C userspace network stack before using
it as validation.
The same make run boot starts the remote-session CapSet gateway. To run the
host CLI against it, use the printed port:
cargo run --manifest-path tools/remote-session-client/Cargo.toml \
--target x86_64-unknown-linux-gnu \
--bin remote-session-client -- --host 127.0.0.1 --port <port>
Add --launch-adventure to that command when you want the CLI to start the
default-manifest Adventure service graph through serviceLaunch and require a
running status.
To run the trusted local web bridge against the same QEMU instance:
CAPOS_REMOTE_SESSION_PORT=<port> make remote-session-ui
Then open http://127.0.0.1:3337/. The Rust bridge holds the TCP stream,
remote session state, and backend-held remote CapSet; the browser receives
only view models, launch/status descriptors, denial diagnostics, call results,
and redacted transcript rows. The former automated focused proof,
make run-remote-session-capset-ui, is retired because it depended on the
removed qemu-only kernel TCP listener. The replacement browser proof belongs to
the future Phase C Web UI L4 gate.
A Tauri desktop wrapper is available as a repo-local check/dev layer over the
same Rust backend. The repo-local make remote-session-tauri target first
runs a policy preflight over the reviewed scaffold, then checks for the Tauri
CLI and Linux build prerequisites, reports dependency and scaffold status, and
runs a deterministic wrapper check when those prerequisites are present. It
follows the official Tauri v2 Linux prerequisite shape, including WebKitGTK
4.1, libxdo, OpenSSL, AppIndicator, and Rsvg development packages where
applicable. Missing dependencies fail with explicit diagnostics and point back
to the supported local web bridge. The operator command shape is:
CAPOS_REMOTE_SESSION_PORT=<port> make remote-session-tauri
Set CAPOS_REMOTE_SESSION_TAURI_MODE=dev to launch cargo tauri dev.
CAPOS_REMOTE_SESSION_TAURI_MODE=policy tools/remote-session-tauri.sh runs
only the scaffold guardrail and does not require Tauri system packages or a
desktop session. CAPOS_REMOTE_SESSION_TAURI_MODE=package and
CAPOS_REMOTE_SESSION_TAURI_MODE=automation are intentionally blocked with
diagnostics describing the remaining packaging and desktop-automation review
work. This policy preflight proves only that the current wrapper remains a
check/dev scaffold with packaging disabled, the loopback URL pinned, a single
main window, default core:default permission scope, and no app-specific
Tauri command/plugin authority. It is not a distributable packaging or desktop
automation proof. make remote-session-ui remains the supported fallback host
UI path and uses the same backend-held authority boundary.
Default make run starts chat, the remote-session gateway, and shell
services. It embeds Adventure server/NPC/client binaries and the terminal
Paperclips binary. The current remote-session Adventure slice makes
serviceLaunch a real restricted backend launch in that default manifest:
the trusted backend/gateway starts adventure-server plus simple NPC
companions through an approved service-runner profile and attaches or retains
backend-held descriptors/caps for the Adventure/chat-facing services. Run it
by starting make run, noting the printed remote CapSet forwarding port, and
then using either the host CLI or
CAPOS_REMOTE_SESSION_PORT=<printed-port> make remote-session-ui.
make run-paperclips remains the focused authoritative Paperclips server
proof; default-manifest Paperclips launch is not implemented by this slice.
Raw ProcessSpawner, process owner handles, endpoint owner caps, local cap
ids, result-cap slots, and browser-held capOS caps are non-goals for this UI
path. Process handles stay backend-local.
GCE Web UI Proof Target Map
Use the selected-milestone proof targets below to choose the narrowest evidence gate for the GCE self-hosted Web UI ladder. Local QEMU/cloudboot targets do not prove live provider reachability, and private GCE targets do not authorize public ingress or TLS exposure.
| Proof class | Target or command shape | Proves | Closest non-goal |
|---|---|---|---|
| Landed local Phase C L4 substrate | make run-cloud-prod-userspace-network-stack-smoltcp | A non-qemu cloudboot kernel under QEMU starts the userspace smoltcp network-stack process and completes one hostfwd TCP request/response through a userspace-served TcpListenAuthority. See cloud-prod-userspace-network-stack-smoltcp-local-proof. | Does not prove DHCP/IPv4 configuration, remote-session-web-ui, live GCE reachability, or public ingress. |
| Landed local IPv4 configuration | make run-cloud-prod-network-stack-dhcp-ipv4-config | The userspace network-stack process acquires the QEMU SLIRP DHCPv4 lease, serves NetworkManager.getConfig, installs the default route, and resolves gateway plus same-subnet ARP neighbors. See cloud-prod-network-stack-dhcp-ipv4-config-local-proof. | Does not prove a Web UI listener bound through that route, live GCE reachability, DNS, TLS, or public exposure. |
| Retired legacy local self-served Web UI | make run-remote-session-self-served-web-ui | Pre-Phase-C proof that served the immutable full UI bundle from a focused QEMU manifest through the kernel tcp_listen_authority socket owner. The target is not current production L4 evidence after cloud-prod-phase-c-kernel-smoltcp-virtio-net-removal retires that kernel owner. | Does not prove the non-qemu cloudboot Phase C L4 path, and should not be used as a passing selected-milestone gate until rebuilt on the userspace network-stack substrate. |
| Landed cloudboot Web UI authority inventory | No run target; docs-status contract | The Gate 1B inventory records the required and forbidden remote-session-web-ui grants, trusted listener/source metadata, browser-visible forbidden markers, and expected local L4 proof markers. See remote-session-webui-cloudboot-authority-inventory. | Does not prove runtime listening, browser automation, GCE reachability, or public operator access. |
| Landed local cloudboot Web UI L4 proof | make run-cloud-prod-remote-session-web-ui-l4 owned by cloud-prod-remote-session-web-ui-l4-local-proof | Proves remote-session-web-ui listens on guest port 8080 through the Phase C L4 path on the non-qemu cloudboot kernel under QEMU: the userspace network-stack process serves the scoped TcpListenAuthority, the Web UI serves the full fixed-name bundle, login, one backend-held capability call, logout, stale-call failure, the manual viewer, and a cloudboot-evidence: remote-session-web-ui-l4 marker. | Local cloudboot/QEMU evidence only; it does not prove live GCE NIC reachability, private provider probing, public ingress, TLS, or production release authority. |
| On-hold private GCE Web UI proof | Future tools/cloudboot/run-test.sh --require-web-ui-proof gate owned by cloud-gce-private-self-hosted-webui-proof | Must launch the self-hosted Web UI cloudboot image in the no-public-IP GCE posture, use a reviewed private probe that crosses the live GCE virtual network boundary, record the private endpoint and Web UI/L4 markers, and tear down all created resources. | On hold (2026-06-09): the cloudtest credential lacks the firewall IAM a private same-VPC probe needs against GCE default-deny ingress, and the live legacy virtio 0.9 GCE NIC has no reviewed userspace-stack serving story. It will not create public IPs, public firewall rules, DNS, TLS certificates, or browser-facing public operator ingress. |
| On-hold public ingress/TLS proof | Future tools/cloudboot/run-test.sh --require-public-web-ui-proof gate owned by cloud-gce-public-self-hosted-webui-ingress-tls | After explicit authorization, must prove the selected GCE external HTTPS load-balancer ingress posture, Google-managed certificate termination, browser-session hardening, and teardown evidence. | On hold. No local target, private proof, or selected milestone status grants public exposure, broad firewall changes, certificate issuance, TLS key custody, or release authority. |
make run-smoke preserves the focused legacy shell-led system-smoke.cue
verification path. It drives the login and shell session through the terminal
UART, captures the kernel log and terminal transcript separately, and checks
that the kernel boot-launched only the first init service (capos-shell),
granted only the shell bootstrap cap bundle, and then reached the expected
audit, shell-bundle, child-isolation, stale-handle, and no-password-echo
assertions before QEMU exits. This is distinct from the default system.cue
path, where the kernel boot-launches standalone init and init starts
operator-facing services.
Spawn Smoke
Use the spawn smoke when changes affect manifest-owned process creation, ProcessSpawner behavior, or bootstrap capability wiring.
make run-spawn
This boots with system-spawn.cue, the focused init-owned manifest retained
for ProcessSpawner checks. Only init is boot-launched by the kernel; init
uses ProcessSpawner to launch endpoint, IPC, VirtualMemory, Timer,
ThreadControl, the single-thread runtime checkpoint, FrameAllocator cleanup,
and hostile spawn demo children, wait for ProcessHandles, and exercise hostile
spawn inputs. The target captures the kernel log separately and runs
tools/qemu-spawn-smoke.sh to assert the single-init boot markers,
BootPackage validation, child spawn/exit records, Timer now/sleep and
per-process sleep quota proof lines, runtime FS-base proof lines, the
single-thread runtime map/protect/unmap plus park-fallback checkpoint,
manifest child waits, and clean halt.
Shell and Terminal Smokes
Use these focused QEMU smokes for shell, terminal, credential, and login paths.
make run-shell
make run-terminal
make run-credential
make run-login
make run-login-setup
make run-shellboots the focusedsystem-shell.cuemanifest (no pre-provisioned verifier) and exercises the shell entirely in its anonymous session: CapSet listing, typed capability inspection, typed application-error display, anonymous-session metadata, the anonymous launcher rejectingspawn-testbecause its allowlist is empty, and clean exit.make run-terminalboots the focusedsystem-terminal.cuemanifest and exercises theTerminalSessionsubstrate: visible and hidden echo input, boundedreadLine, structured cancellation, and stale-input scrubbing between prompts.make run-credentialboots the focused CredentialStore proof manifest.make run-loginboots the focused password-login manifest and proves the shell’slogincommand prompting forusername>before hiddenpassword>, failing generically on a wrong password, succeeding for the demo account, swapping from the anonymous bundle to the operator bundle, and performing exact-grant child launch plus stale-handle release.make run-login-setupboots the no-password first-boot setup manifest and proves thatsetupcreates a volatile credential, discloses that volatility, chains into the login upgrade path, and reaches the same narrow operator shell bundle.
Durable account storage and multi-verifier local accounts are still future
work; the current username-aware login path selects the manifest-seeded
operator-kind account and any volatile first-boot credential record that
setup creates.
Focused Service Smokes
Use these targets to prove resident services and demo clients still launch through the intended shell-granted authorities.
make run-chat
make run-adventure
make run-paperclips
make run-revocable-read
make run-memoryobject-shared
make run-ringtap-failing-call
make run-chatboots the focused First Chat manifest and proves a shell-spawned client can send a line through the resident singleton chat service using the broker-issued operatorchatendpoint and observe the resident bot reply.make run-adventureboots the focused adventure manifest and proves the shell-spawned client can drive the current scripted mission through explicitStdIO,adventure, andchatendpoint grants.make run-paperclipsboots the focused Paperclips terminal demo manifest, authenticates the shell, starts Paperclips server services, first launches the clean-room terminal client with explicitStdIOplus the normalPaperclipsGameendpoint, proves normal server authority cannot invokerun <ms>, rejects a forgedproof_accelerator: @timergrant, then relaunches against the proof server endpoint with the explicitproof_acceleratorproof authority for the accelerated transcript. The server owns generated content, game state, regular timer cadence, unlock checks, and game-rule mutation, and server-mode client help is rendered from structured server command specs. That transcript rejects an early locked autoclipper purchase, rejects an over-budget wire purchase, rejects bulk manual production, rejects a high-price sale with zero current demand, rejects manual production after automation drains wire, drives one-at-a-time manual production, explicit sales, repeatable marketing, autoclipper unlock, real-time automation, generated typed Cap’n Proto content loading, scaled business-phase production,precision-rollers,design-search,forecast-engine, thesurvey-dronestransition to== autonomous phase ==, representative autonomous drone/factory scaling with local-matter conversion and additional clip production, themesh-coordinationandseed-probescosmic transition, bounded probe replication and production, lockedfinal-conversion, and clean client/shell exit.make run-revocable-readexercises the revocation transcript for endpoint and boot-package authority loss.make run-memoryobject-sharedproves MemoryObject-backed parent/child sharing and cleanup.make run-ringtap-failing-callenablesdebug_tap, drives a known typed launcher failure, and runs the ringtap viewer over the captured log.
Networking and Measurement Targets
Use these targets for the current network proof path and benchmark-only measurement image.
make run-net
make qemu-net-harness
make run-measure
make run-netattaches a QEMU virtio-net PCI device and exercises current PCI enumeration, virtio transport setup, and TX descriptor completion diagnostics, plus ARP resolution and ICMP echo validation against the QEMU user-mode gateway.make qemu-net-harnessruns the scripted net smoke path.make run-measureenables the separatemeasurefeature for benchmark-only counters and cycle measurements. It bootssystem-measure.cue, where init spawnsring-nopand grants the measurement-only NullCap and ParkBench caps through ProcessSpawner. The demo prints ring/NullCap baselines plus a park-shaped comparison between compact authority-checked SQEs and generic Cap’n Proto methods. The kernel summary includes per-segment dispatch counts, total cycles, and averages for SQE processing, validation, cap lookup, capnp decode, method body dispatch, CQE posting, and waiter wake/check. Do not treat it as the normal dispatch build.
Formatting and Generated Code
Use these local checks before claiming source formatting or generated artifacts are current.
make fmt
make fmt-check
make generated-code-check
make fmtformats the kernel workspace plus standaloneinit,demos, andcapos-rtcrates.make fmt-checkverifies formatting without modifying files.make generated-code-checkverifies checked-in Cap’n Proto generated code against the repo-pinned compiler path and checks generated adventure plus Paperclips content against their CUE sources.
Host Tests
Use these host-side checks for shared logic and userspace build surfaces that do not require a QEMU boot.
cargo test-config
cargo test-ring-loom
cargo test-lib
cargo test-mkmanifest
tools/check-userspace-runtime-surface.sh
make capos-rt-check
make init-capos-build
make demos-capos-build
make shell-capos-build
make capos-rt-capos-build
cargo test-configruns shared config, manifest, ring, and CapSet tests on the host target.cargo test-ring-loomruns the bounded Loom model for SQ/CQ protocol invariants.cargo test-libruns host tests for pure shared logic such as ELF parsing, capability tables, frame allocation, and related property tests.cargo test-mkmanifestruns host tests for manifest generation.tools/check-userspace-runtime-surface.shverifiescapos-rtowns the userspace entry, panic, allocator, and raw syscall surface.make capos-rt-checkbuilds the standalone runtime smoke binary againsttargets/x86_64-unknown-capos.json, matching the userspace target used by the boot image.make init-capos-build,make demos-capos-build,make shell-capos-build, andmake capos-rt-capos-buildexpose focused custom-target build wrappers for the booted userspace crates and runtime smoke binary.
Extended Verification
Use the extended verification set for shared logic, dependency policy, fuzz targets, and bounded proof gates that are heavier than the normal host-test loop.
make dependency-policy-check
make fuzz-build
make fuzz-smoke
make kani-lib
cargo miri-lib
These require optional tools. Use them when changing dependency policy,
manifest parsing, ELF parsing, capability-table/frame logic, or proof-covered
shared code. make dependency-policy-check covers Rust deny/audit checks and
the docs Node lockfile/audit gate with npm lifecycle scripts disabled. See the
Security and Verification Proposal
for the rationale behind the extended verification tiers. make kani-lib
runs the bounded mandatory cap-table/frame gate.
Validation Rule
For behavior changes, a clean build is not enough. The relevant QEMU process
must exercise the behavior and print observable output that proves the path
works. make run-smoke is the default login-path gate; make run-spawn,
make run-shell, make run-terminal, make run-credential,
make run-login, make run-login-setup, make run-chat,
make run-adventure, make run-paperclips, make run-revocable-read,
make run-memoryobject-shared, make run-net, make qemu-net-harness,
make run-ringtap-failing-call, or make run-measure are additional gates
for their specific features.
Benchmarks
capOS benchmark rows are evidence records. Each row should say what workload ran, what was verified, how time was measured, what machine envelope was used, and where the raw artifacts were stored. A faster row whose verifier did not complete is not a performance result.
The broader benchmark model is in System Performance Benchmarks. Future parallel-pattern coverage is in HPC Parallel Processing Patterns.
Current CPU Workloads
capOS currently has two CPU-scaling workloads:
| Workload | Target | Timed region | Verifier | Primary use |
|---|---|---|---|---|
run-smp-process-scale | Independent worker processes | worker compute only, after setup and before result reporting | aggregate prime count and checksum | Exercises multiple process-owned rings running CPU work on more than one scheduler CPU. |
run-thread-scale | Sibling threads in one process | checksum work window, separate from spawn/join/shutdown totals | deterministic root checksum and metadata checks | Measures same-process thread scheduling, per-thread rings, and scheduler overhead. |
Both workloads keep serial and harness artifacts under target/. The capOS
rows below were collected under QEMU/KVM. The matching Linux rows use the same
workload shape where possible, but units differ by harness and should not be
compared directly across systems. Compare speedup ratios within a row.
Process-Scale SMP
make run-smp-process-scale boots a focused manifest, runs independent worker
processes, and times the CPU-bound worker window. Each worker owns its own
process ring. The timed section avoids syscalls and serial output; the
coordinator verifies the aggregate result after workers finish.
The current workload counts primes over 2..3_000_000 using balanced
contiguous splits. capOS reports a worker-side user-mode cycle counter shifted
right by 20 bits. Linux reports guest clock_gettime nanoseconds.
Controlled benchmark-VM reruns were recorded on GCE n2-highcpu-8 at capOS
commit 0d89a91b (2026-04-30 11:09 UTC) with nested QEMU/KVM on Ubuntu
6.17.0-1012-gcp, QEMU 8.2.2, Rust nightly 1.97.0-nightly
(c935696dd 2026-04-29), and host logical CPUs 0,1,2,3 mapped to distinct
physical cores with SMT siblings 4,5,6,7.
| System | smp1 median | smp2 median | smp4 median | 1-to-2 speedup | 1-to-4 speedup |
|---|---|---|---|---|---|
| capOS | 1,639 scaled cycles | 875 scaled cycles | 1,111 scaled cycles | 1.873x | 1.475x |
| Linux | 1,275,187,210 ns | 659,218,025 ns | 337,877,986 ns | 1.934x | 3.774x |
The capOS 4-vCPU row improved over the 1-vCPU row but was slower than the
2-vCPU row. Linux continued improving through 4 vCPUs under the same pinning
and workload. Raw capOS artifacts are under
target/smp-process-scale/pinned-20260430T1113Z/; raw Linux artifacts are
under target/linux-smp-process-scale/pinned-20260430T1118Z/.
SMT Run
The same harness can run an eight-logical-CPU case on the benchmark VM. That
machine exposes four physical cores and eight SMT threads, so the smp8-smt
row is an SMT measurement on a 4-core host.
The SMT run was recorded at commit 7c15dd47
(2026-04-30 11:45 UTC) with QEMU pinned to logical CPUs
0,1,2,3,4,5,6,7.
| System | smp1 median | smp2 median | smp4 median | smp8-smt median |
|---|---|---|---|---|
| capOS | 1,500 scaled cycles | 787 scaled cycles | 1,052 scaled cycles | 1,595 scaled cycles |
| Linux | 1,274,507,854 ns | 647,611,418 ns | 337,479,795 ns | 198,903,231 ns |
| System | 1-to-2 speedup | 1-to-4 speedup | 1-to-8 speedup |
|---|---|---|---|
| capOS | 1.906x | 1.426x | 0.940x |
| Linux | 1.968x | 3.777x | 6.408x |
Raw capOS SMT artifacts are under target/smp-process-scale/smt8-20260430T1148Z/.
Raw Linux SMT artifacts are under
target/linux-smp-process-scale/smt8-20260430T1151Z/.
In-Process Thread Scaling
make run-thread-scale runs sibling threads inside one process. Child threads
use per-thread rings. The workload computes fixed-size checksum blocks; the
default shape is a blocking parent join, 262,144 blocks (16 MiB), and
work_rounds=64.
The harness records both a work-window time and a total time. The work window brackets the checksum computation. Total time includes thread startup, synchronization, shutdown, and join overhead. For scheduler analysis, both numbers matter: work speedup shows CPU placement and dispatch during the syscall-free section, while total speedup shows the cost of the surrounding thread lifecycle.
The old 1 MiB workload with a spinning parent is historical only because the matching Linux pthread baseline also stayed flat at four workers. The current rows use the repaired 16 MiB blocking-parent shape unless noted.
Recorded evidence:
| System / mode | Placement | Runs | 1-to-2 work | 1-to-2 total | 1-to-4 work | 1-to-4 total | Notes |
|---|---|---|---|---|---|---|---|
| Linux pthread baseline (benchmark VM, 2026-05-10 19:46 UTC) | physical-core logical CPUs 0,1,2,3 | 5 | 1.996x | 1.995x | 3.974x | 3.850x | Same checksum workload and pin set as the 2026-05-10 capOS row. |
| capOS (Phase D WFQ, benchmark VM, 2026-05-10 19:32 UTC) | physical-core logical CPUs 0,1,2,3 | 5 | 1.809x | 1.774x | 3.088x | 2.700x | Per-thread weights/latency classes, per-CPU WFQ queues, bounded steal path. |
| Linux pthread baseline (benchmark VM, 2026-05-02 21:34 UTC) | physical-core logical CPUs 0,1,2,3 | 5 | 1.988x | 1.987x | 3.963x | 3.858x | Same repaired workload before Phase D. |
| capOS (single global queue, benchmark VM, 2026-05-02 21:35 UTC) | physical-core logical CPUs 0,1,2,3 | 5 | 1.883x | 1.787x | 1.566x | 1.538x | Shows the four-worker cost of the single global runnable queue. |
| Linux pthread baseline (2026-05-01 report) | physical-core logical CPUs | 5 | 1.991x | 1.990x | 3.958x | 3.834x | Repaired-shape baseline recorded in docs/changelog.md; target artifact directory is not named in the source record. |
| capOS (pre-collapse placement, 2026-05-01 report) | physical-core logical CPUs | 5 | 1.828x | 1.687x | 3.029x | 2.386x | Commit 136b72de; per-CPU placement model later replaced by the queue-collapse cleanup; target artifact directory is not named in the source record. |
| capOS, switch logs suppressed (pre-collapse, 2026-05-01 report) | physical-core logical CPUs | 5 | 1.913x | 1.636x | 3.272x | 2.303x | Same commit and model with scheduler switch logs suppressed; target artifact directory is not named in the source record. |
| capOS (post-collapse, single global queue, 2026-05-02 10:42 UTC) | physical-core logical CPUs 0,1,2,3 on the benchmark VM | 3 | 1.890x | 1.792x | 1.504x | 1.436x | Queue-collapse row recorded in docs/backlog/scheduler-evolution.md; target artifact directory is not named in the source record. |
The 2026-05-10 Phase D WFQ row uses the same repaired shape as the 2026-05-02
pair: blocking parent join, 262,144 blocks, work_rounds=64, five runs,
KVM-backed QEMU pinned to physical-core logical CPUs 0,1,2,3, and a matching
Linux pthread baseline on the same pin set. Raw capOS artifacts are under
target/thread-scale/20260510T193200Z/; raw Linux artifacts are under
target/linux-thread-scale/20260510T194600Z/.
The 2026-05-02 capOS/Linux pair used main commit 374f8556; raw capOS
artifacts are under target/thread-scale/20260502T213544Z/, and raw Linux
artifacts are under target/linux-thread-scale/20260502T213445Z/.
The row improved the four-worker work window from 1.566x to 3.088x and
the four-worker total window from 1.538x to 2.700x compared with the
single-global-queue row. Linux on the same host and pin set recorded
3.974x work and 3.850x total at four workers. The remaining difference is
the scheduler/runtime optimization target for later work.
Guest-side attribution is available with
CAPOS_THREAD_SCALE_GUEST_MEASURE=1. It emits aggregate and per-phase
measurements for spawn_ready, work, shutdown, and final_total,
including scheduler choice, lock, timer, TLB, serial, shared-kernel-lock,
network-poll, thread-placement, and sampled user-PC buckets. Host-side QEMU
profiling is available with CAPOS_THREAD_SCALE_PROFILE=1.
Interpreting CPU Counts
CPU-count rows are meaningful only with a recorded topology:
- Physical-core rows require enough physical cores for the vCPU count.
- SMT rows should say they are SMT rows and list the logical CPU set.
- Pinning QEMU with
tasksetis useful, but it is not CPU isolation by itself. Stronger runs should recordisolcpus/nohz_full/rcu_nocbs, cpuset, or systemd affinity policy when used. - Pinning QEMU to fewer host logical CPUs than guest vCPUs measures oversubscription behavior, not core scaling.
- Current QEMU/KVM results should stay separate from future direct cloud or bare-metal runs.
The current capOS benchmark table reaches four physical-core rows and an eight-logical-CPU SMT row on a 4-core/8-thread VM. It does not yet measure 16-core or 32-core systems.
Next CPU-Scaling Work
The next CPU-scaling milestone should be designed around direct hardware or a dedicated perf runner rather than nested QEMU as the primary evidence source. The benchmark suite needs:
- hardware discovery records for socket/core/SMT topology, APIC mode, timer source, frequency policy, memory size, and firmware/device model;
- workload rows at 1, 2, 4, 8, 16, and 32 workers where the machine has enough physical cores, plus separately labeled SMT rows;
- at least one static map/reduce checksum workload, one uneven dynamic-task workload, one barrier-heavy phase loop, and one IPC/service-bound workload;
- work-window and total-time reporting for every workload;
- matching Linux native baselines on the same hardware where a comparable workload exists;
- scheduler/runtime counters for queue depth, migrations, steals, reschedule IPIs, TLB shootdowns, timer ticks, lock wait/hold time, blocked time, and runnable but not running time;
- raw artifacts with source commit, toolchain, kernel config, host topology, run count, warmup policy, and verifier output.
QEMU should remain useful for boot and regression coverage, but it should not be the primary source for a 16/32-core SMP scalability milestone.
Commands
Run the capOS process-scale workload:
make run-smp-process-scale
Run the process-scale workload with QEMU pinned to selected host CPUs:
CAPOS_SMP_SCALE_QEMU_TASKSET_CPUS=0,1 make run-smp-process-scale
Run the process-scale SMT row on a host with at least eight logical CPUs:
CAPOS_SMP_SCALE_INCLUDE_SMT=1 \
CAPOS_SMP_SCALE_QEMU_TASKSET_CPUS=0,1,2,3,4,5,6,7 \
make run-smp-process-scale
Run the thread-scale workload:
CAPOS_THREAD_SCALE_RUNS=5 \
CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
make run-thread-scale
Run the larger-workload Amdahl row:
CAPOS_THREAD_SCALE_RUNS=5 \
CAPOS_THREAD_SCALE_TOTAL_BLOCKS=1048576 \
CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
make run-thread-scale
Run a one-sample host-side QEMU profiling pass:
CAPOS_THREAD_SCALE_PROFILE=1 \
CAPOS_THREAD_SCALE_RUNS=1 \
CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
make run-thread-scale
Run a one-sample guest-side measurement pass:
CAPOS_THREAD_SCALE_GUEST_MEASURE=1 \
CAPOS_THREAD_SCALE_RUNS=1 \
CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
make run-thread-scale
Run only the host summary parser against an existing results.csv without
booting QEMU:
CAPOS_THREAD_SCALE_SUMMARY_ONLY=1 \
CAPOS_THREAD_SCALE_SUMMARY_CSV=<results.csv> \
CAPOS_THREAD_SCALE_SUMMARY_KVM_EVIDENCE=1 \
CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
CAPOS_THREAD_SCALE_TOTAL_BLOCKS=262144 \
CAPOS_THREAD_SCALE_PARENT_WAIT=join \
CAPOS_THREAD_SCALE_WORK_ROUNDS=64 \
tools/qemu-thread-scale-harness.sh
Run the native Linux pthread baseline for the thread-scale checksum workload:
LINUX_THREAD_SCALE_TASKSET_CPUS=0,1,2,3 \
make run-linux-thread-scale-baseline
Run the Linux process-scale comparison:
LINUX_SMP_SCALE_KERNEL=target/linux-smp-process-scale/kernel/vmlinuz \
tools/linux-smp-process-scale-baseline.sh
On hosts where /boot/vmlinuz is not readable by the current user, copy a
kernel image into ignored target/ storage first through the host’s normal
administrative path, then pass it as LINUX_SMP_SCALE_KERNEL. The script does
not invoke sudo itself.
Configuration
The default capOS boot manifest (system.cue at the repo root) is layered
on a shared scaffold in cue/defaults/defaults.cue.
Operators can extend it without forking either file by dropping a
system.local.cue overlay next to system.cue. The overlay is
gitignored, so each developer/host can carry their own extensions
without conflicting with git pull.
This document is the current operator-facing design for the configuration surface. The historical proposal and closeout rationale live in System Configuration and Operator Extensibility.
How the layering works
mkmanifest --package capos system.cue manifest.bin invokes
cue export .:capos --out json against the repo root. CUE’s package
mode unifies every non-hidden .cue file in that directory that
declares package capos — currently system.cue (committed) and any
system.local.cue (gitignored) the operator drops in. The shared
scaffold is imported by system.cue:
import defaults "capos.local/cue/defaults"
#Manifest (the value system.cue exports) inherits all defaults from
defaults.#DefaultSystem, then applies any operator overrides declared
in system.local.cue. The kernel decoder reads concrete fields at the
document root (schemaVersion, binaries, initConfig,
kernelParams); #Manifest is documentation-only.
The decoder rejects any other top-level field name with a typed
error. For an unknown field named kernelParameters the rendered
message is:
unknown field `kernelParameters` at $; expected one of `schemaVersion`, `binaries`, `initConfig`, or `kernelParams`
CUE definitions (#Foo) and hidden fields (_foo) are stripped by
cue export and never reach the decoder, so this only fires when
the manifest projects an unintended visible name onto the document
root — a typo such as kernelParameters: … instead of
kernelParams: …, or a stale overlay field that was renamed in the
defaults package. Fix it by renaming the projected field to one of
the four accepted names, by moving the value under
kernelParams.…, or by hiding the auxiliary value with a _/#
prefix.
Quick start
Copy the committed example and edit:
cp system.local.cue.example system.local.cue
$EDITOR system.local.cue
make run
The Makefile picks up the new file automatically — no flag, no include
line. make re-evaluates the manifest because system.local.cue is a
prerequisite of the manifest rule.
Common Overlay Tasks
The examples below are complete system.local.cue fragments for common local
configuration changes. Each fragment is intended to be copied as a starting
point and adjusted before running make run.
Override the MOTD
package capos
#Manifest: kernelParams: motd: """
hello, capOS dev box.
type 'login' to authenticate.
"""
The defaults package declares motd: string | *"...", so a concrete
overlay value wins under CUE unification (a more concrete value is
strictly more specific than a default).
The system hostname is set the same way via kernelParams.hostname
(defaults to capos); it is served by SystemInfo.hostname and shown by
the shell hostname command. Bootstrap validation rejects whitespace,
control characters, and values longer than 255 bytes.
#Manifest: kernelParams: hostname: "web-01"
Add an authorized SSH key for the host operator
The default manifest declares a single host-operator seed account with
the canonical 32-byte principal id local-operator-principal-default.
Bind any number of authorized keys to that principal:
package capos
#Manifest: extraAuthorizedSshKeys: [{
keyId: "host-laptop-ed25519-2026-04"
principalId: "local-operator-principal-default"
algorithm: "ssh-ed25519"
publicKey: "<32-byte ed25519 public key as ASCII hex>"
fingerprintSha256: "<32-byte SHA-256 of the public key as ASCII hex>"
allowedShellProfiles: ["operator"]
source: "manifest"
comment: "host laptop"
}]
Convert an existing ~/.ssh/id_ed25519.pub line to the manifest hex
fields (Ed25519 example):
# extract the base64-encoded SSH wire format and decode the embedded key
ssh-keygen -e -m PKCS8 -f ~/.ssh/id_ed25519.pub | \
openssl pkey -pubin -outform DER 2>/dev/null | \
tail -c 32 | xxd -p -c 64
# fingerprintSha256 — SHA-256 over the same 32-byte raw public key:
ssh-keygen -e -m PKCS8 -f ~/.ssh/id_ed25519.pub | \
openssl pkey -pubin -outform DER 2>/dev/null | \
tail -c 32 | sha256sum | awk '{print $1}'
Use the printed hex as the publicKey and fingerprintSha256 strings.
The proposal explicitly avoids auto-ingesting ~/.ssh/*.pub from the
Makefile. Manual conversion gives the operator control over which keys
are trusted by the boot manifest.
Add a non-operator principal
The single-account-multi-auth invariant fixes the host operator at
kind: "operator"; slice 2 rejects manifests with multiple operator
seeds. Additional principals must use kind: "guest" or
kind: "service":
package capos
#Manifest: extraSeedAccounts: [{
name: "kiosk-guest"
displayName: "Kiosk Guest"
principalId: "kiosk-guest-principal-32-bytes-x" // exactly 32 bytes
kind: "guest"
credentialRefs: []
resourceProfile: "operator-default"
}]
Each seed account’s principalId must be unique and exactly 32 bytes;
each must reference an existing resourceProfile (either
operator-default from the defaults package or one declared in
extraResourceProfiles).
Add a custom resource profile
package capos
#Manifest: extraResourceProfiles: [{
name: "kiosk-guest-profile"
homeQuotaBytes: 0
tempQuotaBytes: 1048576
processLimit: 2
threadLimit: 4
capLimit: 24
memoryCommitLimitBytes: 16777216
frameGrantLimitPages: 64
endpointQueueLimit: 8
inFlightCallLimit: 4
ringScratchLimitBytes: 16384
logQuotaBytesPerWindow: 32768
networkProfile: "none"
cpuBudgetUsPerWindow: 10000
cpuWindowUs: 100000
timerWaiterLimit: 2
launcherProfile: "bootstrap-guest"
}]
Reference the profile name from extraSeedAccounts[].resourceProfile.
Add a binary and an init-launched service
The defaults package exposes extraBinaries and extraServices hooks.
The first embeds an additional binary into manifest.bin; the second
appends an entry onto initConfig.services so init launches it after
the base service graph. Build the binary as part of the operator
workflow — the default Make targets only build the binaries already
listed in the defaults package.
package capos
#Manifest: extraBinaries: [{
name: "site-monitor"
path: "demos/target/x86_64-unknown-capos/release/capos-demo-site-monitor"
}]
#Manifest: extraServices: [{
name: "site-monitor"
binary: "site-monitor"
restart: "never"
caps: [{
name: "console"
source: kernel: "console"
}, {
name: "timer"
source: kernel: "timer"
}],
}]
extraServices is concatenated onto _baseServices (base-first, then
operator-extra), so the operator service starts after the defaults’
chat server, remote-session gateway, and shell are launched.
Override the console password verifier
The defaults package ships a development-only Argon2id PHC for the plaintext “capos”. Any non-research deployment should mint a fresh verifier and override it:
package capos
#Manifest: kernelParams: consolePasswordVerifierPhc:
"$argon2id$v=19$m=19456,t=2,p=1$<salt-base64>$<hash-base64>"
Generate a verifier with the standalone argon2 tool
(argon2 "<salt>" -id -t 2 -m 19 -p 1 -e) or from any Argon2id
implementation that emits a PHC string with m=19456,t=2,p=1. The
canonical 32-byte local-operator-principal-default operator
principal id is unchanged; only the verifier rotates.
Host-user injection (@tag(user))
make run exports CAPOS_CUE_USER=$(USER), and mkmanifest forwards
it as --inject user=.... When CAPOS_CUE_DISPLAY_NAME is unset,
mkmanifest derives displayName from the same account’s first
GECOS/comment field in /etc/passwd and forwards it as
--inject displayName=.... If the passwd comment is unavailable or
empty, displayName falls back to the account name.
Other Make targets leave the structured tag variables unset, so
untagged system.cue keeps the canonical operator account name.
Focused demo and smoke manifests pin their own demo fixtures. The
audit-correlatable principalId is fixed to the canonical 32-byte
value regardless of host user, so audit history is stable across
$USER changes.
mkmanifest also keeps the generic CAPOS_CUE_TAGS comma-separated
escape hatch for additional key=value tags. The Makefile sets the
structured variables target-scoped to make run only:
run: CAPOS_CUE_USER = $(USER)
Set additional tags via
make USER=alice CAPOS_CUE_DISPLAY_NAME='Alice Smith' CAPOS_CUE_TAGS=region=eu-west run
or by passing --tag key=value to mkmanifest directly. system.cue
consumes user and displayName today; user must be a valid
manifest seed account name. Future tags can carry hostname, locale, or
other build-environment-derived values without adding new mechanisms.
Tools-root cache
CAPOS_TOOLS_ROOT defaults to $HOME/.capos-tools. The pinned
toolchain (capnp, cue, mdbook, typst, limine) lives under that path so
multiple capOS clones share a single download. Override with
CAPOS_TOOLS_ROOT=/path/to/cache make ... for non-default placement.
The Makefile and mkmanifest’s expected_cue_path follow the same
default; mismatched CAPOS_CUE / CAPOS_CAPNP env values are still
rejected by mkmanifest and make generated-code-check.
Schema-aware data conversion
mkmanifest cue-to-capnp converts CUE-authored data messages into arbitrary
specified Cap’n Proto struct roots without routing them through the boot
manifest ABI:
make cue-ensure capnp-ensure
CAPOS_CUE="$(make -s cue-path)" \
CAPOS_CAPNP="$(make -s capnp-path)" \
cargo run --manifest-path tools/mkmanifest/Cargo.toml --target "$(rustc -vV | awk '/^host:/ {print $2}')" -- \
cue-to-capnp --import-path schema input.cue schema/example.capnp Example output.bin
The subcommand accepts the same CUE --package, --tag, and
CAPOS_CUE_TAGS inputs as the manifest builder. It also accepts repeated
--import-path <dir> or -I<dir> arguments plus --no-standard-import, which
are passed to capnp convert as process arguments, not through a shell. The
input CUE is first exported to JSON, then the pinned Cap’n Proto tool validates
that JSON against the named schema and root struct.
This is the right path for configuration blobs, demo fixtures, or future
schema-defined records that are not SystemManifest. It still cannot encode
live capOS capability table entries or meaningful Cap’n Proto interface
objects; authority transfer remains an IPC/runtime concern.
Limits and non-goals
- A second
kind: "operator"seed account is rejected by the kernel in slice 2; multi-operator support is tracked in User Identity and Policy. - The slice-2 overlay is not a replacement for cloud-instance configuration; cloud-metadata-driven manifest deltas are designed in Cloud Metadata.
- The overlay does not auto-ingest
~/.ssh/*.pub; conversion is manual by design (security review on which keys count). - Focused-proof manifest migration onto the defaults package (slice 3,
Task 2) is complete: every repo-root
system-*.cuemanifest declares its own CUE package and imports the defaults package, exceptsystem-paperclips.cueandsystem-adventure.cue(demo-owned, package-less but still importing defaults) andsystem-measure.cue(held by the measure-mode-repair plan). The Slice-3 inventory table in System Configuration and Operator Extensibility records the per-manifest status, package, andmake run-*target.
Repository Map
This map names the main source locations for the current system. It is not an ownership file; use it to find the code behind architecture and validation claims.
Root Files
README.mdgives the compact project overview.docs/roadmap.mdrecords long-range stages and broad feature direction.docs/tasks/state.tomlrecords the current selected milestone.docs/tasks/README.mddefines the task-ledger schema and dispatch semantics.docs/tasks/*.md,docs/tasks/on-hold/,docs/tasks/active/,docs/tasks/review/, anddocs/tasks/done/carry task lifecycle records.docs/tasks/**carries open review-finding remediation records;REVIEW_FINDINGS.mdis a tombstone for pre-migration links.REVIEW.mddefines review expectations.Makefilebuilds pinned tools, userspace binaries, manifests, ISO images, QEMU targets, formatting checks, generated-code checks, and policy checks.rust-toolchain.tomldeclares the Rust nightly channel, required targets, andrust-src; it does not pin an exact nightly by date or commit..cargo/config.tomlsets the default bare-metal target and useful cargo aliases.
Schema and Shared ABIs
docs/abi-evolution-policy.mddefines compatibility classes, schema ordinal rules, ring-layout rules, version negotiation, and deprecation windows for externally visible ABI changes.schema/capos.capnpdefines capability interfaces, manifest structures, exceptions, ProcessSpawner, ProcessHandle, and transfer-related schema.capos-abi/src/lib.rsdefines small no_std ABI/policy constants shared by crates that should not depend on schema/config internals, including process quotas and credential policy limits.capos-config/src/manifest.rsdefines the host and no_std manifest model.capos-config/src/ring.rsdefinesCapRingHeader, SQE/CQE structures, opcodes, flags, and transport error constants shared by kernel and userspace.capos-config/src/capset.rsdefines the read-only bootstrap CapSet ABI.capos-config/src/cue.rssupports evaluated CUE-style manifest data.capos-config/src/credential_policy.rsre-exports credential policy limits; full PHC parsing is enabled by thecredential-validationfeature for bootstrap validators that need credential checks.capos-config/tests/ring_loom.rsmodels bounded ring protocol behavior with Loom.
Validation: cargo test-config, cargo test-ring-loom,
make generated-code-check.
Shared Pure Logic
capos-lib/src/elf.rsparses ELF64 images for kernel loading and host tests.capos-lib/src/cap_table.rsimplementsCapId, capability-table storage, stale-generation checks, grant preparation, transfer transaction helpers, commit, rollback, and the CapTable quota constants sourced fromcapos-abi.capos-lib/src/frame_bitmap.rsimplements the host-testable physical frame bitmap core.capos-lib/src/frame_ledger.rscontains a bounded frame-grant helper kept for host-test coverage; current MemoryObject accounting chargesCapTable::ResourceLedger.capos-lib/src/lazy_buffer.rsprovides bounded lazy buffers used by ring scratch paths.capos-lib/src/iso9660.rsis the pure ISO 9660 primary-volume-descriptor and directory-record parser the kernel boot-ISO driver (kernel/src/iso/) delegates to; fuzz targetiso9660_volume.capos-lib/src/storage_format.rsholds the pureCAPOSRO1(rofs),CAPOSST1(disk_store), andCAPOSWF1(writable_fs) mount parsers the kernel storage cap backers delegate to, including the shared record-layout constants the kernel writers reuse; fuzz targetsstorage_rofs_mount,storage_disk_store_mount,storage_writable_fs_mount.
Validation: cargo test-lib, cargo miri-lib, make kani-lib, fuzz targets
under fuzz/fuzz_targets/.
Kernel
kernel/src/main.rsis the boot entry point, hardware setup sequence, manifest parsing path, and boot-launched service creation path.run_initresolves PID 1 from the kernel-embeddedboot::INIT_ELFwheninitConfig.init.binary == capos_config::RESERVED_INIT_BINARY_NAME("init") and otherwise fromSystemManifest.binaries; for the embedded case it also injects the embedded image into theProcessSpawnerbinary set under the reserved name so child spawns ofinitresolve.kernel/src/boot.rsexposesboot::INIT_ELF: &[u8], the PID 1 init image packaged at build time.kernel/build.rsreads the prebuiltinit/artifact (CAPOS_INIT_ELF, with a conventional-path fallback) and generates theinclude_bytes!static;init/stays a standalone crate (byte packaging, not linker merging).kernel/src/spawn.rsloads user ELF images, creates process state, maps bootstrap pages, and enqueues spawned processes.kernel/src/process.rsdefinesProcess,Thread,ThreadState, per-thread kernel stacks, park waiter storage, and userspace CPU context.kernel/src/sched.rsimplements the single-CPU scheduler, timer-driven preemption, blockingcap_enter, direct IPC handoff, ParkSpace wait/wake, and deferred cancellation wakeups.kernel/src/serial.rsimplements COM1/COM2 UART setup, manifest-driven console-vs-terminal routing, and kernel print macros.kernel/src/pci.rsimplements early PCI config-space access through legacy I/O ports and ACPI MCFG/PCIe ECAM, with QEMU diagnostics for the current virtio-net and Q35 discovery paths, plus reusable memory-BAR subregion validation, kernel MMIO mapping helpers for in-kernel drivers, and MSI/MSI-X capability metadata discovery plus typed MSI-X table programming.kernel/src/device_interrupt.rsrecords the current kernel-owned virtio-net MSI-X config/RX/TX sources, their generation ids, route state, in-kernel driver owner, lock-free bounded device MSI vector-pool dispatch slots, and claimed-route reassignment/release without exposing userspace interrupt authority.kernel/src/device_dma.rsholds the kernel-owned, fixed-size DMA pool accounting ledgers. The net-keyedVIRTIO_NET_DMA_POOLbacks virtio-net’sDmaPagepath; a focused single-queueVIRTIO_BLK_DMA_POOL(reusing the sharedActivePage/QueueAccounttypes, same generation-checked handle and scrub-before-free invariants) backs the virtio-blk request buffer. Each device’sVirtqueueDmaseam impl delegates to its own pool’s keyed API.kernel/src/dma_backend.rs(always compiled) records the boot-time IOMMU probe verdict and resolves the fail-closed DMA backend selection (direct IOMMU remapping only with a verified probe, else kernel-owned bounce buffers) per the “Cloud DMA Backend” contract indocs/dma-isolation-design.md, emitting the boot proof line.kernel/src/device_manager/holds bounded in-kernel PCI device ownership records. The full DDF surface (device records, DMA pools/buffers, MSI-X interrupts, NVMe brokered controller registers, IOMMU domain ledgers, virtio ring publication, proofs) compiles only undercfg(feature = "qemu")inqemu_full.rs; the MMIO-only surface used bycap::device_mmioexists in both builds, dispatching tostub.rs(one-slot parked-regionDeviceMmiorecord) in the production non-qemubuild.kernel/src/nvme_storage_backend.rs(cfg(not(feature = "qemu"))) is the fail-closed activation gate for the always-built NVMeBlockDeviceread arm: modeled ondma_backend, it resolves a production handle only when a brokered controller was discovered and a livedevice_mmiogrant is staged, otherwise theblock_devicegrant fails closed with a typed error.kernel/src/virtio_transport.rs(always compiled) is the device-agnostic virtio modern-PCI transport host surface: capability/region discovery constants and bounded volatile MMIO accessors usable outside theqemu-gated legacy virtio path.kernel/src/virtio.rs(cfg(qemu)) holds the legacy in-kernel virtio transport, now a qemu-only fixture: the non-qemuproduction build compileskernel/src/virtio_stub.rsinstead, whose typed negative results keep stale or fixture-only kernel networking call sites failing closed. It includes the virtqueue drivers used by the IOMMU remapping proof. Itspub(crate) mod transportis the device-generic layer: split-ring/common-config constants, theMmioRegionaccessor, theVirtqueueDescriptorTracker, theVirtqueueDmaDMA/notify seam, the seam-drivenVirtqueue/DmaPagewith their poll/submit/complete loop and the multi-descriptorsubmit_request_chain, and the device-id-parameterizeddiscover_modern_transport. virtio-net is one seam caller (VirtioNetDma); virtio-blk is a second (VirtioBlkDma+VirtioBlkDriver,diagnose_virtio_blk_transport, theblock_device_*request API behind theBlockDevicecap). Net-specific provider/proof methods stay in the parent module asimpl Virtqueue<VirtioNetDma>.kernel/src/iommu.rs(cfg(qemu)) programs the Intel VT-d legacy-mode remapping tables, drives the hardware-DMA translation/fault proof, and runs the register-based invalidation revocation cycle.kernel/src/iso/(cfg(boot_iso_read)/cfg(boot_iso)/cfg(qemu)) is the boot-time ISO reader for the Boot Binary ISO Layout track.AtapiDevice(gate 1) locates the legacy IDE ATAPI device and exposes a boundedread_sectors(lba, count, buf)over polled-PIOREAD(12)packet commands with range/length validation.IsoFs(gate 2) is a read-only ISO 9660 driver layered on it: it parses the primary volume descriptor, walks directory records, and servesopen_file(name) -> (lba, size)under/boot/bins/, validating every directory record and derived extent against the volume size before use (fail-closedBadVolume/NotFound/NotDirectory).boot_read_proof()reads the PVD (CD001) andboot_fs_proof()walks to/boot/bins/PAYLOAD.BINand verifies its content, both behindboot_iso_readas themake run-boot-iso-readproof. Theboot_sourcesubmodule (gate 4,cfg(boot_iso)) builds a validated(name, lba, size)registry from every declared manifest binary name (mapping each name to the ISO 9660 d-character form, e.g.capos-shell->/boot/bins/CAPOS_SHELL) and reads ELF bytes on demand behind a device mutex;run_initandProcessSpawnerCapconsume it so theboot_isokernel loads binaries from the ISO instead of embeddedNamedBlob.data. Proofs:make run-boot-isoand the defaultmake run-smoke. Undercfg(qemu)the always-onAtapiDevice/IsoFssurface (plus a qemu-gatedblock_size()/list_boot_bins()enumeration helper) also backs the read-only install-source fixture cap (kernel/src/cap/installable_image.rs).
Validation: cargo build --features qemu, make run-smoke, make run-spawn,
make run-net, make run-iommu-remapping.
Kernel Architecture
kernel/src/arch/x86_64/gdt.rssets up kernel/user segments and TSS state.kernel/src/arch/x86_64/idt.rshandles exceptions and timer interrupts; CPL3 #PF/#GP/#UD/#DB/#BP faults terminate the whole owning process throughsched::exit_current_thread_terminating_process(deferred whole-process termination when sibling threads are live; proofmake run-user-fault), while CPL0 faults still halt the machine.kernel/src/arch/x86_64/syscall.rsimplements syscall MSR setup and entry.kernel/src/arch/x86_64/context.rsdefines timer context-switch state.kernel/src/arch/x86_64/pic.rsandpit.rsconfigure legacy interrupt hardware.kernel/src/arch/x86_64/ioapic.rsmaps MADT I/O APICs and programs masked legacy IRQ routes from interrupt-source overrides.kernel/src/arch/x86_64/lapic.rsprograms the xAPIC LAPIC timer and IPIs.kernel/src/arch/x86_64/smap.rsenables SMEP/SMAP and brackets user memory access.kernel/src/arch/x86_64/tls.rshandles FS-base/TLS support.kernel/src/arch/x86_64/pci_config.rsprovides legacy PCI config I/O used by the higher-level PCI module alongside its ECAM backend.kernel/src/arch/x86_64/percpu.rs,smp.rs, andtlb.rsprovide per-CPU data, AP startup, and TLB shootdown for the SMP scheduler.
Kernel Memory
kernel/src/mem/frame.rswraps the shared frame bitmap with Limine memory map initialization and global kernel access.kernel/src/mem/paging.rsmanages page tables, address spaces, permissions, user mappings, W^X enforcement, and address-space teardown.kernel/src/mem/heap.rsinitializes the kernel heap.kernel/src/mem/validate.rsvalidates user buffers before kernel access.
Related docs: DMA Isolation, Trusted Build Inputs.
Kernel Capabilities
kernel/src/cap/mod.rsinitializes kernel capabilities and builds the first service’s kernel-sourced bootstrap capability table.kernel/src/cap/table.rsre-exports shared capability-table logic and owns the kernel-global table.kernel/src/cap/ring.rsvalidates and dispatches ring SQEs.kernel/src/cap/transfer.rsvalidates transfer descriptors and prepares transfer transactions.kernel/src/cap/endpoint.rsimplements Endpoint CALL, RECV, RETURN, queued state, cleanup, and cancellation behavior.kernel/src/cap/console.rsimplements serial Console.kernel/src/cap/terminal_session.rsimplements the session-scoped TerminalSession line-oriented terminal with boundedreadLine, echo modes, and cancellation.kernel/src/cap/boot_package.rsimplements the read-only BootPackage manifest-size/chunked-read capability.kernel/src/cap/manual.rsimplements the read-only Manual capability: it parses the boot-packagedManualCorpusblob (embedded as themanual-corpusnamed binary) and answerspage/apropos/topics/section/describe/buildInfo.kernel/src/cap/log.rsimplements the Phase 1 monitoring log surface:LogSink(write) andLogReader(read) over a shared bounded, drop-oldest kernel recent-record ring. The sink drops records below the boot-seededSystemConfig.logLevelthreshold and forwards accepted records to serial; the reader returns records at/after a cursor withLogFilter(minLevel/componentPrefix),nextCursor, anddropped(docs/proposals/system-monitoring-proposal.md).kernel/src/cap/block_device.rsimplements theBlockDeviceCapObject(readBlocks/writeBlocks/info/flush). In the non-qemuproduction build theblock_devicesource resolves to the userspace-brokered NVMe arm (BlockDeviceBackend::NvmeBrokered, gated bykernel/src/nvme_storage_backend.rs); theqemubuild routes bounded inline-Datasector I/O to the kernel-owned virtio-blk driver inkernel/src/virtio.rsas a named fixture, not production storage (proofmake run-virtio-blk). The cap is scoped to onedevice_index: theblock_devicesource reaches the resolved non-target boot/storage disk, andblock_device_target(KernelCapSource.blockDeviceTarget @44) reaches the manifest-selected PCI identity when it names a bound non-boot virtio-blk disk. A cap for one disk grants no authority over another. The kernel binds up todevice_dma::MAX_VIRTIO_BLK_DEVICES(currently 2) virtio-blk devices, each with an independent driver/DMA-pool/interrupt-route instance (VirtioBlkDriver<const DEV>/VirtioBlkDma<const DEV>overVIRTIO_BLK_DMA_POOLS[DEV]);kernel/src/pci.rsenumerates each device with a device index (proofmake run-multi-virtio-blk). Target grants fail closed when the selector is absent, mismatched, or names the resolved boot disk. Counts are bounded to one bounce-buffer page.kernel/src/cap/readonly_fs.rsimplements the read-only filesystem service:ReadOnlyFsDirectoryCap/ReadOnlyFsFileCapparse a fixedCAPOSRO1on-disk layout read through the kernel-owned virtio-blk driver and serveDirectory.list/open+File.read/stat; every mutating method fails closed. Granted via theread_only_fs_rootKernelCapSource(returns a rootDirectorycap; qemu-gated, mounts at grant resolution and fails closed on a malformed/absent image). Host image buildertools/mkstore-image --readonly-fs; proofmake run-storage-fs.kernel/src/cap/persistent_store.rsimplements the disk-backed persistentStore:DiskStoreCapserves theStoreinterface (put/get/has/delete) over a fixedCAPOSST1on-disk layout read and written through a read+writeBlockSourceseam.putbump-allocates a data extent, writes the blob and entry record, then rewrites the superblock last as the durability commit point;deletetombstones the entry slot, and a later space-exhaustingputcompacts live entries through a shadow generation before recommitting the canonical front generation; the mount validates the superblock and every entry extent in-bounds and fails closed on a malformed image. TheVirtioBlockSource(qemu kernel) routes to the kernel-owned virtio-blk driver byte-identically (folding in thedata_region_base_lba()offset) and mounts eagerly at grant resolution; theNvmeBlockSource(built undercloud_persistent_store_over_nvme_proof) reads/writes through a granted NVMeBlockDevicewindow op and defers its mount-parse to the firstStorecall. Granted via thepersistent_storeKernelCapSource(virtio arm qemu-gated; the third NVMe-proof arm resolves the livedevice_mmiohandle). Host image buildertools/mkstore-image; reboot proofmake run-storage-persist(two QEMU passes on one disk image); NVMe put-then-get proofmake run-cloud-provider-persistent-store-over-nvmeviakernel/src/cap/persistent_store_over_nvme_proof.rs.kernel/src/cap/writable_fs.rsimplements the disk-backed writable filesystem service:WritableDirectoryCapserveslist/open/mkdir/remove/rename/createandWritableFileCapservesread/write/stat/truncate/sync/closeover a fixedCAPOSWF1on-disk layout (a flat node-record array with parent pointers + a bump-allocated data region) written through aBlockSourceseam. The RAM tree is the working copy; each mutation write-through-commits in the order data sector → node-record sector → superblock. A filesystem-wide fail-closed single-writer policy admits one writer at a time. TheVirtioBlockSource(qemu/installable kernels) routes to the kernel-owned virtio-blk driver byte-identically (folding thedata_region_base_lba()offset) and mounts the singleton eagerly; theNvmeBlockSource(built undercloud_writable_fs_over_nvme_proof) reads/writes through a granted NVMeBlockDevicewindow op and defers the singleton mount-parse to the firstDirectory/Filecall. Granted via thewritable_fs_rootKernelCapSource(virtio arm qemu-gated; the third NVMe-proof arm resolves the livedevice_mmiohandle), which mounts the process-wide singleton volume once and hands each grant a distinct writer id; fails closed on a malformed image. The NVMe write-then-read durability proof (make run-cloud-provider-writable-fs-over-nvmeviakernel/src/cap/writable_fs_over_nvme_proof.rs, which supersedes and drops the persistent-store-over-NVMe proof) exercises bothBlockDevicearms with the single-writer policy intact. The combined image buildertools/mkstore-image --writableco-locates theCAPOSST1Storesub-volume (LBA 0) and theCAPOSWF1filesystem sub-volume on one disk; reboot proofmake run-storage-writable(two QEMU passes: mutate then verify both the filesystem and the store survive). A slot becomes live on the next mount only once the superblock’s bumpednode_countis observed, so a poweroff in the record-written / superblock-pending window leaves an orphan slot the mount ignores. The proof-onlystorage_writable_recoveryfeature arms an induced forced poweroff in exactly that window (recovery_crash_after_record); bounded recovery proofmake run-storage-writable-recovery(pass 1 commits then iskill -9d mid-allocation, pass 2 verifies recovery to a consistent tree with the interrupted allocation atomically absent). The same crash window is proven over the NVMeBlockDevicearm bymake run-cloud-provider-writable-fs-over-nvme-recoveryviakernel/src/cap/writable_fs_over_nvme_recovery_proof.rs(a recovery cap-waiter clone that implies and supersedes the happy-path proof module/route/init); thecloud_writable_fs_over_nvme_recovery_prooffeature widens thestorage_writable_recoverycrash-window cfg gate, and the host-built NVMe image (tools/mkstore-image --writable-nvme, empty superblock + root-only node table) is booted twice with-device nvme(no@20seed).writable_fs::mount_config_root(qemu-gated) scopes a writableDirectoryto thesystem/configsubtree for the boot-time data-region grant below.kernel/src/cap/installable_image.rsimplements the read-only install-source fixture (Installable System track item 5b):InstallableImageDirectoryCapserveslist/openandInstallableImageFileCapservesread/stat/closeover the booted CD-ROM ISO 9660/boot/bins/tree, reading through thekernel/src/iso/boot_isoATAPI/ISO 9660 driver behind a single shared-device mutex (so PIO does not interleave across CPUs). Every mutating method fails closed; a past-EOF read clamps to empty and an absent name is rejected, reusing the driver’svalidate_extent/read_sectorsrange checks. Granted via the qemu-gatedinstallable_image_sourceKernelCapSource(mounts the ATAPI volume and validates/boot/bins/at grant resolution, failing the spawn closed on an absent/malformed medium). Physically scoped to the ATAPI CD-ROM, so it cannot reach the writable virtio-blk target disk (block_device_target/writable_fs_root). Consumer demodemos/installable-image-source/; manifestsystem-installable-image-source.cue; proofmake run-installable-image-source.demos/installable-system-install/implementscapos-system-install, the Installable System install flow (track item 6): under the read-onlyinstallable_image_sourceDirectoryand the target-scopedblock_device_targetBlockDeviceselected by manifest PCI identity, it copies the packaged bootable boot-region head (BOOTHEAD.BIN) to LBA 0, writes the backup GPT (BOOTGPT.BIN) at the LBA read from the primary GPT header, and initializes an empty data region (DATAIMG.BIN,tools/mkstore-image --writable --empty-config) at the fixedcap::data_region_base_lba, validating ranges and verifying the read-back. It reads packaged files in 32 KiB windows (under the read-path reply scratch bound; seedocs/tasks/done/2026-06-05/storage-file-read-reply-scratch-clamp.md) and zero-skips the FAT free space.tools/split-boot-region.pysplits the mkdiskimage boot image into the head + backup GPT so only the populated prefix is packaged. Pass-1 installer manifestsystem-installable-install.cue; pass-2 installed manifest (baked into the boot region)system-installable-install-target.cue; harnesstools/qemu-installable-install-smoke.sh; proofmake run-installable-install(pass 1 installs into a second virtio-blk disk, pass 2 boots it standalone).kernel/src/cap/mod.rsgrant_data_region(proof-onlyinstallable_data_regionfeature) is the Installable System boot-time data-region mount:run_initbest-effort grants init asystem/configDirectory(data-config) plus the persistentStore(data-store) over the auto-attached data disk, failing closed wholesale to the base manifest (caps unchanged, “no data region; base floor” diagnostic) when the disk is absent, malformed, or missingsystem/config. No new cap type or schema change. Proofmake run-installable-data-region(seeded disk prints resolved contents; no disk and zeroed-superblock disk hit the base floor).- Installable System config-overlay compose/merge (track item 3): the
SystemConfigOverlaycapnp object +SystemManifest.extensionPoints(ManifestExtensionPoints) live inschema/capos.capnp; the typed decode, content-hash check, andcompose_ontoprecedence (base-pins-win / overlay-adds-within-declared-extension-points / no-new-authority) live incapos-config/src/manifest.rs.init/src/main.rsapply_config_overlayreadssystem/config/overlay.binfrom the granteddata-configDirectory, composes the overlay over the base plan, and falls closed to the base floor with[init] overlay rejected: <reason>. Thetools/mkmanifestmkoverlaybin encodes overlays (filling the canonical hash) andtools/mkstore-image --writable --seed-overlayseeds them. Proofmake run-installable-overlay. - Installable System generations + rollback + failed-boot auto-fallback (track
item 4): userspace-only over the already-granted Store + writable
system/configDirectory, no schema or kernel change.init/src/main.rsrun_generation_rollback_checks(gated by a base service namedgeneration-proof) represents system-config generations as content-addressedStoreobjects keyed by SHA-256, tracks the known-goodactivepointer and a staged/attemptingcandidatepointer as monotonic-epoch marker files (gen-active/gen-candidate) in the writable config region, records a boot attempt durably before applying a candidate, auto-falls-back to the known-good generation when a candidate is left unconfirmed (the brick-proofing guarantee), promotes a confirmed candidate, rolls config back to a retained prior generation, and rejects a stale/replayed (lower-or-equal-epoch) pointer. A present-but-undecodablegen-candidatemarker (the torn size-0 file a poweroff inside the CREATE|TRUNCATE rewrite window leaves, or garbage bytes) is discarded with a loud diagnostic and boot falls back to the known-good generation, while a corruptgen-activemarker takes a distinct loud FATAL refuse-to-boot path (the known-good generation is genuinely unknown). Manifestsystem-installable-generation.cue; proofmake run-installable-generationboots a--seed-configdisk three times (boot 1 exercises the mechanism and leaves an unconfirmed candidate; boot 2 proves across-reboot auto-fallback to the known-good generation, then leaves a torn size-0 candidate marker; boot 3 proves torn-marker recovery). - Installable System integrated bootable disk (track item 5, proof-only
installable_diskfeature, impliesinstallable_data_region): one disk carries the boot ESP (GPT partition 1) and the co-locatedCAPOSST1Store +CAPOSWF1writable data region (GPT partition 2).kernel/src/cap/mod.rsdata_region_base_lbareturns the fixed partition-2 base LBA (264192) under the feature (0 otherwise), applied at the singlepersistent_store/writable_fsread_range/write_rangechoke points so the kernel reads the region at that fixed tool/kernel-contract LBA without parsing the GPT.tools/mkdiskimage.sh--data-image/--data-offset-bytesfold thetools/mkstore-image --writableimage into partition 2 and derive the ESP size from--esp-sectors(integrated disk uses the same 128 MiB ESP as the raw disk-image targets so a debug kernel fits). Manifestsystem-installable-disk.cue; proofmake run-installable-diskboots one virtio-blk disk and asserts the data region mounts from the boot disk and a data-region-only overlay service runs. kernel/src/cap/frame_alloc.rsimplements FrameAllocator and MemoryObject.kernel/src/cap/virtual_memory.rsimplements per-process anonymous memory operations.kernel/src/cap/timer.rsimplements monotonicnowand boundedsleep.kernel/src/cap/wall_clock.rsimplements the read-onlyWallClock.wallTimecap: UTC over a fixed boot base layered on the monotonic timebase, reporting the fail-closeduntrustedClockProvenance(Phase 1 fixed-boot-base variant;docs/proposals/time-and-clock-proposal.md).kernel/src/cap/park_space.rsimplements the process-local ParkSpace marker capability used by compact park (CAP_OP_PARK/CAP_OP_UNPARK) opcodes.kernel/src/cap/network.rsimplements the qemu-only NetworkManager, TcpListener, TcpSocket, and UdpSocket fixture caps. The kernel no longer depends onsmoltcp; non-qemumanifests reject the kernelnetwork_manager/tcp_listen_authoritygrant sources (fail closed), and the production socket path is the Phase C userspace network-stack process. The socket-backedSocketTerminalSessionshim is retired:TcpSocket.intoTerminalSessionfails closed in every dispatch path.kernel/src/cap/process_spawner.rsimplements ProcessSpawner and ProcessHandle.kernel/src/cap/provider_cap_waiter_proof.rs(non-qemu,cloud_provider_cap_waiter_proofCargo feature) stages a fully-programmed-route bootstrapInterruptgrant source and theInterruptCapWaiterProofcap whoseInterrupt.waitinjects onedevice_interrupt::handle_lapic_deliverydispatch and whoseInterrupt.acknowledgeretires the deferred LAPIC EOI; the cap’son_releaseruns the masked-no-wake + reassign + stale-handle assertion chain before emittingcloudboot-evidence: provider-cap-waiter <token>. Mutually exclusive withcap::interrupt_grant_source_prod(default cloudboot path) and skipscap::provider_nic_bind_proof/cap::storage_bind_proofto keep the bound route live for the userspace cap-waiter handoff. Proof:make run-cloud-provider-cap-waiter.kernel/src/cap/virtio_net_device_bringup_proof.rs(non-qemu,cloud_virtio_net_device_bringup_proofCargo feature; mutually exclusive withcloud_provider_cap_waiter_proofand the userspace selected-write handshake proof) drives the bounded virtio status sequence kernel-side over the picked virtio-net PCI function (vendor0x1af4, device0x1000/0x1041): resolves the modern virtio PCI transport regions throughvirtio_transport::parse_modern_pci_transport_capabilities, maps the common configuration window throughpci::map_bar_region, and drives reset → ACKNOWLEDGE → DRIVER → feature discovery + driver-feature selection (VIRTIO_F_VERSION_1only) → FEATURES_OK → DRIVER_OK with a trailing reset on every exit path. Inline assertions gate the headlinecloudboot-evidence: virtio-net-device-bringup <token>on the negotiated feature set,COMMON_NUM_QUEUES >= 2, DRIVER_OK observation, and the final reset returningdevice_statusto 0. Marker carriesqueue_setup=not-attempted,tx_descriptor=not-published,userspace_cap=not-issued,msix_function_enable=not-toggled,device_autonomous_raise=not-attempted,live_cloud=not-attempted. Proof:make run-cloud-provider-virtio-net-bringup.cloud_virtio_net_userspace_features_ok_proof(non-qemu; proofmake run-cloud-prod-nic-driver-userspace-features-ok) is Phase C slice 1 of the userspace NIC relocation track. It makescap::devicemmio_grant_source_prodstage the picked virtio-net modern common-config window as a selected-writeDeviceMmiocap withregisterWrite=selected-write-common-config-handshake; the userspace smoke drives reset -> ACKNOWLEDGE -> DRIVER -> FEATURES_OK overDeviceMmio.write32and proves queue-address writes remain fail-closed. It is mutually exclusive with the kernel-owned virtio-net bringup, bundle, and queue-materialization proof chain over the same BDF/grant path, and with thecloud_nvme_readonly_bind_proofdescendant chain because both stage a proof-specific productionDeviceMmiogrant source.kernel/src/cap/virtio_net_tx_authority_bundle_proof.rs,kernel/src/cap/virtio_net_tx_queue_materialization_proof.rs, andkernel/src/cap/virtio_net_msix_function_enable_proof.rsare the decomposed userspace-TX track. Each is non-qemuand gated by its own focused-proof Cargo feature (cloud_virtio_net_tx_authority_bundle_proof,cloud_virtio_net_tx_queue_materialization_proof, andcloud_virtio_net_msix_function_enable_proofrespectively; the last implies the second so the bundle observer + production grant-source pickers + userspace bundle smoke stay compiled in across the chain). The bundle proof observes the three production grant sources (devicemmio_grant_source_prod,dmapool_grant_source_prod,interrupt_grant_source_prod) issuing one cap each into the spawned userspace bundle smoke and asserts same-BDF; the queue-materialization proof drives the kernel-side modern-virtio status sequence throughDRIVER_OKand materializes one manager-owned TX virtqueue from three zeroed brokered frames, asserting register read-backs and post-reset clearance; the MSI-X function-enable proof extends that sequence with one canonical mask-first PCI MSI-X function-level enable (setFUNCTION_MASK, thenENABLE, then clear both) plus best-effort cleanup on every exit path. Each child emits its own headline marker (cloudboot-evidence: virtio-net-tx-authority-bundle <token>,cloudboot-evidence: virtio-net-tx-queue-materialization <token>, andcloudboot-evidence: virtio-net-msix-function-enable <token>); when the later feature is active the earlier markers are intentionally suppressed because their discipline labels would be inaccurate. Proofs:make run-cloud-provider-virtio-net-tx-authority-bundle,make run-cloud-provider-virtio-net-tx-queue-materialization,make run-cloud-provider-virtio-net-msix-function-enable.kernel/src/cap/virtio_net_userspace_rx_dma_proof.rs(Phase C slice 4a-ii, gatedcloud_virtio_net_userspace_rx_bringup_proof) drives the first real RX DMA from the shim-owned vring:post_rx_descriptorwrites the RX descriptor- avail over the shim’s retained RX vring physes at
DMABuffer.submitDescriptortime, anddrive_rx_dma(reached from the now-liveprovider_notify_doorbell_write_for_cap) rings the RX doorbell, submits a kernel-half SLIRP TX ARP stimulus over the retained TX physes, polls one real device->host completion, and resets the device (clearing the retainedenabledflags to release the ring-buffer pins). Self-contained byte-level vring helpers are duplicated fromvirtio_net_polled_providerto protectrun-net. The notify region is mapped kernel-side + the per-queue notify slot offsets captured bycap::devicemmio_grant_source_prod(rx_dma_notify_state). Proofmake run-cloud-prod-nic-driver-userspace-rx-bringup(extended).
- avail over the shim’s retained RX vring physes at
kernel/src/cap/null.rsimplements the measurement-only NullCap.kernel/src/cap/park_bench.rsimplements the measurement-only ParkBench authority used bymake run-measure.
Related docs: Capability Model, Authority Accounting.
Userspace
init/is the standalone init process. In the spawn smoke, it uses ProcessSpawner, grants initial child capabilities, waits on ProcessHandles, and checks hostile spawn inputs.capos-rt/src/entry.rsowns the runtime entry path and bootstrap validation.capos-rt/src/alloc.rsinitializes the userspace heap.capos-rt/src/syscall.rsprovides raw syscall wrappers.capos-rt/src/capset.rsprovides typed CapSet lookup helpers.capos-rt/src/ring.rsimplements the safe single-owner ring client, out-of-order completion handling, transfer descriptor packing, and result-cap parsing.capos-rt/src/client.rsimplements typed clients for Console, TerminalSession, BootPackage, ProcessSpawner, ProcessHandle, and Timer. The client-side methods are generic overTransport; result-cap-adopting methods stay on the concreteRuntimeRingClient.capos-rt/src/transport.rsdefines theTransportseam (the client-sideCALL/completion/RELEASEring operations) and the in-systemRingTransport(RingClientviewed through the seam). A host remote transport is a later slice; seedocs/backlog/capos-sdk-dual-transport.md.capos/is the front-door SDK facade crate: for the defaultringfeature it re-exports thecapos-rtruntime, typed clients, theentry_point!macro, and aprelude. Theremotefeature is reserved. Standalone, likecapos-rt.capos-rt/src/pollselect.rsis the pure POSIXpoll/selectbridge:SocketReadiness->pollrevents(POLLIN/POLLOUT/POLLHUP/POLLERR/POLLNVAL) andselectset membership, plusunsupported_request_bitsfor fail-closed flag handling. Shared by thelibcapos-posixC surface and theposix-socket-poll-select-smokeproof. Proof:make run-posix-socket-poll-select.capos-rt/src/panic.rsprovides the emergency Console panic output path.capos-rt/src/bin/smoke.rsis the runtime smoke binary used by focused runtime proofs rather than the default boot manifest.capos-service/src/lib.rsis the standaloneno_stdservice lifecycle layer abovecapos-rt; slice 1 exposesServiceMain,ServiceRuntime, and ordered initialize/dependency-wait/ready/run/drain/shutdown/cleanup phases.shell/src/main.rsis the native capability shell, built as the standalonecapos-shellcrate and packaged bysystem.cue,system-shell.cue, and the focused login manifests.
Validation: make capos-rt-check, make run-smoke, make run-spawn,
make run-shell, make run-terminal. The former Telnet fixture is retired
with the qemu-only kernel TCP listener.
Standalone C and WASI Substrates
These are standalone crates (not workspace members) built by the Makefile.
libcapos/buildslibcapos.a, ano_stdRust staticlib exposing the capos-rt syscall/ring/CapSet path and typedConsole/Timer/EntropySource/VirtualMemorywrappers plus C heap shims to C consumers. Public header atlibcapos/include/capos/capos.h. No POSIX surface.libcapos-posix/buildslibcapos_posix.a, ano_stdRust staticlib layering a POSIX adapter overlibcapos: per-process fd table, errno cell, historical UDP socket wrappers over the retired qemu-only kernelUdpSocketcap, clock overTimer,pipe/dupoverPipe,poll/select(poll.rs,<poll.h>/<sys/select.h>) over thecapos-rt::pollselectreadiness bridge with fail-closed unsupported-flag /EBADF/EINVALhandling, andfork/execve/waitpidvia the recording-shim ProcessSpawner Move-grant path, plus the libc surface the dash port needs: stdio/string/stdlib/ctype helpers,strerror/qsort/umask/abort/strtoll/strpbrk/lstat/getgroups/wait3/vfork, byte-order helpers (inet.rs),getrlimit/setrlimit(resource.rs),setlocale(locale.rs),times+tcgetattr, C-localewchar/wctypemultibyte (wchar.rs), theenvironpointer, and thesys_siglistarray. C headers (the namespaced source of truth) underlibcapos-posix/include/capos/posix/– including the dash-neededsys/types.h,termios.h,sys/resource.h,sys/times.h,wchar.h,wctype.h,locale.h,inttypes.h, and the decl-onlysys/ioctl.h/sys/mman.h/arpa/inet.h/getopt.h/paths.h/sys/param.h.libcapos-posix/sysroot/include/is the-nostdincbare-header sysroot (<stdio.h>,<unistd.h>,<sys/stat.h>, …) whose wrappers forward into that namespace; mirrored C ports (dash) build against it via the Makefile’sCAPOS_C_SYSROOT_INCLUDEflags on thecapos-c-multitu-elfrule. Focused sysroot proofmake run-c-libc-surface.capos-wasm/is theno_stdWASI host adapter: awasmi-backedRuntime, thewasm-hostuserspace binary, the Preview 1 import resolver, and the manifest-supplied wasm payload reader.vendor/wasmi-no_std/andvendor/dns-c-wahern/are static-pinned, no-patches upstream snapshots consumed bycapos-wasm/and the POSIX DNS smoke; do not patch them in place (refresh procedure in eachVENDORED_FROM.md).vendor/dash/is the mirror-as-is dash0.5.13.4snapshot (src/stays byte-identical; capOS deviations live underpatches/). Its capOS build pipeline lives outside the mirror undervendor/dash/capos/: the pinnedconfig.handgen-tables.sh(stages a patched source copy + runs the six host table generators). The Makefiledashtarget buildstarget/dash/dash.elfthroughcapos-c-multitu-elfagainstlibcapos.a+libcapos_posix.a.
Validation: make run-c-hello, make run-posix-pipe-smoke,
make run-posix-printf, make run-wasm-host, make run-wasi-hello-rust,
make run-wasi-random. The former POSIX DNS smoke is retired with the
qemu-only kernel UdpSocket owner.
Demo Services
demos/ is a nested userspace smoke-test workspace. Each demo is a release-built
service binary packaged into the boot manifest:
adventure-client,adventure-server,adventure-npc-shopkeeper,adventure-npc-wanderercapos-chat,chat-bot,chat-client,chat-servercapset-bootstrap,console-paths,credential-storeendpoint-queue-limit-smoke,endpoint-roundtrip,ipc-server,ipc-client,in-flight-call-limit-smokeframe-allocator-cleanup,memoryobject-shared-child,memoryobject-shared-parentpaperclips,paperclips-contentrevocable-read,revocation-observerring-corruption,ring-reserved-opcodes,ring-nop,ring-fairnessservice-common,shell-spawn-test,shell-typed-callterminal-session,terminal-strangertimer-smoke,timer-floodtls-smoke,unprivileged-stranger,virtual-memoryuser-fault-parent,user-fault-victim(user fault containment proof,make run-user-fault)
Shared demo support lives in demos/capos-demo-support/src/lib.rs and uses
capos-rt for entry, allocator, syscall, CapSet, and panic support while
keeping raw ring helpers for low-level transport smokes.
Validation: make run-spawn.
Manifest and Tooling
system.cueis the default init-owned boot manifest source. It imports the shared defaults package, boot-launches standaloneinit, and lets init start the shell, remote-session CapSet gateway, and resident services.system-spawn.cueis the ProcessSpawner smoke manifest source.system-smoke.cueis the scripted focused shell-led login/shell smoke manifest source.system-chat.cue,system-adventure.cue, andsystem-paperclips.cueare focused resident-service and terminal-demo manifest sources.system-memoryobject-shared.cue,system-revocable-read.cue, andsystem-measure.cueare focused regression/measurement manifest sources.system-shell.cueis the focused anonymous-shell manifest source (no verifier, shell stays anonymous).system-terminal.cueis the focused TerminalSession proof manifest source.system-credential.cueis the focused CredentialStore proof manifest source.system-login.cueis the focused password-login proof manifest source.system-login-setup.cueis the focused first-boot setup proof manifest source.tools/mkmanifest/evaluates manifest input, embeds binaries, validates manifest shape, writes boot-manifest Cap’n Proto bytes, and providescue-to-capnpfor schema-aware CUE-authored data-message conversion. Its siblingmkoverlaybin encodes aSystemConfigOverlayfrom CUE into thesystem/config/overlay.binbytes (filling the canonical content hash) for the installable-system config-overlay proof.tools/manualc/is the System Manual corpus compiler: it parsesschema/capos.capnpfor section-2 interface pages, reads the authored man corpus underdocs/manual/man<section>/*.man, and emits the boot-packagedManualCorpusblob. It fails the build if any in-tree capability interface lacks a section-2 page (i.e. a schema doc comment).docs/manual/holds the authored man-shaped corpus consumed bymanualc(section 1 shell-command pages and section 7 concept pages); section-2 capability pages are generated from the schema, not stored here.system-manual-smoke.cueis the focused Manual proof manifest source.tools/agent-session-recaps/contains private-session recap and raw-archive tooling for the agentic development experiment. The tools are tracked here; raw transcripts and generated recap stores stay outside the repo unless explicitly redacted and reviewed.tools/check-generated-capnp.shverifies checked-in generated schema output.scripts/record_worklog.pyemits per-task commit spans (from each task’scommits:list, falling back to task-file history) for the development timeline/Gantt;scripts/validate_backfill_tasks.pyvalidates backfilled task-file frontmatter against the chunk’s real SHAs;scripts/check-md-links.pyis the pre-commit broken-relative-link gate over all.md.tools/githooks/is the repocore.hooksPath(enabled withmake hooks):prepare-commit-msgstamps provenance trailers (Plan-Item/Run-Id/Agent-Kind) onto run-driven commits, alongside the git-lfs hooks.tools/qemu-net-harness.shruns the current QEMU net harness, withtools/qemu-net-smoke.shasserting virtio-net transport, MSI-X metadata selection, kernel-owned MSI-X vector-pool allocation/programming, masked route-lifecycle proof, queue vector assignment, descriptor guards, ARP, and ICMP fixture lines.fuzz/contains fuzz targets for manifest Cap’n Proto decoding (with the production reader-options envelope), mkmanifest JSON conversion/validation, ELF parsing, Telnet IAC filtering, terminal line discipline, ring SQE wire validation, ISO 9660 PVD/directory-record parsing, theCAPOSRO1/CAPOSST1/CAPOSWF1storage mount parsers, and thecapos-tlsX.509 validity walk.
Validation: cargo test-mkmanifest, make generated-code-check,
make fuzz-build, make fuzz-smoke.
Documentation
docs/capability-model.mdis the current capability architecture reference.docs/architecture/threading.mdanddocs/architecture/park.mdrecord the accepted contracts and first implementation for in-process thread ownership and private ParkSpace authority.docs/*-design.mdfiles record targeted implemented or accepted designs.docs/proposals/contains accepted, future, exploratory, and rejected designs.docs/research/summarizes prior art (thecapability-systems-survey.mdsynthesis plus per-system deep-dive reports).docs/proposals/mdbook-docs-site-proposal.mddefines the documentation site structure and status vocabulary used by the orientation pages.
First Chat Demo
The First Chat demo is the smallest runnable multi-process service demo in
capOS. It boots a resident chat-server, a bounded chat-bot actor, and a
native shell that can launch chat-client with explicit StdIO plus the
broker-issued operator Chat endpoint grant.
The chat service is not a shell builtin. The shell only launches a client
process and services that client’s StdIO endpoint while the client talks to
the resident Chat endpoint. The focused manifest routes the kernel singleton
chat_endpoint through init to chat-server, which is the same endpoint the
broker facets into operator shell bundles.
Run It
Use the focused QEMU proof:
make run-chat
The scripted proof creates a volatile shell credential, rejects an attempted
client endpoint relabel, launches chat-client under the authenticated shell
session, sends one lobby message, checks membership with /who, observes the
resident bot reply, quits the client, and exits the shell. The terminal
transcript should include:
[chat] /join <channel>, /leave, /who, /exit, or plain text
[chat:#lobby]> hello from shell
[chat] #lobby <member-2> hello from shell
[chat] #lobby <member-1> [chat-bot] echo-bot heard you.
For default manual use, boot the ordinary playground:
make run
After login:
run "chat-client" with { stdio: client @stdio, chat: client @chat }
The default playground starts the resident chat-server and includes
chat-client, but it does not start the bounded chat-bot proof actor. Use
make run-chat when you need the one-shot echo-bot transcript.
For lower-level manual proof work, let make run-chat build the focused ISO,
then boot capos-chat.iso yourself with the terminal UART attached to stdio and
the console UART written to a log.
Useful client commands:
/join #other
/who
/leave
/exit
plain chat text
The resident bot is a bounded proof actor. If the operator waits too long before joining and sending the first lobby message, the bot can time out and exit; the chat client and server remain usable, but the bot reply will no longer appear.
What It Demonstrates
make run-chat and the manual terminal path described above currently show:
chat-serverruns as a resident service exporting only theChatendpoint;chat-serverkeys membership by the opaque caller-session reference in the endpoint metadata, not by a caller-selected endpoint badge;chat-botis a separate participant with a delegated chat client endpoint and its own session-bound membership record;capos-shelllauncheschat-clientas an ordinary userspace process;- the foreground client receives only explicit
StdIOandChatgrants; - caller-selected endpoint relabeling is rejected for delegated chat clients;
- the
handlesupplied tojoinis request data only; the service assigns visiblemember-Nlabels and the handle does not select membership authority or sender identity; - lobby messages and bot replies are visible through the terminal transcript;
/wholists current channel members from the resident service;- client exit returns to the shell prompt, and the manifest child wait path observes clean shell and bot exits during normal completion.
Current Limits
This is not yet a distinct-local-user chat surface over Telnet or multiple terminals.
system.cue and system-chat.cue each boot one terminal-backed shell on the
QEMU terminal UART, and the shell’s run command waits on the foreground
client’s StdIO endpoint. Multiple chat-client runs can reuse the resident
service, but the current manual flow is one foreground client at a time. The
demo client still sends the hard-coded join handle shell for compatibility;
the server ignores it for visible sender labels and does not request disclosed
display/profile metadata from the session broker yet.
The default make run foreground shell now receives its shell bundle from
AuthorityBroker, including a profile-scoped chat endpoint for operator
shells. Guest and anonymous shells do not receive chat by default. An
operator shell can therefore run the same chat-client command after login.
This is still not a distinct durable user chat surface: the demo
client joins with the hard-coded handle shell, the server assigns its own
visible member label, and multiple terminal sessions still need a
multi-session terminal host or network gateway before they are a real
multi-user chat model.
To make distinct local users chat through Telnet or terminals, capOS still needs a multi-session terminal host or Telnet gateway that can keep multiple shell sessions alive, grant each session a broker-authorized chat root/facet, and disclose only the bounded display/profile metadata the user or broker explicitly permits.
Aurelian Frontier — Proof Slice
This page describes the current runnable proof slice of the Aurelian Frontier game. It is the end-to-end example of a capOS-native interactive application: a Roman-frontier text adventure with magic wards, warrior skills, wizard spells, NPC chat history, per-player state, and explicit capability grants. The wider game design lives in Aurelian Frontier; this page covers what runs today and how the QEMU smoke proves it.
Unlike a shell builtin, the game runs as ordinary userspace processes:
capos-shelllaunchesadventure-clientwith onlyStdIO,Adventure, andChatclient capabilities.adventure-serverowns room, inventory, writ, combat, evidence, and effect state keyed by the endpoint caller-session scoped reference and epoch, while consuming validated read-only prototype mission content generated fromadventure-contentCUE source.chat-servercarries room messages and labels replayed room history so NPC actors do not treat old messages as fresh input.adventure-npc-wandererandadventure-npc-shopkeeperprove that separate actors can join the shared ashen-road channel without receiving ambient game authority.adventure-scenario-testis a noninteractive capOS userspace test process with onlyConsoleandAdventurecaps. It drives the custody scenario throughAdventureClientRPCs and prints a console success marker.
Run It
Use the focused QEMU proof:
make run-adventure
The scripted run creates a volatile shell credential, launches the interactive
adventure client for representative rendering and command coverage, and also
asserts the resident adventure-scenario-test success marker and exit status
for the complex custody path.
run "adventure-client" starts from a fresh expedition view by default. Use the
client’s resume command to return to that session’s active expedition state
instead of silently continuing it on launch.
For the default init-owned boot, start make run, log in or run setup, then
use the MOTD compatibility commands:
spawn "chat-server" with { console: @console, chat: @chat } -> $chat
spawn "adventure-server" with { console: @console, adventure: @adventure, chat: client @chat } -> $adventure
spawn "adventure-npc-wanderer" with { console: @console, chat: client @chat } -> $wanderer
spawn "adventure-npc-shopkeeper" with { console: @console, chat: client @chat } -> $shopkeeper
run "adventure-client" with { stdio: client @stdio, adventure: client @adventure, chat: client @chat }
Normal launch commands omit legacy receiver selectors; delegated client
endpoint identity is preserved by default. The adventure server derives player
state from live session-bound endpoint caller metadata. The focused
make run-adventure proof is the authoritative regression path. Its manifest
uses selector-free Adventure and chat endpoint grants, while hostile and
lower-level smokes retain explicit legacy selector fixtures for rejection
coverage.
Current Mission
The implemented mission starts in fort_aurelian, crosses gate_yard and
ashen_road, and reaches signal_tower, with under_vault present as a
bounded site in the generated graph. The player can request and delegate a
ward-writ, ask actors about the mission, quote and buy Maro’s route support,
fight a ward-wraith, order Livia to expose the tower sigil, recover
eagle-standard, record a wounded-legionary evacuation, seal the gate-yard
breach, and get Iunia’s witness-certified temple-seal custody. Room views
show canonical room, exit, actor, mob, and writ ids alongside the current
mission and lead. Status and inventory separate survival, location, mission,
physical items, writs, relic custody, marks, evidence, effects, and the next
lead; status also prints the fixed smoke seed calendar (ashfall day 9,
ash-wind, ward-static), a bounded seasonal resource count/cap summary, and
a carried seasonal-resource forecast that names the next season’s degraded and
expired counts. The current gameplay slice also lets active collectible
seasonal resources be taken at their site; carried crops, fish, and forage
participate in the next-season aging rule, while active repair-material
resources can be harvested without being treated as fragile seasonal carry
items. ask quartermaster about season-transition applies that aging rule:
expired crops are removed, fish/forage degrade to explicit -degraded
inventory tokens, and unknown or non-seasonal items stay unchanged. After the
audited debrief grants Aurelian standing, the quartermaster can sell one
bounded field-ration from the fixed-smoke per-expedition seasonal stock,
spending that standing and adding the ration to inventory. Ordinary inventory
is currently bounded to six slots. This is not a full seasonal economy or
persistent calendar advance.
Status also prints the active generated calendar event metadata for the fixed
seed: the lantern-vigil festival’s actor-location, shop, witness, route, and
rumor overlays. These event fields are metadata/status only; actor movement,
event-driven shop mutation, witness blocking, route safety mutation, debrief
branching, quests, gifts, and affection are not implemented. Status also prints
active generated actor routine metadata for
named actors, selected from the fixed calendar plus the current mission and
emergency state: actor id, room id, routine kind/trigger, schedule/effect text,
authority stance, and metadata-only gameplay stance. These routine records do
not move actors or grant/revoke authority. Status also prints
a concise regional frontier summary for the generated settlement, outpost, and
route metadata, plus a concise regional market order-book summary for generated
market books, buy/sell orders, crossed pure matches, and receipt-ledger ids.
Market-eligible items are limited to ordinary seasonal resources, construction
materials, and explicit outpost produced/consumed supplies. Writs, relics,
actors, mobs, spells, skills, order tasks, and artifact/authority-gated
blueprint outputs are excluded. The first live regional market transaction
proof is bounded to one generated order-book match at a time: Adventure owns
reserve, commit, cancel/release, stale-version rejection, idempotent replay,
and ordered receipt facts behind existing quote, buy, and sell calls for
explicit regional-market proof actions. Fresh committed field-ration matches
now debit the player-local Aurelian chit balance once, decrement the seller
ash_farm field-ration stock once, accrue two service-owned regional market
fee chits once, credit two service-owned ash_farm seller-proceeds chits once,
and deliver the committed quantity into the player expedition inventory only
when ordinary inventory capacity can accept the full delivery; if capacity is
full, replaying buy commit-field-ration from regional-market can apply the
held delivery after ordinary items are dropped without spending, decrementing
stock, accruing fees, or crediting proceeds again. Commit replay does not
duplicate delivery, debit, outpost stock movement, fee accrual, or seller
proceeds. Commit 29c065a9 at 2026-04-30 17:41 UTC added bounded order
expiry to live matching and reserve: fixed-smoke day 65 keeps the field-ration
proof active, while the scenario process proves a day-73 expired field-ration
reserve releases without status, inventory, currency, outpost stock, fee,
seller-proceeds, or delivery mutation. Commit 205fd6a0 at
2026-04-30 18:40 UTC added a bounded
service-owned fee withdrawal proof: sell withdraw-fees to regional-market
moves the two accrued regional-market fee chits into a service-owned treasury
record exactly once, status exposes the treasury balance, replay is stable, and
inventory, currency, outpost stock, seller proceeds, and delivery state do not
mutate. Commit a547db3d at 2026-04-30 19:43 UTC adds a bounded receipt
snapshot/restore proof:
buy receipt-snapshot from regional-market clones the live regional
market receipt facts, reconstructs a separate transaction state, replays the
old field-ration commit against that reconstructed state, and returns proof
success without mutating live status or inventory.
Commit 4b44b32 at 2026-04-30 20:07 UTC adds a bounded settlement-side
snapshot-view proof: buy settlement-snapshot from regional-market checks
applied delivery, debit, stock, fee, proceeds, and withdrawal ids plus the
current settlement balances, replays the committed field-ration fact and fee
withdrawal as already applied, and returns proof success without mutating live
status or inventory.
The construction-job receipt snapshot work is scoped to pure Rust construction
receipt snapshot semantics plus a size-constrained QEMU no-mutation probe.
Pure adventure-content tests restore a separate ConstructionJobState from
ordered field-repair job facts and validate malformed, over-capacity, and
non-closed snapshot shapes. The focused QEMU path drives repair receipt-snapshot with field-engineer after the old completed repair only to
check status/inventory stability and confirm live construction state and
material stock are unchanged. The runtime command does not replay receipts
into the live construction service and is not durable restart loading or a
general construction persistence layer.
It does not yet move NPC stores, broader outpost inventories, durable currency
ledgers, durable seller-proceeds ledgers, profile ledger balances, fee
ledgers, durable calendar advancement, durable crash-recovery state, or
general economy behavior.
Status also prints a construction
foundation summary for
generated blueprint, artifact, enchantment slot, and gate metadata;
the first live construction-job proof is bounded to the field-engineer gate
repair path: Adventure owns reserve/start, completion, cancel/release,
stale-version rejection, idempotent replay, service-owned material holds and
restores, and ordered job facts behind existing repair calls. It does not yet
persist durable stock ledgers, replenish stock from outposts, update player
output/currency inventories, advance job time, persist crash-recovery state, or
provide general crafting/artifact gameplay.
Status now also prints disabled-by-default optional fake-agent NPC metadata:
budget count, supported fake-agent purpose count, aggregate session token
budget, tool-call budget, and audit visibility. That is deterministic metadata
for future optional chatter, hints, outpost summaries, personal routines,
nonbinding shop flavor, and festival reactions, not live LLM gameplay or
autonomous NPC authority. Status also
prints the first local party foundation: a service-created local player label,
the current party leader/members/pending invites, scoped ward-writ
delegations, and recorded assists. Party labels are derived from live Adventure
caller-session keys and do not disclose global session or principal data. The
same service-local labels are used by the first physical-item transfer
foundation, transfer <item> to <player>, which mutates both player inventories
atomically inside Adventure, requires shared party membership, and refuses
relic custody such as eagle-standard. Currency escrow and two-client transfer
proof remain future work.
Valid near-miss ids such as ward and wraith return explicit suggestions.
The site graph, regional metadata, visible items, actors, mobs, aliases,
objectives, mission text, leads, scripted proof-path metadata, named-item
inspection text, and prepared-spell inspection text are authored in
demos/adventure-content/content/prototype.cue, generated into
demos/adventure-content/src/generated.rs, and validated by host tests before
the server consumes them.
Useful commands in the current game:
look
resume
status
request ward
request ward-writ
accept ward-writ
delegate ward-writ to livia
order Livia to guard
go east
go east
say hello road
take scout-marker
quote route from maro
buy route from maro
quote regional-field-ration from regional-market
buy reserve-field-ration from regional-market
buy commit-field-ration from regional-market
buy reserve-incense from regional-market
sell cancel-incense to regional-market
sell withdraw-fees to regional-market
transfer scout-marker to player-1
repair gate with field-engineer
repair retry-field-repair with field-engineer
repair complete-field-repair with field-engineer
repair stale-field-repair with field-engineer
repair reserve-cancel-field-repair with field-engineer
repair cancel-field-repair with field-engineer
go north
order livia to dispel-sigil
inspect ward-wraith
cast ember-dart ward-wraith
skill strike ward-wraith
recover eagle-standard
ask wounded-legionary about evacuation
guard
cast shield-bind self
go south
go west
seal gate
go west
ask iunia about custody
inventory
go down
What It Proves
make run-adventure currently asserts:
- shell-spawned game clients run with explicit
StdIO,Adventure, andChatgrants; - ordinary
adventure-clientlaunch andlookstart fresh, while the explicitresumecommand reloads active expedition state through anAdventurecap call; - room joins, movement, physical item pickup, typed relic recovery, inventory, status, and representative failure messages are visible in the terminal transcript;
give,ask,request,accept,delegate,order,seal,recover,revoke,quote,buy,sell,trade,transfer, andrepairare wired as typed adventure calls, not shell-special strings;adventure-clientexposesparty create,party invite,party accept,party leave,party delegate,assist, andtransfer <item> to <player>command paths backed by typed Adventure methods;- the party proof covers one-client party creation, missing local-player refusal paths for invite and assist, party status output, and help/client command availability; two-client successful accept, leave, delegate, and assist calls remain future work;
- the transfer proof covers one-client unknown target, self-transfer, and missing-item refusals, with status or inventory unchanged as appropriate; successful two-player transfer remains covered by pure Rust state tests until the launcher/session harness can run two real Adventure clients;
- canonical room, exit, actor, mob, and writ ids, room-view leads, common actor casing aliases, near-miss suggestions, and improved actor-task hints are visible in the terminal transcript;
- combat status exposes hp, guard, fatigue, warrior stars, wizard circles, prepared spells, active mobs, mission state, physical items, writ authority, relic custody, marks, evidence, effects, fixed smoke seed calendar state, and objective state;
- generated actor routine metadata is visible through status as structured status-only records filtered by the fixed calendar and current mission/emergency state;
- generated regional market order-book metadata is visible through status as aggregate metadata and pure non-mutating crossed-match counts only;
- market and construction coverage proves a Maro route quote, a successful
route exchange, an Iunia clean-custody trade refusal that names the
temple-sealgate and price, a bounded regional-market reserve/commit/retry/stale/release/cancel proof where the server owns the transaction state and receipt facts, and a bounded field-repair construction job proof where the server owns job state, service-owned material hold/release facts, held-stock mutation, and terminal facts; shell-smoke coverage also keeps the full market command-help surface, includingsell, visible; - delegated authority can expose the ward, repeated spell actions are
idempotent, and
eagle-standardrecovery records bounded evidence in the interactive transcript; - the
adventure-scenario-testprocess covers physical-item-onlytakeanddrop, carried seasonal resource pickup, quartermaster-triggered seasonal inventory aging, post-debrief seasonal ration purchase, Iunia custody denials, witness refusal, survivor evacuation, gate sealing,temple-sealcustody, categorized evidence tokens, andunder_vaultaccess through realAdventurecap calls, and asserts the fixed calendar, seasonal carry forecast, regional market delivery/replay, construction foundation, construction-job denial/reserve/replay/open-conflict/complete/stale/release/ reserve-after-release paths, agent NPC budget, and one-client party status lines through realAdventurecap calls; - the two-client local co-op proof remains open because the current focused manifest/session launcher path does not yet provide two distinct live Adventure caller-session keys without faking them inside one process;
- replayed room messages are labeled as history, and the named NPC actor proof accepts visible replies whether the player observes them live or through room-history replay after movement;
- the read-only prototype content model rejects malformed room graphs, bad aliases, overlong text, empty proof paths, malformed construction metadata, and invalid agent NPC budget metadata in host tests.
make generated-code-checkfails if the checked-in generated adventure content drifts from the CUE source or generator.
Design Context
The gameplay and future setting plan live in the Aurelian Frontier proposal. The proposal covers the Aurelian frontier setting with magic-warrior and wizard ranks, future mobs, portals, golems, logistics, campaigns, persistent shared world state, multiplayer, and how those mechanics map onto capability-native authority.
Paperclips Terminal Demo
The Paperclips terminal demo is a small clean-room incremental game inspired by
the paperclip maximizer thought experiment and by Frank Lantz’s browser
implementation of that premise. In the focused manifest it now runs as a
Paperclips server plus terminal client launched through the native shell. The
server is authoritative for generated content, resources, GameState,
proof-command gating, unlock checks, and game-rule mutation. The terminal
client owns StdIO, handles the transcript, renders help from server-provided
command specs, plain status from server-provided PaperclipsStatusSnapshot
data, and plain projects from server-provided project entries when connected
to a server, and sends gameplay requests to the server through an explicit
PaperclipsGame endpoint capability.
This is still an early client/server protocol. The server owns regular timer
cadence and the current command list, while command execution still uses raw
text and mostly returns transcript text rather than typed command invocations
or structured UI events. In server mode, PaperclipsGame.status returns a
PaperclipsStatusSnapshot for plain status, and the terminal client renders
the familiar status text locally. PaperclipsGame.projects likewise returns
the unlocked project list for local terminal rendering of plain projects,
while project <id> still executes through the raw text/server-mutating
command path. The backlog tracks broader structured state/events and moving
unlocked command facets behind server-issued capabilities so a future web
client or web-shell gateway can use the same game authority instead of
reimplementing Paperclips logic.
No source code, CSS, images, generated tables, or copied resource files from
the original browser game are checked into this repository. The implementation
uses original Rust code and local CUE content in
demos/paperclips-content/content/paperclips.cue. During development, the
original site and a public mirror were inspected for license/provenance only;
neither exposed a permissive license that would allow copying assets into
capOS.
Reference sources:
- Original game site: https://www.decisionproblem.com/paperclips/
- Public mirror inspected for license/provenance: https://github.com/jgmize/paperclips
- Public gameplay/stage summaries: https://universalpaperclips.fandom.com/wiki/Stages
- Background overview: https://en.wikipedia.org/wiki/Universal_Paperclips
Run It
Use the focused QEMU proof:
make run-paperclips
The scripted proof logs into the shell, launches the child process, drives the opening refusals and business loop, scales production through repeatable marketing and explicit sales, completes a business-phase project chain, asserts the transition to autonomous phase, completes a representative autonomous drone/factory scaling step, transitions into the cosmic phase, proves a bounded probe interval with replication and production, verifies that final conversion remains locked, and then checks clean child and shell exit. The accelerated proof transcript starts with an explicit proof-capability launch, then uses ordinary player commands plus proof-only acceleration and machine-status commands:
run "paperclips" with { stdio: client @stdio, game: client @paperclips_proof_game, proof_accelerator: @proof_accelerator }
status
buy autoclipper
buy wire 1000
buy marketing
make
run 10000
price 99
sell 1
price 25
sell 1
make
run 10000
sell 1
make
run 10000
sell 1
make
run 10000
sell 1
make
run 10000
sell 1
...
project autoclipper-license
project background-jobs
status
run 5000
buy wire 2
run 600000
make
projects
project survey-drones
sell 60
buy marketing
...
project precision-rollers
project design-search
run 600000
project forecast-engine
project survey-drones
project material-harvesters
run 100000
project foundry-lines
run 1000
project mesh-coordination
run 600000
project seed-probes
run 600000
status --json
status
projects
exit
For default manual use, boot the ordinary playground:
make run
After login:
run "paperclips" with { stdio: client @stdio, timer: @timer }
The ordinary make run playground command uses the standalone fallback because
the default manifest does not start the Paperclips server. The focused
make run-paperclips manifest uses run "paperclips" with { stdio: client @stdio, game: client @paperclips_game, timer: @timer }, where the server owns
game state and the client timer drains server-generated status messages while
the player is idle at the prompt. The structured command-list, status-snapshot,
and project-list methods do not change the default manifest or MOTD launch
command.
Useful commands inside the demo:
status
projects
make
sell <n>
price <cents>
buy wire [bundles]
buy autoclipper [count]
buy marketing [count]
buy processor [count]
buy memory [count]
buy drone [count]
buy factory [count]
buy probe [count]
project <id>
help
exit
make starts exactly one manual paperclip. Manual work takes 500 configured
milliseconds before the clip becomes available. Repeating make while work is
pending is refused until the player completes the Background Jobs project;
after that, repeated make commands reserve wire and queue manual jobs behind
the active one. Time advancement reports completed manual cycles before the
status update. Purchase counts are optional and default to one;
explicit zero counts are rejected. Automation advances on configured millisecond
intervals while the process is running. Normal player launches do not expose
run <ms> or status --json; the focused QEMU proof passes an explicit
proof_accelerator capability and uses those commands only as proof
instrumentation.
The shell rejects renaming an ordinary @timer grant into that proof slot.
Blank input repeats the last non-empty command. The first autoclipper is granted
by the Autoclipper License project, which costs cash and trust; repeatable
buy autoclipper [count] purchases appear only after the license grants the
starter autoclipper. Later-stage purchase
commands such as buy drone, buy factory, and buy probe appear in help
only after the corresponding automation path is unlocked. The projects list
shows only unlocked technologies; in server mode that plain listing is rendered
from structured server-provided project entries. Complete the listed projects
and pay their shown costs to reveal later project chains.
The proof-only status --json command prints a single compact JSON object for
scripted assertions when the process receives the proof_accelerator
capability. Normal player launches do not advertise or accept it, and it stays
separate from the structured plain-status snapshot used for terminal
presentation. All fields are numeric and emitted in stable order. stage uses
0=business, 1=autonomous, 2=cosmic, and 3=complete; design and
strategy are the two planning resources, and cosmic_matter maps to the
universe-matter state.
Funds change only when clips are sold explicitly by default. Demand follows a
bounded random walk during the business phase, then price modifies the current
market size for sell <n>. A successful business-phase sale starts a short
CUE-configured market-settlement cooldown, so repeated immediate sales are
refused until timer/proof time advances. Wire is bought in CUE-configured
bundles at a market price that drifts on a slower interval; repeated purchases
add temporary price pressure that decays over later market updates. Repeatable
marketing buys still spend funds, but each new level contributes more demand
than the previous level. The CUE content owns the base marketing gain, walk
thresholds, wire market thresholds, step sizes, sale cooldown, and deterministic
generator parameters. It also has an autoSellEnabled rule for experiments
that should sell during ms, but the checked-in demo keeps it disabled so market
movement is visible.
Content Pipeline
Paperclips uses the same generated-content discipline expected for larger demos, but with a stricter runtime data path:
demos/paperclips-content/content/paperclips.cue
-> cue export --out json
-> tools/paperclips-content-gen
-> schema-validated PaperclipsContent Cap'n Proto bytes in src/generated.rs
-> paperclips-content deserializes the typed Paperclips schema at startup
The CUE file owns the game balance: initial state, purchase costs, millisecond
intervals, explicit/automatic selling policy, demand rules, trust milestones,
project costs, project labels/descriptions, production cadence, later-stage
matter conversion and replication caps, manual-work pacing, unlock thresholds,
and project effects. Rust owns mechanics, validation, command parsing, and the
terminal adapter. make generated-code-check fails if the checked-in generated Cap’n Proto bytes drift
from the CUE source.
Unlock Flow
The tech progression is data-driven by the project list:
- retail phase starts with 10 wire, no cash, manual single-clip production, sales, and early wire purchases;
- Autoclipper License grants the first autoclipper for cash plus trust, then
Background Jobs enables queued manual
makecommands; - repeatable marketing investment raises dynamic demand, while later business projects improve autoclippers, unlock design search, and unlock Strategy generation;
- Survey Drones moves the game into autonomous matter conversion;
- harvester/foundry/mesh projects scale harvesting, production, and compute;
- Seed Probes move the game into cosmic replication;
- Final Conversion completes the run once reachable matter is exhausted.
What It Demonstrates
make run-paperclips currently shows:
capos-shelllaunchespaperclipsas a normal child process;- init launches Paperclips server services before the shell starts;
- the terminal client receives only explicit
StdIOandPaperclipsGameendpoint grants; - server-mode
helpis rendered from the Paperclips server’s structured command specs, so the visible command list follows server-side unlock/proof authority; - server-mode plain
statusis rendered by the terminal client from the Paperclips server’sPaperclipsStatusSnapshot, while proof-onlystatus --jsonstays a separate server-gated command; - server-mode plain
projectsis rendered by the terminal client from the Paperclips server’s structured project list, whileproject <id>execution stays on the server-mutating text command path; - the normal and proof launches use separate server endpoints, so proof-only commands are decided by server-side authority rather than by client text;
- the foreground shell services the child’s stdio bridge while the game runs, so the demo exercises real endpoint IPC between shell and child process;
- the server’s timer capability drives regular automation without ambient clock access by the terminal client;
- the Paperclips server owns generated content,
GameState, unlock checks, proof-command gating, and game-rule mutation for the focused manifest; - a repeatable economic choice (
buy marketing) changes the early business loop before automation is purchased; - representative Stage 1 refusal output remains legible: early locked
buy autoclipper, insufficient-fundsbuy wire 1000, pending manual work, bulk manual rejection, a high-pricesell 1demand refusal, a no-wire manual production refusal, and a lockedproject survey-dronesattempt are asserted in the focused transcript; - manual production and explicit sales fund Autoclipper License, which grants
the first autoclipper and unlocks repeatable
buy autoclipper; - repeatable demand investment remains a purchase path rather than a one-shot project, and the smoke asserts at least five marketing purchases before the phase transition path completes;
- business-phase sales are paced by a timer-backed market-settlement cooldown, and the smoke asserts an immediate repeat sale is refused;
- scaled business-phase production reaches the 10,000-clip trust threshold,
then completes
autoclipper-license,precision-rollers,design-search,forecast-engine, andsurvey-drones; - completing the chain is asserted by
[done]project entries, the visible== autonomous phase ==status line,Automation: 14 autoclippers, 1 drones, 0 factories, 0 probes, and the local matter grant; - the autonomous follow-up completes
material-harvestersandfoundry-lines, runs milliseconds, then assertsAutomation: 14 autoclippers, 5 drones, 2 factories, 0 probes, lower local matter, and additional clip production; - the late-game follow-up completes
mesh-coordination, thenseed-probes, asserts== cosmic phase ==, visible probe replication, lower cosmic matter, and additional clip production, then assertsfinal-conversionremains locked; - the late-game proof also asserts a proof-only
status --jsonline with compact, machine-readable numeric state while preserving the human transcript checks; - the Paperclips server maintains game state without ambient authority;
- the pure rules layer in
paperclips-contentis host-testable separately from the terminal adapter and reads generated Cap’n Proto content data; - exiting the game closes stdio, returns to the shell, and lets the focused manifest halt through the normal debug-exit path.
This is now a coarse client/server game-state demo. It is not yet the final capability-management showcase: help, plain status, and plain projects are structured, command execution is still raw text, broader state/events remain future work, and unlocks are reflected in server-owned command specs/project lists/output rather than transferred command facets. That split is the intended path for a later web client or gateway that uses the same game capabilities.
Current Limits
The demo intentionally implements a compact terminal adaptation, not a browser-accurate port. It has no original artwork, CSS, JavaScript, exact project list, exact balancing, save file, market UI, tournament model, or complete original event text. The host tests cover early mechanics, project locking, the deterministic business-to-autonomous project chain, autonomous resource conversion caps, factory/drone scaling, cosmic probe replication, and completion gating, including a one-real-time-hour upper bound for normal creativity generation. The focused QEMU proof covers launch, the first production loop, one early automation purchase, representative Stage 1 refusal output, business-phase project chaining, the autonomous transition, one timer-driven autonomous scaling action, and a bounded cosmic probe interval. It is representative transcript coverage rather than an exhaustive full playthrough.
Future rule/content expansion is tracked in
Paperclips Terminal Demo. New data-heavy
content should migrate through mkmanifest cue-to-capnp: author bounded CUE,
convert it to a specified Cap’n Proto root with pinned host tools, validate the
result on the host, and keep runtime CUE parsing out of the demo.
Current Design Authority
The current capOS design lives in reader-facing architecture, capability, security, device, configuration, and status pages. Proposal documents remain important design history, but they stop being the primary place to patch a design after that design is implemented or accepted as the working baseline.
Stable Homes
Use these homes for current behavior and accepted contracts:
| Area | Current-design home |
|---|---|
| Boot, manifest, init, processes, rings, IPC, session context, memory, scheduling | docs/architecture/ |
| Capability model, authority accounting, ABI policy | docs/capability-model.md, docs/authority-accounting-transfer-design.md, docs/abi-evolution-policy.md |
| Operator configuration and CUE overlays | docs/configuration.md |
| DMA isolation, device authority, trusted inputs, panic surfaces | docs/dma-isolation-design.md, docs/devices/, docs/trusted-build-inputs.md, docs/panic-surface-inventory.md |
| Current status, roadmap, backlog, task lifecycle | docs/status.md, docs/roadmap.md, docs/backlog/, docs/tasks/ |
| Proposal status and archival decision records | docs/proposals/index.md and individual files under docs/proposals/ |
When a current-design home already exists, future implementation slices update that page. When none exists and the proposal has become the working design, create or extend a stable page in the nearest existing area instead of leaving the proposal as the only current reference.
Proposal Lifecycle
The proposal index classifies proposals with these roles:
- Active design: near-term design work still being changed before or during implementation. It may remain the primary working document while the design is not stable.
- Accepted design: selected direction. It can guide implementation, but any implemented subset needs a stable current-design page or an explicit pointer to the page that already owns the current contract.
- Partially implemented: some behavior is in tree. The proposal must distinguish present behavior from planned behavior, and current pages should describe the implemented subset.
- Implemented: the proposal is an archival decision record. Future changes update the stable current-design docs and code references first; the proposal changes only for archival status, links, or corrected history.
- Superseded or Rejected: historical records. They should point at the replacement or rejection rationale and must not be cited as current behavior.
Initial Promotions
This repository already had stable homes for several implemented or accepted designs. The initial promotion set makes the weakest current-authority links explicit:
| Proposal or decision | Current-design authority | Disposition |
|---|---|---|
| Session-Bound Invocation Context | docs/architecture/session-context.md, with endpoint transport details in docs/architecture/ipc-endpoints.md | Implemented proposal becomes archival history. |
| Error Handling | docs/architecture/error-handling.md, with ring transport details in docs/architecture/capability-ring.md | Implemented proposal becomes archival history. |
| System Configuration | docs/configuration.md and docs/architecture/manifest-startup.md | Implemented proposal stays as rationale and closeout history. |
| DMA Assurance Model | docs/dma-isolation-design.md | Accepted design remains grounded in the stable DMA design page. |
| SMP and Scheduler Evolution | docs/architecture/threading.md and docs/architecture/scheduling.md | Accepted design feeds current scheduler and threading contracts. |
Follow-up promotions should focus on proposals whose implemented slices are large enough that readers still have to mine proposal text for current behavior. Good candidates include storage/naming, installable system, SystemInfo/System Manual, and userspace driver relocation once their current contracts settle further.
Boot Flow
Boot flow defines the trusted path from firmware-owned machine state to the first user processes. It establishes memory management, interrupt/syscall entry, capability tables, process rings, and the boot manifest authority graph.
Current Behavior
Firmware loads Limine, Limine loads the kernel and exactly one module, and the
kernel treats that module as a Cap’n Proto SystemManifest. The kernel rejects
boots with any module count other than one.
kmain initializes serial output, x86_64 descriptor tables, memory, paging,
SMEP/SMAP, the kernel capability table, the idle process, PIC, and PIT. It then
parses the manifest, validates the kernel-owned boot boundary, loads only
initConfig.init.binary into a fresh AddressSpace, builds init’s bootstrap
capability table and read-only CapSet page from initConfig.init.caps, enqueues
init, and starts the scheduler.
Default boot uses the standalone init ELF as that init process. It receives
the bootstrap authority needed to read BootPackage, validate the service graph,
spawn child services, and supervise them. The foreground capos-shell is now
an init-started service with the terminal, credential, session, audit, and
broker capabilities needed for the local shell flow; it does not receive
BootPackage or broad ProcessSpawner authority. Focused shell-led manifests such
as system-smoke.cue and system-shell.cue still boot capos-shell directly
as initConfig.init for narrow login/shell proofs until the run-target/init
policy cleanup migrates them.
flowchart TD
Firmware[UEFI or QEMU firmware] --> Limine[Limine bootloader]
Limine --> Kernel[kmain]
Limine --> Module[manifest.bin boot module]
Kernel --> Arch[serial, GDT, IDT, syscall MSRs]
Kernel --> Memory[frame allocator, heap, paging, SMEP/SMAP]
Kernel --> Manifest[validate kernel manifest boundary]
Manifest --> InitImage[parse and map init ELF]
Manifest --> InitCaps[build init CapTable and CapSet page]
InitImage --> InitProcess[create init Process and ring]
InitCaps --> InitProcess
InitProcess --> Scheduler[start round-robin scheduler]
Scheduler --> Init[enter init]
Init --> DefaultPath[default init-owned service graph]
DefaultPath --> Shell[spawn capos-shell service]
DefaultPath --> Gateway[spawn remote-session gateway and resident services]
Init --> SpawnPath[focused system-spawn executor path]
SpawnPath --> BootPackage[read BootPackage manifest]
SpawnPath --> Spawner[spawn child services]
Spawner --> Children[focused demo processes]
The invariant is that the kernel starts only initConfig.init after validating
the kernel-owned manifest boundary, and no child service starts until
mkmanifest/init validation has accepted service binary references, authority
graph structure, and bootstrap capability source/interface checks.
Design
The boot path is deliberately single-shot. The kernel receives a single packed
manifest and validates only the kernel-owned boot contract before creating
init. Init then performs the userspace execution step: it reads manifest chunks
from BootPackage, validates a metadata-only ManifestBootstrapPlan, resolves
kernel and service cap sources, and asks ProcessSpawner to load each child ELF
into its own address space with its own user stack, TLS mapping if present, ring
page, and CapSet mapping.
The default manifest (system.cue) now boots an init-owned local path: the
kernel launches the standalone init binary described by initConfig.init,
and init spawns the shell, remote-session CapSet gateway, and resident services
from initConfig.services. The shell mints an anonymous UserSession on
startup through SessionManager.anonymous(), receives an empty-allowlist
anonymous launcher from the broker, and waits at its own interactive prompt.
The user types login (or setup on a fresh image) to upgrade in place. The
smoke and shell manifests still provide focused shell-led proofs, while
system-spawn.cue remains the focused init-owned graph retained for
ProcessSpawner validation.
Invariants
- Limine must provide exactly one boot module, and that module is the manifest.
- Kernel manifest validation must complete before init is enqueued, and init BootPackage validation must complete before any child service is spawned.
- Service ELF load failures roll back frame allocations before boot continues or fails.
- Kernel page tables are active and HHDM user access is stripped before SMEP/SMAP are enabled.
- The kernel passes
_start(ring_addr, pid, capset_addr)in RDI, RSI, and RDX. - CapSet metadata is read-only user memory; the ring page is writable user memory.
- QEMU-feature boots halt through
isa-debug-exitwhen no runnable processes remain.
Code Map
kernel/src/main.rs-kmain, manifest module handling, validation, boot-only-init loading, process enqueue, halt path.kernel/src/spawn.rs- ELF-to-address-space loading, fixed user stack, TLS mapping,Processconstruction helpers.kernel/src/process.rs- process bootstrap context, ring page mapping, CapSet page mapping.kernel/src/cap/mod.rs- bootstrap capability resolution and CapSet entry construction for init.capos-config/src/manifest.rs- manifest decode and schema-version storage.capos-config/src/validation.rs- graph/source/binary validation policy.tools/mkmanifest/src/lib.rs- host-side manifest validation and binary embedding.system.cueandsystem-spawn.cue- default and spawn-focused boot graphs.limine.confandMakefile- bootloader config, ISO construction, QEMU targets.
Validation
make run-smokevalidates the scripted focused shell-led login path: singlecapos-shellinit boot fromsystem-smoke.cue, password prompt, failed-auth redaction, successful shell launch, narrow shell bundle, and clean QEMU halt.make runis the operator-facing interactive boot path with the terminal UART on stdio and console/debug output logged separately.make run-spawnvalidates that the kernel boot-launches only the standaloneinitwith Console, BootPackage, and ProcessSpawner, and that init validates BootPackage metadata before running the focused ProcessSpawner, Timer, IPC, and memory smokes.cargo test-configcovers manifest decode, roundtrip, and validation logic.cargo test-mkmanifestcovers host-side manifest conversion and embedding checks.make generated-code-checkverifies checked-in Cap’n Proto generated output.
Open Work
- The run-target/init-policy backlog still needs to migrate remaining focused shell-led manifests onto standalone init or explicitly preserve them as compatibility smokes.
- A future manifest-loader or mkmanifest gate should reject accidental
non-
initdefault boot graphs once all focused exceptions are reconciled.
Manifest and Service Startup
The manifest is the boot package and init configuration. It names embedded binaries, the single kernel-launched init process, kernel boot parameters, and the init-owned service graph used by focused executor manifests.
Current Behavior
tools/mkmanifest requires the repo-pinned CUE compiler, evaluates
system.cue, embeds declared binaries, validates binary references and the
init-owned authority graph under initConfig, serializes SystemManifest, and
places manifest.bin into the ISO. The kernel receives that file as the single
Limine module. The diagram below is intentionally large: it separates the
default init-owned boot path from the focused spawn-proof path.
flowchart TD
Cue[system.cue or system-spawn.cue] --> Mkmanifest[tools/mkmanifest]
Binaries[release userspace binaries] --> Mkmanifest
Mkmanifest --> Manifest[manifest.bin SystemManifest]
Manifest --> Limine[Limine boot module]
Limine --> Kernel[kernel parse and validate]
Kernel --> InitCaps[init CapTable and CapSet page]
InitCaps --> Init[enter initConfig.init process]
Init --> ShellPath[default system.cue: spawn shell/remote CapSet gateway/services]
Init --> SpawnPath[focused system-spawn.cue: standalone init executor]
SpawnPath --> BootPackage[BootPackage.readManifest chunks]
BootPackage --> Plan[capos-config ManifestBootstrapPlan validation]
SpawnPath --> Spawner[ProcessSpawner.spawn]
Spawner --> Children[init-spawned child processes]
The default manifest starts only initConfig.init from the kernel, and that
process is now the standalone init ELF. Init receives the bootstrap authority
needed to read BootPackage, validate initConfig.services, spawn the
foreground shell, remote-session CapSet gateway, resident chat service, and
other default services, then wait according to the manifest policy. The shell
is an init-started service; it receives terminal, credential-store,
session-manager, audit-log, and authority-broker caps, mints its own anonymous
UserSession, and waits for an explicit login or setup command before
upgrading. It never holds BootPackage or broad ProcessSpawner authority.
Focused shell-led manifests such as system-smoke.cue and system-shell.cue
still put capos-shell directly in initConfig.init for narrow login/shell
proofs. That compatibility path is tracked by the run-target/init-policy
backlog and should not be confused with the default system.cue boot path.
The focused system-spawn.cue manifest still puts the standalone init ELF in
initConfig.init.
There, init receives ProcessSpawner, a read-only BootPackage cap, and
Console. It reads bounded manifest chunks into a metadata-only
capos-config::ManifestBootstrapPlan, validates binary references, authority
graph structure, exports, cap sources, and interface IDs, then spawns the
focused smoke services. Low-level spawn grants still model receiver selectors
for hostile and compatibility proofs, but normal shell client @... grants
omit selector syntax and preserve delegated client endpoint identity. Raw
parent-capability grants must preserve the source hold metadata,
endpoint-client grants may mint selectors only from an endpoint owner or a
ProcessSpawner-returned parent endpoint facet without widening it to server
authority, and kernel-source Endpoint, FrameAllocator, VirtualMemory, Timer,
ThreadControl, ThreadSpawner, and EntropySource grants mint fresh child-local
caps without receiver selectors. QEMU-only PersistentStore grants mount the
root store through the same child-local kernel-source path when a focused proof
manifest names that source. Endpoint kernel grants also return parent-side
client facets as ProcessSpawner result caps so init can wire later
service-sourced imports without ever holding child endpoint owner caps.
mkmanifest cue-to-capnp is the adjacent general conversion path for
CUE-authored data that should not become part of SystemManifest. It evaluates
the input with the same pinned CUE compiler, package mode, tag injection, and
CAPOS_CUE_TAGS handling as the manifest path, then passes the exported JSON
to the pinned Cap’n Proto compiler through capnp convert json:binary. The
caller supplies the .capnp schema file, root struct type, output path, and
optional Cap’n Proto import paths. This is schema-aware serialization for data
messages rooted at arbitrary specified structs; it is not a live capability or
interface-object serialization path.
Design
Manifest validation has three layers:
- Kernel bootstrap references: binary names are unique,
initConfig.init.binaryresolves, referenced payloads are non-empty, and init kernel cap sources match their expected interface IDs. - Init-owned binary references:
initConfig.services[*].binaryreferences resolve before the executor spawns children. - Init-owned authority graph: service names, cap names, export names, and service-sourced references are unique and resolvable; re-exporting service-sourced caps is rejected.
- Init-owned cap sources: expected interface IDs match kernel sources or declared service exports.
Kernel startup now resolves only initConfig.init.caps. Init performs service
execution in two userspace passes. The preflight pass walks
initConfig.services in manifest order, resolves kernel and service-sourced
caps against init grants and prior exports, and rejects an unstartable graph
before spawning children. The spawn
pass grants caps in declaration order, records declared exports, keeps owned
parent client facets for exported child endpoints, and attenuates endpoint
exports to client-only facets for importers. After every child is spawned, init
drops and flushes those parent facets before waiting on children; a dropped init
facet therefore cannot owner-cancel queued, pending, or in-flight child endpoint
state.
Invariants
- The manifest is schema data plus an init config tree, not shell script or ambient namespace.
- Omitted cap sources fail closed.
- Cap names within one service are unique and are the names userspace sees in CapSet.
- Service exports must name caps declared by the same service.
- Service-sourced imports must reference a declared service export.
- Endpoint exports to importers must be attenuated to client-only facets.
- Init must not hold endpoint owner caps for child-local manifest endpoints.
expectedInterfaceIdchecks compatibility; it is not the authority selector.- Legacy receiver metadata travels with cap-table hold edges and endpoint invocation metadata. Spawn-time client endpoint minting may carry the requested child selector only from owner or trusted parent endpoint result sources instead of copying the parent’s hold selector. Client facets received through ordinary spawn grants are not selector-minting authority for later spawns. Caller-selected endpoint badges are transitional compatibility state; session-bound invocation context plus broker-granted service roots/facets is the target shared-service authority model.
Code Map
schema/capos.capnp-SystemManifest,NamedBlob,SystemConfig,KernelCapSource, and genericCueValuestorage forinitConfig.capos-config/src/manifest.rs- manifest structs,initConfigCUE parsing, capnp encode/decode, metadata-onlyManifestBootstrapPlan, and schema-version storage.capos-config/src/validation.rs- kernel bootstrap, init-owned graph, binary-reference, and capability-source validation policy.tools/mkmanifest/src/lib.rsandtools/mkmanifest/src/main.rs- host-side manifest build pipeline, binary embedding, and general CUE-to-Cap’n Proto data-message conversion.kernel/src/main.rs- kernel manifest module parse and validation.kernel/src/cap/mod.rs- bootstrap cap creation and CapSet entry construction for init.kernel/src/cap/boot_package.rs- read-only manifest-size and chunked manifest-read capability.kernel/src/cap/process_spawner.rs- init-callable spawn path for packaged boot binaries.capos-rt/src/client.rs- typed BootPackage and ProcessSpawner clients.init/src/main.rs- BootPackage manifest reader, graph preflight, generic spawn loop, hostile spawn checks, and child waits.system.cueandsystem-spawn.cue- default init-owned login/service graph and focused init-owned spawn manifests usinginitConfig.
Validation
cargo test-configvalidates manifest decode, CUE conversion, graph checks, source checks, and binary reference checks.cargo test-mkmanifestvalidates host-side manifest conversion, embedded binary handling, pinned CUE path/version checks, pinned Cap’n Proto path/version checks, and schema-aware JSON-to-binary conversion throughcapnp convertwhenCAPOS_CAPNPis available.make run-smokevalidates the focused shell-led scripted login manifest: singlecapos-shellinit boot fromsystem-smoke.cue, failed-auth redaction, successful password auth, broker-issued shell launch, terminal isolation, and clean halt.make runis the operator-facing interactive boot path with the terminal UART on stdio and console/debug output logged separately.make run-spawnvalidates the narrowersystem-spawn.cuegraph: the kernel boot-launches only standaloneinit, init validates BootPackage metadata, ProcessSpawner launches each focused child service, grants Timer to the timer smokes, and init waits for them.make generated-code-checkvalidates schema-generated Rust stays in sync.
Open Work
- The run-target/init-policy backlog still needs to migrate remaining focused
shell-led manifests or preserve them as explicit exceptions, then add a
manifest-loader or mkmanifest guard against accidental non-
initdefault boot graphs. - Service object identity migration still needs to retire caller-selected endpoint badge syntax from normal manifest paths. Normal shell paths already reject explicit client-grant selector syntax; low-level hostile fixtures and manifest-scoped non-identity encodings such as TCP listen ports remain separate cases.
Process Model
The process model defines how capOS represents isolated user programs, how they receive authority, how they enter and leave the scheduler, and how a parent can observe a child.
Current Behavior
A Process currently owns a user address space, a per-process capability
table, a ring scratch area, a mapped capability ring, an optional read-only
CapSet page, private thread/kernel-stack ledgers, and one or more Thread
records. Process IDs are assigned by an atomic counter. The scheduler names
current execution, run queues, direct IPC handoff, and blocking waiters with
generation-checked ThreadRef values. Each thread owns its kernel stack,
saved CPU context, FS base, and cap_enter blocking state, while address
space, capability table, ring, CapSet, and resource accounting stay
process-owned.
ELF images are loaded into fresh user address spaces. PT_LOAD segments are
mapped with page permissions derived from ELF flags, the user stack is fixed at
USER_STACK_BASE (0x100_0000 as of WASI Phase W.2 sub-slice 1; see
capos-config/src/process_layout.rs for the canonical layout) with a
linker-enforced image limit below it, and PT_TLS data is mapped into a
per-process TLS area below the ring page. The process starts from a synthetic
CpuContext that returns to Ring 3 with iretq.
ProcessSpawner lets a holder spawn packaged boot binaries, grant selected
caps to the child, and receive result caps. Every successful spawn returns a
non-transferable ProcessHandle; child-local endpoint kernel grants also return
parent-side client facets so a supervisor can wire imports without sharing
endpoint owner authority. ProcessHandle.wait either completes immediately for
an already-exited child or registers one waiter. Child-local ThreadControl
grants give runtimes ownership of their current FS base and current-thread
exit. Child-local ThreadSpawner grants let a process create additional
in-process threads and receive process-local ThreadHandle result caps for
join, detach-on-release, and exit-code observation.
Design
Process construction separates image loading from capability-table assembly.
Default boot maps only init in the kernel and gives it a bootstrap CapSet.
Spawned children use the same image loading and Process creation helpers, but
their grants are supplied by the calling process through ProcessSpawner.
Init resolves service-sourced manifest imports against previously recorded
exports before asking ProcessSpawner to create each child.
Each process starts with three machine arguments:
RDI- fixed ring virtual address (RING_VADDR).RSI- process ID.RDX- fixed CapSet virtual address, or zero if no CapSet is mapped.
Exit releases authority before the Process storage is dropped. The scheduler
switches to the kernel page table before address-space teardown, cancels
endpoint state for the exiting pid, completes any pending process waiter, and
defers the final process drop until execution is on another kernel stack.
Future process lifecycle work should keep authority transfer explicit: parents should not gain ambient access to child internals, and child grants should come from named caps plus interface checks.
The 7.1.0 in-process threading contract is documented in
In-Process Threading. It defines ThreadSpawner and
ThreadHandle as process-local authorities, preserves ProcessHandle as the
parent-facing whole-process lifecycle handle, and keeps process exit as the
operation that releases shared capability authority.
Invariants
- A process cannot access a resource unless its local
CapTableholds a cap. - Bootstrap CapSet metadata is immutable from userspace.
- A stale
CapIdgeneration must not name a reused cap-table slot. ProcessSpawnerraw grants require a copy-transferable cap or an endpoint owner cap; client-endpoint grants require an endpoint owner orProcessSpawnerendpoint result source and never add receive or return authority.ProcessSpawnerkernel-source Endpoint, FrameAllocator, VirtualMemory, ThreadControl, ThreadSpawner, and EntropySource grants are fresh child-local caps and cannot be badged. QEMU-only PersistentStore grants mount a Store cap through the child-local kernel-source path for focused persistence proofs. Endpoint kernel grants are exportable only through returned parent client facets, not through a shared owner cap in init.ProcessHandlecaps are non-transferable.ThreadHandlecaps are process-local, non-transferable, and observe only one thread in the same process.- At most one waiter may be registered on a
ProcessHandle. - Process exit releases cap-table authority before the kernel stack frame is freed.
Code Map
kernel/src/process.rs-Process, bootstrap CPU context, ring/CapSet mapping, exit capability cleanup.kernel/src/spawn.rs- ELF mapping, stack mapping, TLS mapping, process construction helpers.kernel/src/sched.rs- process table, process handles, wait completion, exit path.docs/architecture/threading.md- frozen 7.1.0 contract for process-owned versus thread-owned state, creation, FS-base, and join/exit behavior.kernel/src/cap/process_spawner.rs-ProcessSpawnerCap,ProcessHandleCap, spawn grant validation, child-local kernel grants, child CapSet construction.capos-lib/src/cap_table.rs-CapIdgeneration and cap-table operations.capos-config/src/capset.rs- fixed CapSet page ABI.schema/capos.capnp-ProcessSpawner,ProcessHandle, andCapGrant.init/src/main.rs- BootPackage manifest validation, generic spawn loop, child waits, and hostile spawn checks.
Validation
make run-smokevalidates init-owned default service startup,ProcessSpawner,ProcessHandle.wait, child grants, exit cleanup, and clean halt.make run-spawnvalidates the narrower ProcessSpawner graph for endpoint, IPC, VirtualMemory, FrameAllocator cleanup, and hostile spawn failures.cargo test-libcoversCapTablegeneration, stale-slot, and transfer primitives.cargo test-configcovers CapSet and manifest metadata used to build process grants.cargo build --features qemuverifies the kernel and QEMU-only paths compile.
Open Work
- Add lifecycle operations such as kill and post-spawn grants only after their authority semantics are explicit.
- Implement restart policy outside the kernel-side static boot graph.
Session Context
Session-bound invocation context is the current shared-service identity model. Capabilities decide what a process may invoke. The process session supplies the privacy-preserving subject context for the invocation. Request payloads, manifest strings, and legacy endpoint receiver metadata do not identify the caller and must not authorize service behavior by themselves.
Current Behavior
Every normal workload process has one immutable SessionContext installed
through trusted spawn, session-manager, or broker paths. Endpoint CALL delivery
includes a scoped caller-session reference plus freshness metadata by default.
The server does not receive a global principal, account, profile, display name,
auth source, or tenant field unless the call explicitly requests disclosure and
the invoked service/facet has a matching disclosure scope.
The current endpoint ABI carries:
scoped_refandscoped_ref_hi: a 128-bit opaque caller-session reference derived from a boot secret, endpoint service scope, and kernel session id;epoch: a domain-separated freshness/audit value for the same service scope;- liveness/freshness state used to fail closed for stale ordinary sessions.
The reference is service-scoped and non-portable. A value observed by one service is not authority and is not a stable global identity in another service.
Authority Split
Capability possession answers whether a process may invoke a target. Session context answers who the invocation is attributable to, whether the session is fresh enough, which resource/accounting bucket should be used, and which subject facts may be disclosed.
The service decision is therefore layered:
- capability authority;
- invocation subject context;
- service-local policy and state.
For example, holding ChatRoot lets a process ask chat to join. The caller’s
live session supplies the subject context. The chat service may key its
per-session state by the opaque reference and may request bounded disclosure
when its method contract and broker policy allow it.
Disclosure
Disclosure is opt-in and field-granular. A service receives broader subject facts only when both conditions hold:
- the method or call shape explicitly requests the fields;
- the invoked capability or broker-granted facet carries a service-scoped disclosure scope allowing those fields.
Without both, endpoint metadata stays opaque. Services that need display names, profile classes, or audit labels should request only those fields and treat them as service-local policy input, not as independent authority.
Transfer And Liveness
Cross-session capability transfer is allowed only when the transferred cap’s
transfer scope permits it. A transferred cap carries invoke authority; the
receiver’s session remains the invocation subject. service_regrant_only caps
cannot cross sessions through raw copy, move, endpoint IPC, or spawn grants;
a trusted service or broker regrant path must mint target-session authority.
Ordinary endpoint calls from logged-out or expired sessions fail closed. The
current liveness implementation is a Live/LoggedOut state cell plus expiry;
administrator revocation, recovery-only session modes, and renewal/recovery
caps remain future lifecycle work. Fixed wall-clock expiry remains a bounded
guardrail, not complete production interactive-session lifecycle UX.
Code Map
kernel/src/session_context.rs- kernel session records, liveness state, and scoped reference derivation.kernel/src/cap/endpoint.rs- endpoint caller-session delivery and stale endpoint/session checks.kernel/src/cap/transfer.rsandcapos-lib/src/cap_table.rs- transfer-scope validation and rollback.kernel/src/cap/session_manager.rs- session creation andUserSessionresult-cap minting.kernel/src/cap/user_session.rs-UserSessioncapability behavior.kernel/src/cap/restricted_launcher.rsandkernel/src/cap/authority_broker.rs- broker and launch surfaces that mint session-scoped bundles.capos-rt/src/client.rs- runtime clients that observe session, endpoint, and logout behavior.docs/architecture/ipc-endpoints.md- endpoint transport and transfer rules.docs/architecture/process-model.md- spawn and process ownership model.
Validation
make run-session-contextcovers process-session invariants, default endpoint caller-session metadata, stale normal endpoint rejection, transfer scopes, and disclosure gating.make run-capnp-chat-interopand the chat/adventure smokes cover ordinary service state keyed by live caller-session metadata instead of caller-chosen selectors.make run-remote-session-capset-interopand focused remote-session UI smokes cover DTO gateway logout/close propagation.make run-ssh-public-key-sessioncoversUserSession.auditContext, explicit logout idempotence, and post-logout fail-closed reads.
Open Work
- Administrator revocation, renewal/recovery, live proxy cleanup, and audit reason separation remain future lifecycle work.
- Stable service-audit identity across endpoint replacement or service upgrade needs a future service-audit scope.
- Delegated act-on-behalf-of subject context is a separate future design, not part of the completed session-bound invocation context milestone.
- A dedicated result-cap move-source rollback proof is still needed before fixed expiry is treated as the whole production session lifecycle.
Design Grounding
The archival decision record is Session-Bound Invocation Context. The superseded direction is Superseded: Service Object Capabilities. Capability-system precedent is summarized in Capability-Based and Microkernel Operating Systems Survey.
In-Process Threading Contract
This page records the implemented contract for kernel-managed threads inside
one process. The park authority contract is frozen separately in
Park Authority. These pages are the handoff from the initial
single-thread runtime checkpoint to same-process SMP work. The current slice
has per-thread completion rings for spawned child threads, per-CPU WFQ run
queues with bounded stealing, a caller-thread-bound SchedulingPolicyCap,
and a SchedulingContext cap that records identity, bind/revoke,
dispatcher budget charging/replenishment, bounded endpoint donation/return,
and fixed depletion/deadline notification cells. Same-process sibling
scheduling has formal accepted 1-to-2 evidence on capos-bench 2026-05-02
21:38 UTC against main commit 374f8556 (capOS work 1.883x / total
1.787x, both clearing the configured 1.6x gates; matching Linux pthread
baseline 1.988x/1.987x on the same physical-core pin set). The
2026-05-02 1-to-4 row was the diagnostic that justified Phase D’s fair-share
enqueue policy: capOS sat at 1.566x/1.538x while Linux scaled to
3.963x/3.858x. Phase D now runs per-CPU WFQ queues with bounded stealing
and manually accepted the 2026-05-10 1-to-4 diagnostic row
(3.088x/2.700x) while the harness-enforced gate remains 1-to-2
work/total speedup; see docs/benchmarks.md for the full evidence table
including historical pre-collapse rows. Phase F has landed the
one-SQ-consumer prerequisite, nohz telemetry, housekeeping/deferred-work
placement, the clockevent/deadline substrate, and bounded SQPOLL ring mode
including the non-periodic SQPOLL producer-wake progress path; the first
automatic nohz activation increment is closed via
docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md and
SQPOLL-driven auto-nohz activation is also closed via
docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md; generic
full-nohz for ordinary budgeted compute leases and timeout-based auto-revoke are
landed; policy-service AutoNoHz issuance remains future work.
Scope
The threading milestone changes the scheduler’s unit of execution from process
to thread while keeping the process as the authority, address-space, and
resource-accounting boundary. Same-process sibling scheduling on multiple CPUs
is functional for per-thread-ring processes. The accepted 1-to-2 performance
claim is now the formal capos-bench 5-run pair recorded on 2026-05-02
21:38 UTC against main commit 374f8556: capOS work 1.883x and total
1.787x clear the configured 1.6x gates; the matching Linux pthread
baseline on the same physical-core pin set (0,1,2,3) records
1.988x/1.987x, validating the workload shape. The 2026-05-02 1-to-4 row
was the diagnostic that justified Phase D: capOS sat at 1.566x/1.538x
while Linux scaled to 3.963x/3.858x. Phase D now runs per-CPU WFQ queues
with bounded stealing and its 2026-05-10 1-to-4 row (3.088x/2.700x) was
manually accepted from recorded diagnostics; the harness-enforced gate remains
1-to-2 work/total speedup. Historical pre-collapse rows and the post-collapse
3-run diagnostic remain in docs/benchmarks.md for reference. Phase E adds
the SchedulingContext cap (identity, caller-thread bind, revoke, budget
charging/replenishment, bounded synchronous endpoint donation/return, and
fixed depletion/deadline notification cells with drain observer results),
and Phase F has landed the bounded SQPOLL ring mode plus the
clockevent/deadline substrate. Automatic nohz activation, realtime
admission, and privileged userspace scheduler-policy services remain later
work.
This contract covers:
- process-owned versus thread-owned state;
- the initial thread creation ABI;
- per-thread FS-base/TLS rules;
- thread exit and join semantics;
- the per-thread ring blocking and completion-routing contract;
- the caller-thread-bound
SchedulingPolicyCapandSchedulingContextsurfaces that mutate per-thread WFQ weight/latency-class and per-thread scheduling-context binding; - the handoff to the 7.1.1 park authority design.
Ownership Split
The process remains the security boundary. All threads in one process share the same address space and capability table, so a thread has the same authority as its sibling threads.
| Process-owned state | Thread-owned state |
|---|---|
| Process id and process generation | Thread id and thread generation |
| User address space and CR3 | Saved CPU context and user register state |
| Capability table and resource ledger | Kernel stack and syscall stack top |
| Initial compatibility ring and ring arena ownership | Per-thread ring endpoint, scratch, and FS base |
| Read-only CapSet page | Scheduling/blocking state |
| ProcessHandle exit state | ThreadHandle join/exit state |
| Endpoint owner state and process-wide cleanup hooks | WFQ weight, latency class, virtual runtime, and virtual_finish_ns enqueue tag |
| Process-wide resource ledgers (thread records, kernel stacks, cap-table slots) | SchedulingContext binding (identity/generation, remaining budget, replenish/deadline timestamps, donation/return slot, notification recorder) |
The implementation migrated incrementally. The 7.2.0 slice made each process
contain a single initial Thread, with saved context, kernel stack, FS base,
and blocking state stored on that thread. Later slices changed scheduler-owned
queues, current execution, direct IPC handoff, and wake records to
generation-checked ThreadRef values, added creation and lifecycle caps, and
then assigned per-thread rings to spawned children.
Scheduler Contract
Scheduler stores runnable execution contexts as thread
references, not process ids. A thread reference is (pid, process_generation, tid, thread_generation). The process generation keeps handles from naming a
reused process; the thread generation keeps handles from naming a reused
thread slot inside a live process.
This identity applies to Scheduler.current, run queues, direct IPC targets,
Timer sleep waiters, process/terminal waiters, endpoint caller/receiver wake
records, and deferred cancellation state.
Runnable ownership is split across per-CPU run queues
(SCHEDULER_CPUS = 4). Each queue is ordered ascending by
virtual_finish_ns, which is recomputed per enqueue from
virtual_runtime_ns, the thread’s WFQ weight (clamped to
[MIN_WEIGHT, MAX_WEIGHT] in capos-abi::scheduler), and a per-class
slice scaled by LatencyClass (Interactive divides the slice,
Batch multiplies it, Normal/IpcServer pass it through). Default
placement targets the current CPU; a bounded steal path balances when a
CPU’s local queue is empty, recomputes the WFQ tag at the destination,
and records placement-spread / steal migrations under the measure
feature. Each per-CPU queue is reserved at thread-create time to the live
runnable-capable thread count so timer-tick, unblock, direct-IPC fallback,
and steal-requeue paths never allocate.
The run queue, current, direct IPC target, and blocked waiter scans are
thread-oriented. Address-space switches happen only when the next runnable
thread belongs to a different process. TSS.RSP0, the syscall kernel stack, and
FS base are updated on every thread switch because those are thread-local
machine resources. Per-thread runtime_ns advances 1:1 with elapsed CPU
time; virtual_runtime_ns advances by
elapsed_ns * REFERENCE_WEIGHT / weight so weight changes the cumulative
WFQ share rather than just an enqueue tie-breaker.
SchedulingContext bindings layer dispatcher budget on top of WFQ. A
thread may carry at most one SchedulingContextThreadBinding. While
bound, the dispatcher charges elapsed time against the binding’s
remaining_budget_ns, replenishes from period_ns at the next replenish
boundary, records deadline_or_timeout and budget_depleted
notifications in the per-context fixed cells, and routes synchronous
endpoint donation/return for passive receiver threads (donated_holder
in the notification snapshot tracks whether the holder is the donor or
the receiver). Stale-generation or revoked caps fail closed before
mutating scheduler state. Realtime-island admission, CPU placement
enforcement, and overrun-fault policy remain deferred.
The idle path is a per-CPU CPL0 (kernel-mode) idle thread; the former
special user-mode idle process has been removed. Each CPU’s idle thread is a
kernel-owned execution context — it runs on the kernel PML4 with a dedicated
idle kernel stack and cannot block, exit, or hold ordinary caps. A lightweight
synthetic idle Process record is retained per CPU only so the idle
ThreadRef resolves through scheduler bookkeeping; it maps no user code,
stack, or cap ring. See the “Idle paths” section of
docs/architecture/scheduling.md.
Phase F has landed the one-SQ-consumer prerequisite, nohz telemetry,
housekeeping/deferred-work placement, the clockevent/deadline substrate,
and a bounded SQPOLL ring-mode worker (MAX_SQPOLL_WORKERS = 16,
request_sqpoll_start_for_thread / finalize_pending_sqpoll_start_for_thread
with stale-owner rollback). Tick suppression now exists behind explicit
CpuIsolationLease admission, including ordinary budgeted compute leases that
target a live SchedulingContext; policy-service AutoNoHz issuance and generic
SQPOLL nohz for arbitrary rings remain future work.
Thread Creation ABI
Thread creation is exposed through a process-local ThreadSpawner capability.
It creates threads only in the caller’s current process. It does not grant
authority to another process and is non-transferable across IPC in the initial
implementation.
The initial control-plane shape is:
interface ThreadSpawner {
create @0 (
entry :UInt64,
stackTop :UInt64,
arg :UInt64,
fsBase :UInt64,
flags :UInt64
) -> (handleIndex :UInt16);
}
interface ThreadHandle {
join @0 () -> (exitCode :Int64);
exitCode @1 () -> (exited :Bool, exitCode :Int64);
}
interface ThreadControl {
getFsBase @0 () -> (fsBase :UInt64);
setFsBase @1 (fsBase :UInt64) -> ();
exitThread @2 (code :Int64) -> ();
}
Any 7.2 schema adjustment must update this page in the same branch before
implementation review. The stable semantics are that creation is in-process,
the returned handle is an observed result cap, ThreadHandle observes one
thread rather than the whole process, and current-thread exit is available
through both ThreadControl.exitThread and the raw exit(code) syscall.
The new thread starts in Ring 3 at entry with:
RDI = arg;RSI = tid;RDX = pid;RCX = the current thread's ring address;R8 = CAPSET_VADDR, or zero if the process has no CapSet.
The runtime supplies the user stack and TLS block. The kernel validates that
entry, stackTop, and fsBase are user-canonical, that stackTop is
16-byte aligned at entry, and that reserved flags bits are zero. Page
presence and stack-growth policy remain process address-space questions;
before a page-fault subsystem exists, an invalid thread stack can fault the
process.
Resource Accounting
Thread creation allocates kernel memory and is quota-backed by process-owned
ledger state, not per-capability helper counters. The 7.2.0 checkpoint charges
the initial thread during process creation; ThreadSpawner.create extends the
same ledgers to additional threads. The ledger of record is:
PROCESS_THREAD_LIMIT, the maximum live or retained thread records in one process, initially 16;PROCESS_THREAD_KERNEL_STACK_PAGES, initially matching the current per-thread kernel stack allocation size of 32 pages;thread_records_used/thread_records_max;thread_kernel_stack_pages_used/thread_kernel_stack_pages_max.
The initial process thread charges one thread record and one kernel-stack
allocation during process creation. ThreadSpawner.create reserves a thread
record and kernel-stack page budget before allocating the stack or publishing a
ThreadHandle; every later failure rolls both reservations back before
returning. Cap-slot reservation for the result handle remains charged to the
existing process cap-table ledger.
Creation failures are controlled application exceptions. Thread count,
kernel-stack budget, handle cap-slot exhaustion, and kernel stack allocation
failure return Overloaded with a specific message and no partially runnable
thread. Invalid entry, stack, FS base, or flags return Failed.
Thread exit releases the kernel stack only after the scheduler is running on a
different kernel stack. The thread record remains charged while a live
ThreadHandle, pending join waiter, or unjoined exit status can still observe
it. Once the handle is released without a pending join, or once a one-shot join
has consumed the status and no wait record pins it, the retained record charge
is released. Process exit releases all thread records and stack charges once.
The off-stack property is enforced by an OffStackToken witness on every stack
frame release path: the deferred per-thread drain calls
Process::release_thread_kernel_stack, whole-process teardown calls
Process::release_all_thread_kernel_stacks, and pre-publication rollback calls
Process::rollback_created_thread. The token constructor is private to the
scheduler module. Implicit Thread::Drop is deliberately not a release path;
if a Thread value reaches its destructor with a nonzero stack, it fails
closed by leaving the frames allocated instead of freeing a stack without an
off-stack witness.
FS Base And TLS
FS base is thread-owned. The existing ThreadControl.getFsBase and
ThreadControl.setFsBase operations keep their names, but after threading they
refer to the current thread, not the whole process. setFsBase continues to
reject non-user-canonical values and writes the CPU FS-base MSR immediately
when called by the running thread. Both methods route through
context-aware dispatch (CapCallContext::caller_thread) so the
operation always targets the caller, never a different thread; calling
ThreadControl from a non-live caller returns
ProcessFsBaseError::CallerNotLive.
The initial process thread uses the PT_TLS block installed by ELF loading.
Additional threads receive an FS base from ThreadSpawner.create; the runtime
is responsible for allocating and initializing each thread’s TLS/TCB data.
There is no process-global FS base. Current-thread FS-base operations are useful
for the single-thread runtime checkpoint, but they must not be treated as the
final threading ABI for language runtimes. True multi-threaded Go or
C/POSIX-like runtime support requires each ThreadRef to own a distinct TLS
block and FS base.
Context switching must save the outgoing thread’s FS base and restore the next thread’s FS base even when both threads belong to the same process and no CR3 switch is needed.
Thread Identity In Waiters And Dispatch
The concrete identity type for in-process scheduling is:
#![allow(unused)]
fn main() {
ThreadRef {
pid,
process_generation,
tid,
thread_generation,
}
}
Process identity still governs authority and accounting, but wakeup and
blocking state must name a thread. 7.2 changes context-aware capability
dispatch so CapCallContext carries both the caller process id for authority
checks and the caller ThreadRef for wake/cancel decisions. Existing pid-only
records that can resume execution or write a caller CQE must be widened before
multiple threads can run in one process.
The migration target is:
TimerSleepWaiterstores the sleepingThreadRefand validates the generation before waking it;- endpoint CALL, RECV, RETURN target, deferred-cancel, current-caller, and
direct IPC handoff records store the blocked or target
ThreadRef; - terminal line input and any other
ProcessWaiterconsumer store the waitingThreadRefand validate the generation before writing a CQE; ProcessHandle.waitrecords the waitingThreadRefwhile the handle still names the child process;ThreadHandle.joinrecords the waitingThreadRefand the targetThreadRef;cap_enterblocks the currentThreadRefon that thread’s ring endpoint;- process-exit cleanup cancels every waiter whose
pidandprocess_generationmatch the exiting process, regardless of thread id.
A generation mismatch on wake or completion is a stale waiter and must be drained without writing to userspace. This mirrors current process-generation behavior and prevents one thread slot reuse from receiving another thread’s Timer, endpoint, join, or ring completion.
Exit And Join
The current exit(code) syscall terminates the current thread. This preserves
single-thread process exit because the process exits when its last non-idle
thread exits, and it avoids tearing down a shared address space while sibling
threads are still current on other CPUs.
Thread exit does not add a new syscall. The initial implementation added
ThreadControl.exitThread(code) as a terminal capability-ring operation on
the current thread, with the same current-thread termination semantics as the
raw syscall. A successful invocation does not post a CQE back to the exiting
thread, because cap_enter will not return to that execution context. It
records the exit code, wakes or completes any valid join waiter, and removes
only the current thread from scheduling. If the last non-idle thread in a
process exits through exit(code) or exitThread, the process exits with that
thread’s code and completes the parent-facing ProcessHandle.
Whole-process termination remains a ProcessHandle operation. It releases the
shared capability table, cancels process-owned endpoint state, removes
timer/park/ring waiters for every thread in the process, and completes the
parent-facing ProcessHandle after the process is no longer current on any
CPU.
ThreadHandle.join is process-local and one-shot. If the target thread already
exited and its status is retained, join returns its code immediately and marks
the status joined. If it is still live, join blocks the caller’s thread until
the target exits. Self-join returns Failed. A second waiter, join after a
successful join, or join after detach returns Failed; it must not park an
ambiguous waiter. ThreadHandle.exitCode is nonblocking and may observe the
retained status while the handle is live, but it does not consume the one-shot
join right.
Releasing the last ThreadHandle before the target exits detaches the target:
the thread continues to run, but no exit status is retained after it exits
unless a join waiter already pins the state. Releasing the handle after exit
but before join drops the retained status and releases the thread-record
charge. A pending join waiter pins the handle state until completion or process
exit, so cap release cannot create a use-after-free. The exiting thread’s
kernel stack must not be freed while it is still executing on that stack; final
process teardown performs an explicit token-gated stack release after another
kernel stack is active, before the deferred Process value is dropped.
Fatal user faults remain process-fatal in the first implementation. Per-thread fault isolation can be designed later, after the basic scheduler and futex paths are stable.
Capability Ring And Blocking
The first Ring v2 implementation keeps the initial thread’s compatibility
ring at RING_VADDR and gives each spawned child thread a kernel-chosen ring
mapping inside the reserved process ring arena. Runtime-selected ring address
ranges remain a later VirtualMemory reservation extension.
ThreadSpawner.create allocates a ring record and user mapping for the new
thread, stores that mapping on the child ThreadRef, and passes the ring
address in the child start registers. cap_enter blocks the current thread
against that thread’s own CQ, so same-process sibling threads may block in
cap_enter independently. Timer, endpoint, join, park, and cancellation paths
must route completions by generation-checked ThreadRef to the target
thread’s ring endpoint.
The runtime’s single-owner ring-client invariant remains local to each ring
client. Well-formed userspace serializes submission and completion matching per
thread ring through capos-rt; it must not have two consumers racing on the
same SQ/CQ. The scheduler still refuses to run the exact same ThreadRef on
two CPUs at once, but it no longer treats every multithreaded pid as tied to
one scheduler CPU.
This is sufficient for functional same-process sibling scheduling. The formal
accepted 1-to-2 make run-thread-scale capOS evidence is the capos-bench
2026-05-02 21:38 UTC pair (work 1.883x, total 1.787x, both clearing the
configured 1.6x gates). The guest result row’s accepted field remains
diagnostic; the host summary enforces the work-window and total-time gates, and
refuses speedup enforcement unless CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS
records the QEMU CPU pin set. Linux validates the repaired benchmark shape
through four workers on physical cores (3.963x/3.858x). That capOS
4-worker row was diagnostic (1.566x/1.538x) and justified Phase D’s
per-CPU WFQ queues plus bounded stealing. The 2026-05-10 Phase D rerun
recorded 1-to-4 work/total diagnostics 3.088x/2.700x, manually accepted
for closeout; remaining risks are the shared scheduler lock, temporary CPU
pinning, CQ/join/exit/block/schedule overhead, broader workload classes, and
higher-thread-count evidence.
Scheduling Policy And Context Authority
SchedulingPolicyCap is the caller-thread-bound surface for WFQ knobs.
Every method routes through CapCallContext::caller_thread; there is no
per-cap-object ThreadHandle, no badge-encoded thread id, and no
cross-thread mutation in this slice. Cross-thread authority is deferred to
the privileged scheduler-policy service plan. The schema shape is:
interface SchedulingPolicyCap {
setWeight @0 (weight :UInt16) -> ();
setLatencyClass @1 (class :LatencyClass) -> ();
snapshot @2 () -> (
weight :UInt16,
class :LatencyClass,
runtimeNs :UInt64,
virtualRuntimeNs :UInt64,
);
}
setWeight validates against [MIN_WEIGHT, MAX_WEIGHT] at the cap
boundary and updates the caller thread’s WFQ weight; the new weight
applies to the next enqueue’s virtual_finish_ns tag and to subsequent
virtual_runtime_ns accounting. setLatencyClass swaps the per-thread
LatencyClass (Normal, Interactive, IpcServer, Batch) used to
scale the dispatcher slice. snapshot is a read-only observer over the
core WFQ state and does not expose the measure-only counters.
SchedulingContext is the schema-typed cap for dispatcher budget
authority:
interface SchedulingContext {
info @0 () -> (info :SchedulingContextInfo);
create @1 (spec :SchedulingContextSpec) -> (
contextIndex :UInt16,
identity :SchedulingContextIdentity,
result :SchedulingContextOperationResult,
dispatchEffect :SchedulingContextDispatchEffect,
);
bindCallerThread @2 () -> (
identity :SchedulingContextIdentity,
binding :SchedulingContextBinding,
result :SchedulingContextOperationResult,
dispatchEffect :SchedulingContextDispatchEffect,
);
revoke @3 () -> (
identity :SchedulingContextIdentity,
previousGeneration :UInt64,
result :SchedulingContextOperationResult,
dispatchEffect :SchedulingContextDispatchEffect,
);
drainNotifications @4 () -> (
notifications :SchedulingContextNotificationSnapshot,
);
}
create returns a same-interface child context as transferred result
cap 0 and becomes chargeable only after bindCallerThread. revoke
bumps the generation and clears any matching thread binding; later calls
through the stale cap generation report staleGeneration or fail closed
before mutating scheduler state. drainNotifications reads the fixed
per-context budget-depleted and deadline-or-timeout slots; the
scheduler updates these in place from hard paths without allocation,
including the holder identity and a donatedHolder bit for endpoint
donation/return. The bootstrap manifest grants SchedulingPolicyCap and
SchedulingContext only to focused-proof manifests; the default boot
manifest does not grant them.
Userspace API Surface
The capos-rt runtime exposes the threading caps as typed clients on top
of the per-thread ring:
ThreadControlClient–get_fs_base/set_fs_base/exit_thread, including*_waitblocking variants overRuntimeRingClient.ThreadSpawnerClient::create– submits theentry/stackTop/arg/fsBase/flagsABI and returns anOwnedCapability<ThreadHandle>delivered as transferred result cap 0 in the CQE.ThreadHandleClient–join,exit_code(nonblocking observer), and theirfinish_*helpers;finish_joindecodes the one-shot exit code.SchedulingPolicyClient–set_weight,set_latency_class, andsnapshot, all caller-thread-bound.SchedulingContextClient–info,create,bind_caller_thread,revoke, anddrain_notifications.
A typical spawn/join pseudocode against these clients is:
#![allow(unused)]
fn main() {
let handle = thread_spawner.create_wait(
&mut ring,
entry_addr,
user_stack_top,
arg,
fs_base,
/* flags */ 0,
timeout_ns,
)?;
// ... runtime work on the parent thread ...
let exit_code = thread_handle
.join_wait(&mut ring, timeout_ns)?;
}
The userspace runtime is responsible for the user stack, TLS/TCB, and any free-list bookkeeping for retired handles; the kernel only validates the ABI fields and charges the per-process ledgers.
Park Handoff
Park authority is defined in Park Authority. The scheduler changes above must leave room for a thread block reason that is not tied to the process ring CQ. The frozen handoff is:
- park wait blocks the current thread, not the whole process;
- park wake makes selected generation-checked
ThreadRefvalues runnable; - timeouts use the same monotonic time base as
Timer; - private park keys are based on address-space identity plus user virtual address;
- shared-memory park keys are MemoryObject-derived identity plus offset;
- the first implementation starts with compact
CAP_OP_PARKandCAP_OP_UNPARKoperations rather than generic Cap’n Proto methods; - park wait SQEs are thread-owned so ring dispatch cannot park a sibling
thread under the waiter’s
user_data; - blocking park wait is a syscall-context operation that releases runtime
ring-client ownership before the thread parks, while
capos-rtdemultiplexes reserved park CQEs back to the waiting thread.
Pre-thread 4.5.4 measurement chose the compact capability-authorized shape for
failed wait and empty wake. 4.5.5 measured the real blocked/resume path through
thread-lifecycle under make run-measure, so the compact ParkSpace opcodes
remain the runtime ABI target for this slice.
Security Invariants
- A thread never owns a separate capability table in the initial model.
- A thread cannot escape the authority of its containing process.
- A
ThreadHandlenames only a thread in the same process and is non-transferable in the first implementation. - Thread creation is charged to one process-owned thread/kernel-stack ledger of record before the thread can become runnable.
- Process exit releases shared authority once, after all live threads are removed from scheduling.
- Per-process resource quotas are shared by all threads.
ThreadControlchanges only the current thread’s FS base.ThreadControl.exitThreadterminates only the current thread and is a capability-ring operation, not a syscall.- Every waiter or direct handoff that can resume execution stores a generation
checked
ThreadRef. - Process-owned user-buffer validation/copy/read paths hold the process
AddressSpacelock; future shared-memory thread primitives still need mapping provenance or object pins when they derive keys from shared backing.
Implementation Order
- Add internal
Threadstate, make each process own one initial thread, move saved context / kernel stack / FS base / block state onto that thread, and charge the initial thread against private process ledgers. Done 2026-04-24 23:09 UTC. - Change scheduler queues, blocking, exit cleanup, and direct IPC targets from pid-oriented state to thread references while preserving one thread per process. Done 2026-04-24 23:33 UTC.
- Add
ThreadSpawner,ThreadHandle, andThreadControl.exitThreadwith a QEMU smoke for create, join, detach, self-join rejection, second join rejection, and last-thread process exit. Done 2026-04-25. - Implement the ParkSpace private wait/wake path from Park Authority after the scheduler can block and wake individual threads, then run 4.5.5 blocked/resume measurements before declaring the park ABI stable. Done 2026-04-25.
Validation
The thread-lifecycle proof creates multiple threads in one process, proves
they share the address space and CapSet, proves each has an independent FS
base, rejects invalid join cases, joins one thread from another, and lets the
last thread exit the process. The existing make run-spawn path keeps covering
runtime-fs-base and single-thread-runtime so regressions in the pre-thread
runtime contract stay visible. make run-measure additionally records the
private ParkSpace blocked/resume timings and proves process exit with a parked
park waiter. Phase D fairness/Interactive/weight-change smokes
(make run-thread-fairness, make run-thread-fairness-interactive,
make run-thread-fairness-weight-change) exercise the SchedulingPolicyCap
caller-thread-bound surface; the thread-scale proof carries the recorded
WFQ scaling evidence. The recorded 1-to-2 work/total speedup gate is the
host-enforced Phase D acceptance criterion; the 1-to-4 row remains a
manually accepted diagnostic. Safe runtime park wrappers and a focused
SchedulingContext budget/donation/notification smoke remain future
capos-rt and harness work.
Park Authority Contract
This page freezes the 7.1.1 design contract for thread-park (park/unpark)
authority. It is the handoff from the in-process threading contract to the 7.2
implementation work and records the first 7.2.3 implementation status.
Linux prior art. Park solves the same problem as Linux futex(2):
userspace owns the uncontended fast path through atomic operations on a 32-bit
word, and the kernel parks/wakes threads only on contention. capOS uses the
distinct name Park because the contract differs in important ways from
Linux’s: it is capability-gated (no ambient authority), there is no priority
inheritance, no requeue, no robust lists, and the shared variant is keyed by
MemoryObject identity rather than (inode, pgoff). References to “Linux
futex” in this page point to that prior art, not to the capOS API surface.
Scope
The first park milestone stays single-CPU and in-process. It gives a multi-threaded runtime one kernel primitive: park the current thread when a userspace word still has an expected value, and wake parked threads associated with that word. Userspace owns the uncontended path through ordinary atomic operations; the kernel owns only the contended sleep/wake path and timeout integration.
This contract covers:
- production park authority objects;
- private and shared park key identity;
- the provisional compact wait/wake transport ABI;
- scheduler, timeout, and process-exit interactions;
- resource-accounting and security invariants;
- the 4.5.5 measurement loop after real thread blocking exists.
This is not a Linux futex(2) compatibility surface. Priority inheritance,
requeue, robust lists, shared-memory park-words before MemoryObject mapping
identity is exposed, and SMP-safe user-buffer pinning remain later work.
Implementation Status
The 2026-04-25 7.2.3 slice implements:
- schema marker interfaces for
ParkSpaceandSharedParkSpace; - compact
CAP_OP_PARKandCAP_OP_UNPARKopcodes; - process-local, non-transferable ParkSpace grants through boot/spawn manifests;
- private wait/wake keyed by the caller process address space and user virtual address;
- per-thread
Parkblock state with finite timeout integration; - one reserved CQE credit per parked waiter so wake/timeout delivery cannot be crowded out by ordinary completions;
- QEMU correctness coverage in
thread-lifecyclefor mismatch, immediate timeout, wake-one, wake-many, anonymous VirtualMemory multi-waiter unmap range cleanup with stale wake-after-reuse checks, anonymous VirtualMemory.decommit reuse stale waiter cleanup, and MemoryObject.unmap borrowed-mapping reuse stale waiter cleanup; - 4.5.5 QEMU timing coverage in
run-measure.
SharedParkSpace is a marker only. capos-rt has the marker type but no safe park
client wrapper yet; the current correctness and measurement demos use raw
compact SQEs so the ABI can settle before runtime synchronization wrappers
claim the user_data namespace.
Design Grounding
The reviewed project documents for this contract are:
docs/tasks/README.md;docs/roadmap.md;REVIEW.md;docs/architecture/threading.md;docs/architecture/scheduling.md;docs/architecture/userspace-runtime.md;docs/proposals/go-runtime-proposal.md.
The relevant research grounding is:
docs/research/out-of-kernel-scheduling.mdfor the kernel-assisted wait/wake split used by language runtimes;docs/research/llvm-target.mdfor the Go/runtime syscall surface that needs thread creation, per-thread TLS, and futexes;docs/research/genode.mdfor typed capability precedent and resource-accounted session state.
Authority Objects
ParkBench remains measurement-only. It is not a production authority and
must not be granted by normal boot manifests.
The first production model has two authority objects:
interface ParkSpace {}
interface SharedParkSpace {}
These schema interfaces are marker interfaces for typed CapSet/result-cap identity. The wait and wake operations use compact ring opcodes rather than Cap’n Proto methods, because the pre-thread 4.5.4 measurement showed the generic Cap’n Proto path is not the right default for the park hot path.
ParkSpace is minted for a process by the same bootstrap/spawn path that
grants ThreadControl and ThreadSpawner. It is process-local and
non-transferable in the initial implementation. Holding it authorizes private park wait/wake only in the
caller’s own address space; it does not grant memory access, cross-process wake
authority, or the right to name arbitrary kernel wait queues.
SharedParkSpace is the shared-park object for a later MemoryObject-derived slice. A
MemoryObject holder can derive a SharedParkSpace scoped to that MemoryObject’s backing
identity. Shared park operations through that SharedParkSpace are keyed by object
offset, not by one process’s virtual address. The first 7.2 implementation may
leave SharedParkSpace unimplemented, but it must not choose a private-key ABI that
prevents this shared-key model.
Park Keys
Private park keys are address-space scoped:
#![allow(unused)]
fn main() {
ParkKey::Private {
address_space_id,
address_space_generation,
uaddr,
}
}
The first implementation can derive address_space_id and generation from the
process id/generation while each process owns exactly one address space. The
contract names address-space identity deliberately so a later fork/shared-AS
model does not inherit a pid-shaped key.
Private parks are synchronization inside one address space. wake for a
private key may wake only waiters in the same address space generation; a raw
virtual address alone is never cross-process synchronization authority.
Shared park keys are MemoryObject scoped:
#![allow(unused)]
fn main() {
ParkKey::Shared {
memory_object_id,
memory_object_generation,
offset,
}
}
Shared keys are disabled until the kernel can prove, while handling a park operation, that the submitted user address maps the MemoryObject backing the SharedParkSpace and can compute the byte offset in that backing object. Virtual aliases of the same shared page must converge on the same shared key. Private aliases within one address space do not converge unless they use the same user virtual address.
Shared parks require explicit shared-memory authority through the
MemoryObject-derived SharedParkSpace. Never use raw virtual address alone for
cross-process park/futex keys.
All park words are 32-bit and must be 4-byte aligned. wait validates the
word as a readable user mapping before reading it. wake validates that the
address is user-canonical and aligned; shared wake additionally validates
the MemoryObject mapping identity so a caller cannot wake an unrelated object
by guessing an offset.
Private-key cleanup is part of the ParkSpace contract, not an implementation
detail of the Go runtime. Unmap, revoke, address-space generation change, and
address-space teardown must drain or fail waiters for the old private key
before the same virtual address can be reused as unrelated state. A stale
private waiter may complete only against the address-space generation it was
registered under; it must not observe or wake a later mapping with the same
numeric uaddr.
Current implementation status: process/thread-exit cleanup exists. Anonymous
VirtualMemory.unmap, VirtualMemory.decommit, and MemoryObject.unmap for
borrowed mappings drain private waiters whose uaddr lies in the affected
range by posting PARK_INTERRUPTED through the waiter’s reserved completion
credit before making the blocked thread runnable. Cleanup removes the waiter
from the address-keyed private wait table before attempting the completion. If
that completion cannot be posted immediately, the thread remains blocked in a
pending park-completion state with the exact completion status and reserved
completion credit still charged, and scheduler wake processing retries the
stored status; the waiter is not restored to the uaddr table while the
virtual address can be reused.
Shared park-word cleanup and explicit address-space generation teardown remain
open. Until those land, the implemented private path is suitable for
process-lifetime park words, anonymous VirtualMemory regions that use these
unmap/decommit paths, and borrowed MemoryObject mappings that are explicitly
unmapped with MemoryObject.unmap.
The ordinary QEMU proof covers wake-one, wake-many, handoff wake retry, multi-waiter private range cleanup, and stale wake-after-reuse. It does not deterministically force the transient unmap interruption ring-scratch contention race that can make the first interruption completion post fail: from userspace, waiter submission is observable before the kernel registers the waiter or after ring dispatch has released the scratch buffer. The production cleanup path therefore treats that race as a retry state outside the address-keyed waiter table rather than restoring the waiter.
Provisional Ring ABI
The 7.2 implementation starts with compact capability-authorized operations:
CAP_OP_PARK;CAP_OP_UNPARK.
The numeric opcode values are assigned when the implementation edits
capos-config/src/ring.rs. CAP_OP_PARK_BENCH remains reserved for
measurement-only kernels and must not be repurposed.
CAP_OP_PARK uses the existing 64-byte SQE fields as:
| SQE field | Meaning |
|---|---|
cap_id | ParkSpace for private wait, or SharedParkSpace for shared wait |
user_data | returned in the wait completion CQE |
addr | user virtual address of the 32-bit park word |
len | expected 32-bit value |
pipeline_dep | relative timeout in monotonic nanoseconds; u64::MAX means no timeout |
flags | must be CAP_SQE_THREAD_OWNED |
call_id | owning thread id; a different thread leaves the SQE at the ring head |
CAP_OP_UNPARK uses:
| SQE field | Meaning |
|---|---|
cap_id | ParkSpace for private wake, or SharedParkSpace for shared wake |
user_data | returned in the wake caller’s completion CQE |
addr | user virtual address of the 32-bit park word |
len | maximum number of waiters to wake; zero is malformed |
Both operations require method_id, result_addr, result_len,
pipeline_field, xfer_cap_count, and _reserved0 to be zero.
CAP_OP_UNPARK also requires flags == 0, pipeline_dep == 0, and
call_id == 0. Park operations are not promise-pipelineable in this slice.
pipeline_dep is used as the wait timeout storage only for
CAP_OP_PARK; future promise pipelining must keep rejecting
CAP_SQE_PIPELINE on park opcodes or replace the park ABI in a reviewed
branch.
Wait completions use non-negative CQE.result statuses:
| Result | Meaning |
|---|---|
PARK_WOKEN = 0 | a wake operation made the thread runnable |
PARK_VALUE_MISMATCH = 1 | the loaded word did not equal expected |
PARK_TIMED_OUT = 2 | the timeout expired before a wake |
PARK_INTERRUPTED = 3 | a future cancellation/interrupt path aborted the wait |
Wake completions return the non-negative number of threads woken. Malformed SQEs, invalid caps, unreadable wait words, unsupported cap object types, and stale authority use the existing negative transport errors until a later ABI adds a more specific compact-error namespace.
Ring Ownership And Dispatch Context
Park operations use the process capability ring for submission and CQE
delivery, but blocking wait is not an ordinary long-lived runtime call. A
runtime must not hold RuntimeRingClient while the thread is parked in
CAP_OP_PARK; otherwise no sibling thread in the same process can borrow
the same ring client to submit CAP_OP_UNPARK.
The runtime contract for park operations is:
capos-rtowns a process-wide park submission/completion path separate from the generic request-bufferRuntimeRingClientpending-call list;- park wait reserves a unique
user_datavalue, writes the SQE while holding the runtime’s ring-submission lock, records a park-wait completion slot in runtime-owned memory, and releases the ring-submission lock before enteringcap_enter; - park wait sets
CAP_SQE_THREAD_OWNEDandcall_idto the current thread id so a sibling thread cannot drain the wait and park the wrongThreadRef; - the park
user_datanamespace is reserved by the runtime so ordinary generic clients cannot accidentally claim a park completion; - all runtime CQ draining must route reserved park
user_datacompletions to the park-wait slot instead of treating them as generic client completions; - if another thread drains the waiter CQE before the waiting thread returns
from
cap_enter, the waiting thread reads the already-recorded status from that park-wait slot; - park wake may use the ordinary serialized ring submission path because it completes without parking the caller’s thread.
CAP_OP_PARK is syscall-context only. Timer ring polling and any future
interrupt-context ring drain must leave it unconsumed because consuming it can
block the current thread and mutate scheduler state. CAP_OP_UNPARK also
starts as syscall-context only; widening wake to timer polling would need a
separate review of scheduler locking and completion delivery.
This design preserves one process ring and the single blocked cap_enter
waiter rule. A thread blocked in Park is not the process ring’s
CapEnter waiter, so a sibling can still enter the kernel to submit wake,
Timer, IPC, or ordinary capability work through the same process ring.
Wait And Wake Semantics
wait is atomic with respect to wake for the same key:
- validate the SQE shape, including thread ownership, and authority cap;
- verify
call_idnames the current thread so a sibling cannot park on behalf of the waiter; - validate the user address shape and derive the private or shared park key;
- lock the current process
AddressSpaceacross validation and the user-word read for private keys; future shared keys must additionally prove mapping identity or pin the backing object; - take the park bucket lock;
- read the 32-bit user word while the bucket lock is held;
- compare the loaded value with
expected; - if the value differs, post
PARK_VALUE_MISMATCHwithout blocking; - if the value matches and the timeout is zero, post
PARK_TIMED_OUTwithout blocking; - otherwise, record the current
ThreadRef, key, timeout deadline, anduser_data, then block only the current thread.
The user-word read, comparison, and enqueue are serialized with wake by the
park scheduler path, and the read itself occurs while the process
AddressSpace mutex is held. This prevents a page-table validation/use race
and the classic lost wake where a waiter reads the old value, a sibling stores
the new value and wakes no one, and the waiter then parks based on the stale
read. Shared park-words still need mapping provenance or object pinning so a
MemoryObject-derived key cannot be swapped out from under key derivation. The
user word is not a kernel-owned mutex. Runtime code must use normal atomic
load/store and memory-ordering rules around the park word.
wake derives the same key, removes up to maxWake valid waiters from that
key’s FIFO list, posts PARK_WOKEN completions to the waiting process
ring using the completion credits reserved when those waiters parked, and marks
those ThreadRef values runnable after generation checks. A wake SQE is
consumed only when the kernel can also post the wake caller’s own CQE; if that
ordinary CQ slot is not available, no waiters are removed and the SQE remains
pending like other uncompletable ring work. Stale waiters caused by thread or
process generation mismatch are drained without writing to userspace, release
their reserved completion credits, and do not count as successfully woken.
If a valid waiter is still in a current or handoff CPU slot when the wake path
removes it from the address-keyed table, the wake still counts that waiter as
woken and records a pending PARK_WOKEN completion for scheduler retry.
Timeouts use the same monotonic time base as Timer. The kernel may convert
nanoseconds to scheduler ticks internally, but the ABI remains nanoseconds.
Finite deadlines post PARK_TIMED_OUT through the waiting process ring
using the waiter’s reserved completion credit and wake the blocked thread if
the thread generation still matches.
An explicit wake, timeout, cancellation, process exit, and unmap/revoke cleanup
race must produce exactly one waiter completion or cleanup-consumption path.
Once any path consumes the waiter record, the other racing paths must observe it
as gone and must not post a second CQE or wake a later ThreadRef.
Process exit removes every park waiter whose pid/process generation matches the exiting process. Thread exit removes that thread’s own park waiter before the thread record can be retained for join observation. These cleanup paths must not allocate.
Unmap, mapping revoke, and address-space teardown remove or fail private waiters for the affected key/generation before the old virtual address range is made reusable for unrelated mappings. A wake or timeout racing with cleanup must either complete the old waiter under its original generation or observe that cleanup already consumed it; it must not post a completion to a new owner of the same numeric address.
Resource Accounting
Park waits are bounded by the process thread ledger. A thread can be in only one scheduler block reason, so live park waiters cannot exceed live threads. The first private ParkSpace implementation stores the wait node in thread-owned block state and links it into a fixed process-owned waiter table. That is valid only because private ParkSpace caps are process-local and the first key is the process address space plus user virtual address. Shared SharedParkSpace support must move to object-owned fixed buckets scoped to MemoryObject identity. Wait, wake, timeout, and process-exit cleanup must not allocate. Registering a blocking wait reserves one deferred CQE credit in the waiting process. Ordinary completion posting treats reserved credits as unavailable, so wake and timeout paths can always post the waiter completion without losing the waiter. If the kernel cannot reserve that credit, it must not enqueue or block the wait; it either leaves the SQE pending until capacity exists or posts a negative completion for the wait attempt without consuming a waiter slot.
ParkSpace creation is charged as ordinary process capability/table state.
If the first implementation needs per-process bucket storage beyond the cap
object itself, that storage must be reserved before the ParkSpace is
published and released when the process exits or the cap is finally dropped.
In the first private implementation, the waiter table is process-owned and
survives release of the ParkSpace handle. CAP_OP_RELEASE of the last
capability handle removes submit authority but cannot free a parked waiter’s
storage. A waiter can still receive a PARK_WOKEN CQE from a wake
operation that already resolved the authority object, a PARK_TIMED_OUT
CQE from a finite deadline, or a future PARK_INTERRUPTED CQE from an
explicit cancellation path. Thread or process exit drains the wait node without
posting a CQE to the exiting thread/process and releases the reserved
completion credit. If a runtime drops the last ParkSpace while it has
indefinite waiters, it can deadlock its own process, but it cannot create a
use-after-free or leak authority outside that process. Future shared SharedParkSpace
storage must use explicit non-cap-table waiter pins so object-owned buckets are
not freed while parked waiters remain.
SharedParkSpace storage is charged to the MemoryObject-derived object when shared
parking lands. It must not create a second unbounded resource path where a
holder can allocate wait queues by touching many offsets.
Security Invariants
- Holding a ParkSpace or SharedParkSpace authorizes blocking/waking, not memory access. Wait still requires a readable user word.
- Private ParkSpace caps are process-local and non-transferable in the first implementation.
- Shared park authority must be derived from MemoryObject identity and offset, not from another process’s virtual address.
- Park wait blocks the current thread, not the whole process.
- Park wait SQEs are thread-owned; a non-owner
cap_enterleaves the SQE at the ring head instead of parking the wrong thread. - Park wake can only make generation-checked ThreadRef values runnable.
- Park completions are posted to the waiting process ring using the waiter
SQE’s
user_data. - Blocking wait registration reserves one CQE credit for the eventual waiter completion, and wake must not remove a waiter unless that credit exists.
CAP_OP_PARKis dispatched only from syscall-contextcap_enterand never from timer or interrupt-context ring polling.- A parked private ParkSpace waiter is stored in process-owned fixed storage; future shared SharedParkSpace waiters must pin the authority object backing their bucket table until wake, timeout, thread exit, or process exit removes the waiter.
- One process ring still has at most one blocked
cap_enterwaiter in 7.2; park wait does not create an extra blocked ring waiter. - Private ParkSpace wait reads hold the process
AddressSpacelock across validation and the user-word read. SharedParkSpace park-words remain blocked until MemoryObject mapping provenance or explicit object pins cover shared key derivation.
Measurement Handoff
4.5.4 measured failed wait and empty wake before real threads existed. That
result chooses a compact capability-authorized operation as the starting ABI
for 7.2 rather than a generic Cap’n Proto wait/wake method pair.
4.5.5 is closed for the first real thread-blocking path. It measures:
- value-mismatch wait;
- empty wake;
- wait-to-block;
- wake-to-runnable;
- wake-to-resume through
cap_enter.
The 2026-04-25 QEMU sample printed:
[thread-lifecycle] park path avg cycles: failed_wait=6778 empty_wake=6840 wait_to_block=55994326 wake_to_runnable=28219 wake_to_resume=28000684
The compact shape still holds for this slice: CAP_OP_PARK and
CAP_OP_UNPARK remain the production runtime ABI target, while
ParkBench remains measurement-only.
Implementation Order
- Add
ParkSpaceandSharedParkSpacemarker interfaces plus compact opcode constants. - Add a process-local ParkSpace grant path next to
ThreadControlandThreadSpawner; keep it non-transferable. - Add thread-owned
Parkblock state and fixed private waiter storage with no wait/wake allocation. - Dispatch
CAP_OP_PARKandCAP_OP_UNPARKagainst ParkSpace for private address-space keys. - Add QEMU smoke coverage for mismatch, timeout, wake-one, wake-many, and handoff wake retry. Safe runtime park wrappers remain a later capos-rt slice.
- Run 4.5.5 blocked/resume measurements and fold the result into the final ABI decision.
- Drain or fail private waiters before the affected virtual address range
can be reused. Anonymous
VirtualMemory.unmapandVirtualMemory.decommit, plusMemoryObject.unmapfor borrowed mappings, are covered; shared park-word cleanup and address-space generation teardown remain open. - Add MemoryObject-derived SharedParkSpace support only after mapping provenance or object pins cover shared key derivation under the same validation/use discipline.
Validation
The thread-lifecycle proof creates multiple threads in one process, parks
threads on a userspace park word, wakes them through the same ParkSpace, proves
timeout and value-mismatch paths, and shows that process exit drains pending
waits. make run-measure records failed-wait, empty-wake, wait-to-block,
wake-to-runnable, and wake-to-resume timings for the implemented private path.
Safe capos-rt park wrappers remain future runtime work.
Capability Model
How capabilities work in capOS.
What is a Capability
A capability in capOS is a reference to a kernel object that carries:
- An interface (what methods can be called), defined by a Cap’n Proto schema
- A permission (the object it references, enforced by the kernel)
- A wire format (Cap’n Proto serialized messages for all invocations)
A process can only access a resource if it holds a capability to it. There is no ambient authority – no global namespace, no “open by path” syscall, no implicit resource access.
Identity Terms and Authority
capOS documentation uses identity terms as policy metadata, not as kernel authorization primitives. A user is human-facing prose. A principal is the stable identity metadata used by authentication, policy, audit, and ownership records. An account is planned durable local record state for a principal, including credential references, roles, attributes, storage-root references, and default profile names. A session is the live context that receives a concrete CapSet. Policy profiles and resource profiles select bundle fragments, approval eligibility, and quotas that a trusted broker may use when minting capabilities.
None of those terms is kernel authority: the kernel dispatches through
generation-tagged CapId entries, not users, roles, accounts, groups, UIDs,
or profile names. Account-store behavior, durable profile records, and broader
quota policy remain future work tracked in the
local users backlog.
Session-Bound Invocation Context
Services should not infer authority from caller-supplied identity fields. A
request parameter such as user, principal, client, or role is data.
The active model is one immutable session context per process plus explicit
capabilities granted by a broker or supervisor.
The general pattern is:
- authentication or admission creates a live
SessionContext; - process spawn installs exactly one immutable session context in the child;
AuthorityBrokergrants service roots/facets appropriate to that session;- endpoint calls carry privacy-preserving caller-session metadata by default;
- subject details such as global principal id, display name, profile class, or external claims are disclosed only through explicit client disclosure and a matching broker/service disclosure scope. The current endpoint CALL path implements this as a disclosure request mask intersected with cap-held disclosure scope.
The kernel role is narrower. It verifies that a process holds a live cap-table entry, that the process session is live, and that transfer/spawn obey session scope. It may deliver an opaque service-scoped caller-session reference and freshness result to endpoint servers, but it must not disclose broader subject details by default. It does not decide that a process is Alice, an operator, a moderator, or an NPC. Those are policy facts maintained by session, broker, account, and application services.
Opaque receiver selectors may still exist in the IPC implementation and in
historical service-object routing tests. A receiver selector is not identity
metadata, not shell syntax, not a user field, not a disclosure channel, and not
a role bit. New shared-service identity should use the caller session context
and broker-granted service facets, not caller-selected numeric labels.
The chat demo now follows this rule for membership: the server receives the
endpoint caller metadata and keys member records by an opaque live
caller-session reference, while chat handles remain request data and visible
member labels are assigned by the service.
The shared chat/adventure endpoint helper now exposes caller-session metadata
through EndpointCaller instead of a badge field; the old badge-named
user-data type remains only as a source-compatible alias. Terminal output and
shell-serviced stdio bridges are also gated by live caller-session metadata.
Schema as Contract
Capability interfaces are defined in .capnp schema files under schema/.
The schema is the canonical interface definition. Currently defined:
interface Console {
write @0 (data :Data) -> ();
writeLine @1 (text :Text) -> ();
}
interface TerminalSession {
write @0 (data :Data) -> ();
writeLine @1 (text :Text) -> ();
readLine @2 (request :LineRequest) -> (status :LineStatus, line :Data);
}
interface FrameAllocator {
allocFrame @0 () -> (handleIndex :UInt16);
allocContiguous @1 (count :UInt32) -> (handleIndex :UInt16);
}
interface MemoryObject {
info @0 () -> (pageCount :UInt32, sizeBytes :UInt64);
map @1 (hint :UInt64, offset :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
unmap @2 (addr :UInt64, size :UInt64) -> ();
protect @3 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
}
interface VirtualMemory {
map @0 (hint :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
unmap @1 (addr :UInt64, size :UInt64) -> ();
protect @2 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
}
interface Endpoint {}
interface ProcessSpawner {
spawn @0 (name :Text, binaryName :Text, grants :List(CapGrant)) -> (handleIndex :UInt16);
}
interface ProcessHandle {
wait @0 () -> (exitCode :Int64);
}
interface BootPackage {
manifestSize @0 () -> (size :UInt64);
readManifest @1 (offset :UInt64, maxBytes :UInt32) -> (data :Data);
}
# Management-only introspection. Ordinary handle release uses the system
# transport opcode CAP_OP_RELEASE, not a method here.
interface CapabilityManager {
list @0 () -> (capabilities :List(CapabilityInfo));
revoke @1 (capId :UInt32) -> ();
# grant is planned for a later Stage 6 management slice
}
Each interface has a unique 64-bit TYPE_ID generated by the Cap’n Proto
compiler. TYPE_ID is the schema constant. interface_id is the runtime
metadata used by CapSet/bootstrap descriptions and endpoint delivery headers.
Method dispatch uses the interface assigned to the capability entry plus
method_id; method_id selects a method inside that schema.
This is not capability identity. A CapId is the authority-bearing handle in
a process table, analogous to an fd. Multiple capabilities can expose the same
interface:
cap_id=3-> serial-backedConsolecap_id=4-> log-buffer-backedConsolecap_id=5->Consoleproxy served by another process
All three use the same Console TYPE_ID, but they are different objects
with different authority. The manifest/CapSet should record the expected schema
TYPE_ID as interface metadata for typed handle construction. Normal CALL SQEs
do not need to repeat it because the kernel or serving transport can derive it
from the target capability entry. CapSqe keeps reserved tail padding for ABI
stability.
The kernel exposes the initial CapSet to each process as a read-only
4 KiB page mapped at capos_config::capset::CAPSET_VADDR and passes its
address in RDX to _start. The page starts with a
CapSetHeader { magic, version, count } and is followed by
CapSetEntry { cap_id, name_len, interface_id, name: [u8; 32] } records
in manifest declaration order. Userspace looks up caps by the manifest
name rather than by numeric index (capos_config::capset::find), so
grants can be reordered in system.cue without breaking clients. The
mapping is installed without WRITABLE so userspace cannot mutate its
own bootstrap authority map.
Security invariant: a CapTable entry exposes one public interface. If the
same backing state must be available through multiple interfaces, mint multiple
capability entries, each wrapping the same state with a narrower interface.
Do not grant one handle that accepts unrelated interface_id values; that
makes hidden authority easy to miss during review.
Invocation Path
Capabilities are invoked via a shared-memory capability ring (io_uring- inspired). Each process has a submission queue (SQ) and completion queue (CQ) mapped into its address space. Two invocation paths exist:
Caller builds capnp params message
→ serialize to bytes (write_message_to_words)
→ write CALL SQE to SQ ring (pure userspace memory write)
→ advance SQ tail
→ caller invokes cap_enter for ordinary capability methods
(timer polling only runs explicitly interrupt-safe CALL targets)
→ kernel reads SQE, validates user buffers
→ CapTable.call(cap_id, method_id, bytes)
→ kernel writes CQE to CQ ring
... caller reads CQE after cap_enter, or spin-polls only for
interrupt-safe/non-CALL ring work ...
→ caller reads CQE result
CapObject::call does not receive a caller-supplied interface ID. The cap
table derives the invoked interface from the target entry before invoking the
object. The SQE carries only the capability handle and method ID because each
capability entry owns one public interface:
#![allow(unused)]
fn main() {
pub trait CapObject: Send + Sync {
fn interface_id(&self) -> u64;
fn label(&self) -> &str;
fn call(
&self,
method_id: u16,
params: &[u8],
result: &mut [u8],
reply_scratch: &mut dyn ReplyScratch,
) -> capnp::Result<CapInvokeResult>;
}
}
All communication goes through serialized capnp messages, even when caller and callee are in the same address space. This ensures the wire format is always exercised and makes the transition to cross-address-space IPC seamless.
The result buffer is supplied by the caller (the user-validated SQE result
region). Implementations serialize directly into it and return the number of
bytes written, so the kernel’s dispatch path does not allocate an intermediate
Vec<u8> per invocation.
Capability Table
Each process has its own capability table (CapTable), created at process
startup. The kernel also maintains a global table (KERNEL_CAPS) for
kernel-internal use. Each table maps a CapId (u32) to a boxed CapObject.
CapId encoding: [generation:8 | index:24]. The generation counter increments
when a slot is freed, so stale CapIds (from a previous occupant of the slot)
are rejected with CapError::StaleGeneration rather than accidentally
referring to a different capability.
Generation wrap must not resurrect old authority. The implemented table retires
a slot permanently when its 8-bit generation would wrap from 255 back to 0;
that slot is not returned to the free list. Heavy churn can therefore exhaust a
table even when many retired slots are empty, but the failure mode is
CapError::TableFull, not stale-cap revalidation. Future widening of CapId
generation bits is an ABI change and belongs in the schema/ring ABI evolution
track.
Operations:
insert(obj)– register a new capability, returns its CapIdget(id)– look up a capability by ID (validates generation)remove(id)– revoke a capability, bumps slot generationcall(id, method_id, params)– dispatch a method call against the interface assigned to the capability entry
Every current boot manifest gives only initConfig.init a kernel-built
capability table. The default system.cue manifest boots the standalone
init binary, which reads BootPackage, validates initConfig.services, and
spawns capos-shell, the remote-session CapSet gateway, and resident demo
services through ProcessSpawner. The Telnet gateway fixture is retired with
the kernel socket owner. Focused shell-led
manifests such as system-smoke.cue and system-shell.cue still boot
capos-shell directly as initConfig.init for narrow login/shell proofs.
Focused init-executor manifests such as system-spawn.cue also boot the
standalone init binary with Console, BootPackage, and ProcessSpawner for
isolated ProcessSpawner coverage. Child capabilities are assembled from
explicit spawn grants in declaration order:
raw grants preserve the source capability metadata, legacy endpoint-client
grants attenuate an endpoint owner or ProcessSpawner endpoint result source
to a client facet while preserving delegated receiver metadata, and child-local
Endpoint, FrameAllocator, and VirtualMemory grants are minted for the child’s
process. Endpoint kernel grants return parent-side client facets as result
caps; init uses those facets for later service imports and releases them
before waiting on children. Kernel bootstrap now builds only
initConfig.init kernel-sourced caps; CapSource::Service resolution stays in
init’s BootPackage executor path.
CapRef.source is structured CUE inside initConfig.services, not an
authority string:
{
name: "client"
expectedInterfaceId: 0xacf0c15a7b2e0041
source: service: {
service: "endpoint-server"
export: "client"
}
}
The source selector chooses the object or authority to grant. The
expectedInterfaceId value is a schema compatibility check against the
constructed object, not the authority selector itself. This distinction matters
because different objects can implement the same interface.
Transport-Level Capability Lifetime
Cap’n Proto applications do not usually model capability lifetime as an application method on every interface. The RPC transport owns capability reference bookkeeping.
The standard Cap’n Proto RPC protocol is stateful per connection. Each side
keeps four tables: questions, answers, imports, and exports. Import/export IDs
are connection-local, not global object names. When an exported capability is
sent over the connection, the export reference count is incremented. When the
importing side drops its last local reference, the transport sends Release
to decrement the remote export count. Implementations may batch these releases.
If the connection is lost, in-flight questions fail, imports become broken, and
exports/answers are implicitly released. Persistent capabilities, when
implemented, are a separate SturdyRef mechanism and should not be treated as
owned pointers.
References:
This distinction matters for capOS:
close()is application protocol. AFile.close()method can flush dirty state, commit metadata, or tell a server that a session should end.Release/ cap drop is transport protocol. It removes one reference from the caller’s local capability namespace and eventually lets the serving side reclaim the object if no references remain.- Process exit is bulk transport cleanup. Dropping the process must release all caps in its table, cancel pending calls, and wake peers waiting on those calls.
capOS therefore needs a system transport layer in the userspace runtime
(capos-rt / later language runtimes), not just raw SQE helpers. That transport
should own typed client handles, local reference counts, promise-pipelined
answers, and broken-cap state. When the last local handle is dropped, it should
queue a transport-level release operation that is flushed through the kernel
ring at an explicit runtime boundary.
Ordinary handle release is a transport concern, not an application method.
The target design: the generated client drops the last local handle
(RAII / GC / finalizer), the runtime transport queues CAP_OP_RELEASE, an
explicit runtime flush or later ring-client boundary submits it, and the kernel
removes the caller’s CapTable slot with mutable access to that table.
Encoding ordinary local release as a
regular method call on CapabilityManager was rejected because it would
mutate the same table used to dispatch the call; CapabilityManager is
therefore management-only (list() plus child-scoped revoke(capId),
later grant()), not the default release path. CAP_OP_FINISH remains
reserved in the same transport opcode
namespace for application-level “end of work” signals that the transport must
deliver reliably, so the kernel can tell them apart from a truly malformed
opcode.
Current status: the kernel dispatches CAP_OP_RELEASE as a local cap-table
slot removal and fails closed for stale or non-owned cap IDs. capos-rt
bootstrap handles remain explicitly non-owning, while adopted owned handles
queue CAP_OP_RELEASE on final drop and expose Runtime::flush_releases() for
callers that need to force the queued releases. Result-cap adoption validates
the kernel-supplied interface ID before producing an owned typed handle.
CAP_OP_FINISH remains reserved and returns CAP_ERR_UNSUPPORTED_OPCODE.
Process exit remains the fallback cleanup path for unreleased local slots.
Queued release is not immediate revocation. A dropped runtime handle no longer
provides local typed access in that runtime, but the kernel cap-table slot is
removed only after the release SQE is flushed and processed, or during process
exit cleanup. Security-sensitive flows that need to invalidate authority for
other holders or peers must use explicit revoke/epoch semantics such as
CapabilityManager.revoke, session expiry, object epochs, or service-specific
close/revoke methods; they must not rely on destructor timing.
Session expiry is also not a substitute for every revocation shape. The target session lifecycle model has separate layers:
- a mutable session liveness cell for
live,logged_out,revoked,expired, andrecovery_onlystate behind the immutable processSessionContext; - broker grant leases for bundle fragments and elevated or temporary caps;
- object/facet epochs for invalidating a live target generation.
Renewal acts on the first two layers. It may extend session liveness or mint fresh grant leases, but it must not make old ordinary grants fresh merely because the session renewed. Object/facet revocation remains an independent target-side operation.
Service authors should make this distinction explicit in protocol design:
- Use ordinary handle drop or runtime
flush_releases()only to stop this process from using one local cap slot. - Use a service
closemethod when the service must observe application-level shutdown, flush durable state, or publish an orderly end-of-session result. - Use
CapabilityManager.revoke, session expiry, object epochs, or a service-specific revoke method when existing peers or delegated holders must lose authority before the service proceeds. - Treat destructor/finalizer timing as advisory cleanup. It is not a security boundary, and it is not proof that another process has stopped using a cap.
Stale-Handle and Revoke Patterns
Not all kernel cap families use the same model for handling stale or revoked capabilities. The correct pattern depends on the semantics of the object, not on a blanket epoch test. Using the wrong model produces incorrect tests or incorrect behavior expectations.
Category A — Exception-based stale guard
The cap exposes an ensure_*_live guard or an equivalent consumed-state check
that returns a stable typed exception (not a silent success) on a stale or
consumed cap.
UserSession(kernel/src/cap/user_session.rs):info()/auditContext()fail closed with a stable exception message afterlogout(); secondlogoutis idempotent. Proved byrun-ssh-public-key-session.SchedulingContext,CpuIsolationLease: expose an explicitrevokemethod returningstaleGeneration. Subsequentinfo,bind_caller_thread,activation_preflight,create, anddrain_notificationscalls fail closed on the staled cap. Proved byrun-scheduling-context(demos/scheduling-context-smoke/src/main.rs:285-313, 1129-1141) andrun-scheduler-cpu-isolation-lease(demos/cpu-isolation-lease-smoke/src/main.rs:201-237).ThreadHandle(kernel/src/cap/thread_handle.rs):join(sched.rs:1038-1057) returnsAlreadyJoinedon the second call (hard fail, not silent success) and returnsTargetNotLive(sched.rs:1371,1377,1385) if the thread record is absent post-cleanup.exitCode(sched.rs:1418-1420) is a non-consuming idempotent read. Thejoin_or_registerconsumed-state check is the stale guard; thejoinedflag is the epoch. Proved byrun-thread-lifecycle(demos/thread-lifecycle/src/main.rs:293-298).
Per-cap epoch tests are applicable only to Category A caps.
Category B — Idempotent-stale-target
The cap returns silent success (or a latched result) on a stale target. No
ensure_*_live guard is present by design.
ProcessHandle(kernel/src/cap/process_spawner.rs):terminateon an already-exited process returnsComplete(0);waitre-reads the latched exit code. Writing fail-closed tests for Category B caps would test the opposite of intended behavior.
Category C — Soft-EOF / zero-write
The cap uses v0 ExceptionType policy: closing one side causes the other to
drain and receive EOF; writes return zero bytes rather than an error.
Pipe(kernel/src/cap/pipe.rs):closecauses read to drain + EOF, write returns zero bytes (schema lines 2429-2433). No epoch test needed.
Category D — No revoke verb (kernel singletons)
These caps expose no revoke or close method in the schema. The backing
object lives for the process lifetime.
CredentialStore, AuthorizedKeyStore, SshHostKey, EntropySource,
SystemInfo, AuditLog, HardwareAuditLog, SessionManager,
AuthorityBroker, RestrictedLauncher, BootPackage. Nothing to test for
stale-handle behavior.
Category E — DDF caps with release/scrub semantics
These caps use internal handle epoch validation. The full stale-handle behavior for each requires targeted per-cap investigation when a behavior gap is identified.
DmaBuffer, DeviceMmio, Interrupt.
Open residuals
- UserSession expiry path (Category A): the
expiresAtMs/anonymousMs- driven expiry path is not yet covered by a focused smoke.run-ssh-public-key-sessioncovers the explicitlogout()close-side path. Note thatrun-session-contextis flaking on TCG-only hosts — a stability fix is needed before that smoke can be strengthened.
Access Control: Interfaces, Not Rights Bitmasks
capOS deliberately does not use a rights bitmask (READ/WRITE/EXECUTE) on capability entries, despite this being standard in Zircon and seL4. The reason is that Cap’n Proto typed interfaces already serve as the access control mechanism, and a parallel rights system creates an impedance mismatch.
Why rights bitmasks exist in other systems: Zircon and seL4 use rights
because their syscall interfaces are untyped – a handle is an opaque reference
to a kernel object, and the kernel needs something to decide which fixed
syscalls are allowed. capOS has typed interfaces where the .capnp schema
defines exactly what methods exist.
capOS’s approach: the interface IS the permission. To restrict what a caller can do, grant a narrower capability:
Fetch(full HTTP) →HttpEndpoint(scoped to one origin)Store(read-write) →Storewrapper that rejects write methodsNamespace(full) →Namespacescoped to a prefix
The “restricted” capability is a different CapObject implementation that
wraps the original. The kernel doesn’t know or care – it dispatches to
whatever CapObject is in the slot. Attenuation is userspace/schema logic,
not a kernel mechanism.
Session transfer scope: capability holds now carry reference-level transfer
scope. same_session caps cannot move into another process session through
raw IPC, endpoint return, or spawn grants. cross_session_shareable caps may
cross and then invoke under the receiver process session. service_regrant_only
caps require a trusted fixed-session broker/launcher path. These meta-rights
are about the reference, not the referenced object, and do not overlap with
interface-level method access control.
Non-writable filesystem caps are forwardable to a same-session child;
writable caps are not. Directory/File caps are minted Copy/same_session
at the read-only and RAM mint sites, so a holder can forward an opened directory
or file to a ProcessSpawner.spawn child within the same session – the kernel
handoff that backs POSIX fd inheritance across fork/execve. The security
argument is the same for all of them: the child gains no authority the parent
does not already hold, same_session keeps the cap from escaping the session,
and the spawn-grant epoch wrapper keeps a forwarded child cap from outliving a
revoked parent. Two flavours exist:
- Read-only views – the read-only filesystem (
readonly_fs) and the packaged-image source (installable_image), plus theirread_only_fs_root/installable_image_sourcebootstrap roots. Their interfaces fail closed on every mutation, so forwarding shares a pure read view. Here the interface is the permission makes the share unambiguously benign. - The holder’s own RAM scratch namespace – the
directory::transfer_result_capresults and thekernel:directory/kernel:filebootstrap sources (viaboot_cap_hold). ThisDirectory/Fileinterface includes mutation methods, so the forwarded cap is shared read/write with the child, not a read view. It is still safe to forward because it is the parent’s own scratch tree shared within one session, not a privilege the parent lacked.
The disk-backed writable filesystem (writable_fs) is a distinct CapObject
type minted NonTransferable: a writable cap carries the filesystem-wide
single-writer claim, so forwarding it would let two processes hold that claim.
The ProcessSpawner Raw/Move grant modes reject a NonTransferable source, so
the single-writer policy is preserved by the mint-time mode rather than a
separate check. Proven by make run-spawn-grant-directory.
TerminalSession is forwardable to a same-session child, parent-retained.
The bootstrap TerminalSession cap is minted Copy/same_session (matching
Console) in boot_cap_hold, so a holder can forward its terminal-backed
stdout/stderr to a ProcessSpawner.spawn child without losing its own terminal.
TerminalSessionCap is a stateless unit struct: write/writeLine dispatch
onto the shared kernel terminal and readLine resolves the caller’s session
context per call (requires_live_caller_session stays true), so there is no
per-session ownership state to strip on a forward. The child gains no terminal
authority the parent did not already hold, and same_session keeps the cap from
escaping the session. This is the non-destructive capability-model realization of
POSIX “all children share the controlling tty”; the prior
Move/service_regrant_only mint was a policy default, not a state-ownership
requirement, and a destructive Move would have stripped a shell of its terminal
on its first child spawn under full fd inheritance. Two writers reaching the same
terminal serialize at the shared kernel UART; sub-line interleaving between a
parent and a child writing concurrently is an accepted research-surface
limitation, not an authority leak. Proven by make run-posix-terminal-forward.
See research survey for the cross-system analysis that led to this decision (§1 Capability Table Design).
Planned Enhancements (from research)
Tracked in Roadmap Stages 5-6:
- Legacy badge / receiver selector – the current storage field is a
u64per capability hold edge, delivered to endpoint servers on invocation. Existing code still calls it a badge because it began as seL4-style client identity metadata. The active model keeps that field out of service identity: new service capability should use one immutable process session, broker-granted service roots/facets, privacy-preserving endpoint caller-session metadata, and explicit subject disclosure plus a matching disclosure scope when a service needs more than an opaque service-scoped session reference. - Epoch (from EROS) – per-object revocation epoch. Incrementing the epoch invalidates all outstanding references. O(1) revoke, O(1) check.
Current Limitations
- Process-ring blocking remains process-level; private ParkSpace waits are
per-thread.
cap_enter(min_complete, timeout_ns)processes pending SQEs and can block one admitted thread per process until enough CQEs exist or a finite timeout expires. That ring wait is still process-owned and does not make the capability ring itself a per-thread completion queue. Separately, the implemented private ParkSpace path provides process-local per-thread wait/wake on userspace words through compactCAP_OP_PARK/CAP_OP_UNPARKoperations. SharedParkSpace park-words and runtime safe park clients remain future work. - No persistence. Capabilities exist only at runtime.
- Capability transfer is implemented for Endpoint CALL/RECV/RETURN.
Transfer descriptors on the capability ring let callers and receivers copy or
move transferable local caps through IPC messages. Delivery also enforces the
cap hold’s session transfer scope; an unsupported cross-session transfer
fails with
CAP_ERR_TRANSFER_NOT_SUPPORTEDand is reported to the caller instead of being requeued to the endpoint. See Storage and Naming “IPC and Capability Transfer” for the full design. - Transfer ABI (3.6.0 draft). Sideband transfer descriptors are defined in
capos-config/src/ring.rsasCapTransferDescriptor:cap_idis the sender-side local capability-table handle.transfer_modeis eitherCAP_TRANSFER_MODE_COPYorCAP_TRANSFER_MODE_MOVE.xfer_cap_countinCapSqeis the descriptor count.- For CALL/RETURN, descriptors are packed at
addr + lenafter the payload bytes and must be aligned toCAP_TRANSFER_DESCRIPTOR_ALIGNMENT. - Result-cap insertion semantics are defined by
CapCqe:resultreports normal payload bytes, whilecap_countreports how manyCapTransferResult { cap_id, interface_id }records were appended immediately after those payload bytes inresult_addrwhenCAP_CQE_TRANSFER_RESULT_CAPSis set. User space must bound-checkresult + cap_count * CAP_TRANSFER_RESULT_SIZEagainst its requestedresult_len. - Future promise pipelining must target that sideband result-cap namespace:
pipeline_depnames a process-local promised answer, andpipeline_fieldis a zero-basedCapTransferResultrecord index in that answer’s completion. It is not a Cap’n Proto schema field number; the kernel must not traverse opaque result payload bytes to find a capability. - Transfer-bearing SQEs are fail-closed:
- unsupported transfer scope or object class:
CAP_ERR_TRANSFER_NOT_SUPPORTED, - malformed descriptor metadata (invalid mode, reserved bits, non-zero
_reserved0, misalignment, overflow):CAP_ERR_INVALID_TRANSFER_DESCRIPTOR, - all other reserved-field misuse remains
CAP_ERR_INVALID_REQUEST.
- unsupported transfer scope or object class:
- Revocation propagates through object epochs.
CapabilityManager.revokeinvalidates child-local grant copies for the revoked object, and the ring maps revoked ordinary and endpoint use to typedDisconnectedexceptions where a result buffer exists. Broader supervision/restart policy remains future work. - MemoryObject is the mapped bulk-data substrate.
FrameAllocatorreturns ownedMemoryObjectresult caps instead of raw physical addresses. The object exposes metadata plus caller-local map/unmap/protect operations for page-aligned ranges. File I/O, networking, GPU data planes, and zero-copy IPC still need service-level SharedBuffer operations built on this substrate. See Storage and Naming “Shared Memory for Bulk Data” for the broader interface design.
Future Directions
- Broader capability-bearing services. Endpoint CALL/RECV/RETURN already carry copy/move sideband transfer descriptors and install result caps in the receiver’s local table. Remaining work is to use that transport in higher service layers: capability-bearing naming and persistence services, Directory/File and Namespace-style object models, promise pipelining over result-cap indexes, and policy for durable references. See Storage and Naming.
- Persistence. Persistent object references should be restored through a capability-bearing naming or persistence service that can authorize the request and mint a fresh live object. Do not serialize local cap-table handles, endpoint generations, receiver selectors, or server cookies as durable authority.
- Network transparency. Remote capability transport should use connection-local export/import tables and explicit disconnect semantics. A remote Console capability can expose the same typed interface as a local one, but the portable authority is the live object reference, not a global URL or serialized local routing selector.
ABI Evolution Policy
This policy governs externally visible capOS ABIs:
- Cap’n Proto schema in
schema/capos.capnp. - Generated schema bindings checked by
make generated-code-check. - Ring and bootstrap ABI constants and layouts in
capos-config/src/ring.rs,capos-config/src/capset.rs, andcapos-abi/src/lib.rs. - Debug/log formats only when a document explicitly declares them stable.
The current project is still a research tree, not a released platform with a public compatibility promise. Even so, schema and ring changes must follow this policy before external clients, host tools, or out-of-tree runtimes depend on them.
Design Grounding
This policy is grounded in current capOS docs and the checked-in prior-art notes that apply to schema and transport evolution:
docs/architecture/capability-ring.mdfor the implemented process-wide ring, fixed 64-byteCapSqe, fixed 32-byteCapCqe, opcode boundary, and current completion semantics.docs/proposals/ring-v2-smp-proposal.mdfor the undecided future per-thread-ring version-negotiation shape.docs/proposals/error-handling-proposal.mdfor the transport/application error split and unsupported-operation behavior.docs/trusted-build-inputs.mdfor generated-code drift checks and pinned Cap’n Proto tooling.docs/design-risks-register.mdfor the prior open ABI compatibility and Ring v2 compatibility questions.docs/research/capnp-error-handling.mdfor Cap’n Proto exception and schema error-model precedent. OS scheduling, filesystem, networking, and hardware prior-art research does not directly change this schema/ring ABI policy.
Compatibility Classes
Every ABI change must name one class in its task, review, or commit message.
| Class | Meaning | Required handling |
|---|---|---|
| Compatible addition | Existing clients keep working without recompilation or behavior change. | Add tests or generated-code drift evidence. Update docs when semantics matter. |
| Compatible tightening | Existing malformed or previously unspecified inputs fail earlier or more specifically. | Document the rejected shape and expected error. Add hostile coverage when reachable from userspace. |
| Soft deprecation | Old shape still works, but new callers should stop using it. | Mark the field/method/opcode as deprecated in docs and keep a replacement path live through the deprecation window. |
| Breaking change | Existing valid clients can fail, observe different semantics, or require regenerated code. | Requires a proposal or backlog plan, migration notes, compatibility proof or explicit break decision, and task/risk updates when relevant. |
| Internal-only | Not visible outside one crate or generated artifact and not serialized, mapped, or invoked across a boundary. | Normal code review; do not label serialized or mapped data as internal-only. |
Cap’n Proto Schema Rules
Schema interface IDs, method ordinals, struct field ordinals, enum discriminants, union tags, and named constants are stable once checked in.
Allowed compatible changes:
- Add a new field with a new ordinal and a default value that old readers can safely ignore.
- Add a new method with a new ordinal when old clients do not need it.
- Add a new result union arm only when old clients already treat unknown or unsupported domain outcomes as a controlled failure.
- Add a new interface or struct with a fresh ID/name.
- Add documentation that narrows previously undocumented behavior without changing wire compatibility.
Disallowed without a breaking-change plan:
- Reuse a removed field, method, enum, or union ordinal.
- Change the meaning, type, units, authority, or lifetime of an existing field.
- Rename a schema item when generated code or logs expose the old name as a public integration surface.
- Make an optional/defaulted field mandatory for existing callers without a versioned fallback.
- Replace a schema result union with a transport error or vice versa without an error-layer migration note.
Removed schema space stays reserved. If a field or method is retired, leave a comment at the old ordinal explaining why it is reserved and where the replacement lives.
Ring ABI Rules
The ring ABI is a fixed-layout shared-memory contract. CapSqe, CapCqe,
ring header fields, opcodes, flags, transfer descriptor layout, CQE result
codes, and fixed virtual addresses are kernel/userspace ABI.
Rules for the current process-wide ring:
- Do not change the size, alignment, byte order, or meaning of an existing ring struct field without a breaking-change plan.
- Preserve objective layout checks for current ABI structs. At minimum,
capos-config/src/ring.rsmust keep compile-time checks forCapSqe,CapCqe,CapTransferDescriptor, endpoint caller-session metadata, endpoint message headers, and ring capture records. Any new negotiated ring layout must add equivalent checked constants for SQE size, CQE size, transfer descriptor size, ring header offsets, SQE/CQE array offsets, and feature/version fields. - Do not change
SQE_ARRAY_OFFSET,CQE_ARRAY_OFFSET,SQ_ENTRIES,CQ_ENTRIES,RING_VADDR, or fixed SQE/CQE sizes by arithmetic side effect. A change to any of those values is a layout change and must name its compatibility class. - Reserved SQE fields must be rejected unless the opcode explicitly defines them. New meanings for reserved fields require hostile tests that old kernels fail closed.
- New opcodes must start as reserved or unsupported. A reserved opcode should
return
CAP_ERR_UNSUPPORTED_OPCODE; malformed non-reserved opcodes should returnCAP_ERR_INVALID_REQUEST. - New flags must specify whether old kernels reject them, ignore them, or treat them as malformed. Silent ignore is allowed only for flags that cannot carry authority or resource effects.
- New negative CQE result codes must be appended as new constants. Existing negative result codes cannot be renumbered or repurposed.
- Capability transfer descriptors must continue to reject unknown reserved bits until a documented transfer mode consumes them.
Ring v2 or per-thread-ring work must declare whether it is:
- a negotiated compatible extension to the current ring page;
- a new ring layout selected by boot/runtime version negotiation; or
- an intentional ABI break.
That decision belongs in the Ring v2 proposal/backlog before implementation.
Version Negotiation
When an ABI cannot be evolved by compatible addition, introduce an explicit version gate instead of inferring compatibility from struct size or accidental behavior.
Acceptable gates include:
- manifest or boot-package
schemaVersionfields; - a future runtime boot-info field that names ring layout and feature bits;
- interface methods that return a structured unsupported-version result;
- manifest/tooling checks that reject unsupported data versions before boot.
Unsupported versions must fail closed with a stable, documented error. A client must not need to parse debug text to distinguish “unsupported version” from “malformed input”.
Deprecation Window
Before external consumers exist, a deprecation may be removed after the
replacement path, docs, and smokes land in main.
After external consumers are declared for an ABI, deprecated schema or ring surfaces must remain for at least one full selected milestone after the replacement is documented and tested. Removing them earlier is a breaking change and must be called out as such.
Deprecation notes must name:
- the old field, method, opcode, flag, or constant;
- the replacement;
- the last proof target that still exercises the old shape;
- the planned removal condition.
Review Gates
Schema or ring ABI changes must include the relevant checks:
make generated-code-checkforschema/capos.capnpchanges.cargo test-configfor manifest/schema validation changes.cargo test-ring-loomfor ring queue protocol changes.- Compile-time layout assertions and host tests for ring struct size, alignment, offsets, entry counts, and fixed virtual addresses when a ring layout changes.
cargo test-libfor CapTable/capability transfer semantics.- A focused QEMU smoke when a userspace-visible behavior changes.
make docsfor policy or manual changes.
Reviewers should reject ABI changes that lack a compatibility class, migration notes for breaking behavior, or an unsupported-version/error story for new version gates.
Current Open ABI Decisions
- Ring v2 backward compatibility remains undecided. Until it is decided, do not claim per-thread rings are compatible with the current process-wide ring.
- Production release reproducibility remains separate from ABI compatibility.
Final ISO, manifest, and embedded ELF checksums are tracked in
docs/trusted-build-inputs.mdand relevant task records.
Capability Ring
The capability ring is the userspace-to-kernel transport for capability invocation. It avoids one syscall per operation while preserving a typed Cap’n Proto method boundary and explicit completion reporting.
The current error model is documented in Error Handling. Ring CQE status values report transport failures; typed capability exceptions and ordinary schema result unions sit above that transport layer.
Current Behavior
Each non-idle process gets one 4 KiB ring page mapped at RING_VADDR. The page
contains a volatile header, a 16-entry submission queue, and a 32-entry
completion queue. Userspace writes CapSqe records, advances sq_tail, and
uses cap_enter(min_complete, timeout_ns) to make ordinary calls progress.
sequenceDiagram
participant U as Userspace runtime
participant R as Ring page
participant K as Kernel ring dispatcher
participant C as Capability object
U->>R: write CapSqe and advance sq_tail
U->>K: cap_enter(min_complete, timeout_ns)
K->>R: read sq_head..sq_tail
K->>K: validate SQE fields and lock AddressSpace for user buffers
K->>C: call method or endpoint operation
C-->>K: completion, pending, or error
K->>R: write CapCqe and advance cq_tail
K-->>U: return available CQE count
U->>R: read matching CapCqe
Timer polling also processes each current process’s ring before preemption, but
only non-CALL operations and CALL targets that explicitly allow interrupt
dispatch may run there. Ordinary CALLs wait for cap_enter.
Why ordinary CALL waits for
cap_enter: Submitting aCALLSQE is only a shared-memory write. The kernel still needs a safe execution point to drain the ring and run capability code. Timer polling runs in interrupt context, so it must not execute arbitrary capability methods that may allocate, block on locks, mutate page tables, spawn processes, parse Cap’n Proto messages, or perform IPC side effects.cap_enteris the normal process-context drain point: it processes pending SQEs, posts CQEs, and then either returns the available completion count or blocks until enough completions arrive. The design keeps SQE publication syscall-free and batchable, keeps the syscall ABI limited toexitandcap_enter, and avoids turning the timer interrupt into a general capability executor. A future SQPOLL-style path can remove the explicit syscall from the hot path only by running dispatch in a worker context, not from arbitrary timer interrupt execution.
Design
CapSqe is a fixed 64-byte ABI record. CAP_OP_CALL names a local cap-table
slot and method ID plus parameter/result buffers. CAP_OP_RECV and
CAP_OP_RETURN implement endpoint IPC. CAP_OP_RETURN normally returns
successful result bytes to the original caller; with
CAP_SQE_RETURN_APPLICATION_EXCEPTION, its payload is a serialized
CapException and the original caller completes with
CAP_ERR_APPLICATION_EXCEPTION or the truncated application-exception code.
CAP_OP_RELEASE removes a local cap-table slot through the transport.
CAP_OP_CANCEL (opcode 6) cancels a pending endpoint receive posted by the
same process on the same endpoint cap; pipeline_dep carries the receive
SQE’s user_data. CAP_OP_NOP measures the fixed ring path.
CAP_OP_PARK_BENCH (opcode 7) is a measurement-only compact opcode dispatched
only by kernels built with the measure feature; normal kernels reject it as
malformed. CAP_OP_FINISH is ABI-reserved and currently returns
CAP_ERR_UNSUPPORTED_OPCODE.
CAP_OP_RELEASE is deliberately scoped to local transport cleanup. It removes
one holder’s cap-table slot after the SQE is processed, or as part of process
exit cleanup; it does not revoke peer-held caps, cancel delegated authority, or
stand in for an application close method. Services that need security-visible
invalidation must use an explicit control path such as CapabilityManager.revoke,
session expiry, object epochs, or a service-specific close/revoke protocol.
Reviewers should treat claims based only on handle drop, RAII, GC finalizers, or
queued release flushing as local-cleanup claims, not revocation claims.
Opcode boundary: Ring opcodes are kernel ABI, not a loophole around the syscall surface.
cap_enterandexitremain the CPU trap entrypoints, but every accepted authority-bearing or resource-mutatingCAP_OP_*still adds distinct kernel semantics that must pass the capability method / ring opcode / syscall decision graph. No-authority diagnostics such asCAP_OP_NOPare still kernel ABI and must stay side-effect-free and review-visible, but they are not resource authority paths.CAP_OP_PARKandCAP_OP_UNPARKare justified because blocking wait mutates scheduler state, must be thread-owned on the process ring, reserves completion credit for later wake/timeout delivery, and needs compact capability-authorized hot-path framing. They are not a precedent for moving ordinary object methods into the opcode table for convenience.
CAP_OP_CALL may set CAP_SQE_THREAD_OWNED with call_id equal to the owning
thread id. If another thread drains the shared process ring first, the kernel
leaves that SQE at the head instead of consuming it and returns a distinct
owner-head cap_enter result instead of blocking the non-owner behind it. This
is limited to context-sensitive self-thread operations such as
ThreadControl.exitThread; ordinary runtime submissions leave call_id = 0.
CAP_OP_PARK and CAP_OP_UNPARK are compact capability-authorized
operations for process-local ParkSpace. Wait SQEs must set
CAP_SQE_THREAD_OWNED with call_id equal to the owning thread id; a
non-owner cap_enter leaves the SQE at the head just like a thread-owned CALL.
They reject promise-pipeline fields and run only from syscall-context ring
dispatch, not timer polling. A blocking wait consumes the SQE but posts no
caller CQE immediately; instead it reserves one waiter CQE credit, parks the
current thread, and later completes with a non-negative park status. Ordinary
CQE posting treats reserved park credits as unavailable so wake and timeout
delivery cannot lose waiter completions.
The kernel copies user params into preallocated per-process scratch, dispatches
capability methods, copies serialized results into caller-provided result
buffers, and posts CapCqe. Current-process user copies and transfer-descriptor
loads hold the caller’s AddressSpace mutex across permission validation and
the actual HHDM-backed copy/read. A successful method returns non-negative bytes
written. Transport failures are negative CAP_ERR_* codes. Application
exceptions are serialized CapException payloads with
CAP_ERR_APPLICATION_EXCEPTION. Ordinary capability implementation errors and
live endpoint CALL/RETURN target errors use this application-exception path
once a valid target cap or accepted endpoint relationship has been identified;
malformed ring metadata, bad user buffers, lookup failures, and endpoint
rollback/transfer failures stay in the transport namespace.
Transfer-bearing CALL and RETURN SQEs pack CapTransferDescriptor records
after the params/result payload. Successful result-cap transfers append
CapTransferResult records after normal result bytes.
Promise-pipelined CALLs remain rejected by current kernels. When that flag is
enabled, pipeline_dep names a process-local promised-answer identifier, and
pipeline_field selects a zero-based CapTransferResult record from that
answer’s completion. It is not a Cap’n Proto schema field number or payload
path. The kernel resolves dependencies only through the sideband result-cap
records it already owns; normal result bytes stay opaque to the transport.
Future behavior should use the reserved SQE fields for system transport features, not ad hoc per-interface extensions.
Choosing A Capability Method, Ring Opcode, Or Syscall
New kernel functionality should default to a normal typed capability method. The small syscall surface is only the trap surface; the ring opcode table is also a reviewed kernel ABI and must stay narrow. The decision tree below is a full-page reference in the PDF because the branches are easier to read at diagram scale than as compressed prose.
flowchart TD
Start[New kernel-visible operation] --> Ambient{Must it run without any held capability?}
Ambient -- yes --> Trap{Is it process lifecycle or kernel-entry control?}
Trap -- yes --> Syscall[Consider a syscall]
Trap -- no --> RejectAmbient[Reject or redesign around explicit authority]
Ambient -- no --> CapMethod{Can it be expressed as a typed object method?}
CapMethod -- no --> Redesign[Redesign the authority object or transport contract]
CapMethod -- yes --> Hot{Is generic Cap'n Proto CALL materially wrong?}
Hot -- no --> Method[Use CAP_OP_CALL to a capability method]
Hot -- yes --> RingSpecific{Does it need ring/scheduler-specific semantics?}
RingSpecific -- no --> Method
RingSpecific -- yes --> Stable{Is the compact SQE/CQE ABI stable and capability-authorized?}
Stable -- no --> MethodOrDesign[Keep a capability method or write a reviewed design first]
Stable -- yes --> Opcode[Consider a new CAP_OP_* opcode]
Use a normal capability method when the operation is control plane, policy driven, service-specific, infrequent, or naturally represented by Cap’n Proto params/results. Process spawning, credential checks, storage naming, shell or network policy, virtual-memory control-plane calls, and most device-specific commands belong here unless measurement and design review prove otherwise.
Consider a compact ring opcode only when all of these are true:
- The operation is a hot path or scheduler path where generic Cap’n Proto framing is materially wrong.
- The operation has a small, stable field layout that fits the existing SQE/CQE model without per-interface ad hoc extensions.
- It needs ring-specific behavior such as thread ownership, reserved completion credit, CQ ordering/backpressure, asynchronous completion delivery, or interaction with the process ring head.
- It remains authorized by a held capability in
cap_id, not by ambient process identity or guessed kernel object names. - It cannot be handled as a normal capability method plus a future generated fast client without losing an essential scheduler or transport invariant.
Consider a new syscall only when the operation is about entering or leaving the kernel execution context itself and cannot sensibly be authorized by a capability already available to the process. That bar is intentionally higher than the opcode bar. Ordinary resource operations should not become syscalls just because they are common.
Full-SMP Direction
The current process-wide ring is not the target ABI for full SMP. Once sibling threads in one process can run on different CPUs, a shared process CQ would force userspace to serialize completion consumption or the kernel to invent specific-wait state on top of circular-buffer slots.
The selected future direction is per-thread ring ownership, documented in
Ring v2 For Full SMP. In that model,
cap_enter(min_complete, timeout_ns) keeps its current aggregate wait shape,
but the aggregate is the current thread’s CQ. Completion paths post by
generation-checked ThreadRef, while result-cap transfers and authority still
belong to the process cap table.
The first Ring v2 implementation should use kernel-chosen child-thread ring
mappings. The initial fixed RING_VADDR mapping becomes a compatibility
special case backed by the same RingEndpoint lifetime and waiter rules as
child-thread rings. Runtime-supplied ring address ranges are deferred until
VirtualMemory can reserve a ring arena without racing ordinary mappings.
The initial Phase C multi-CPU scheduler proof may continue to use the current process-wide ring as long as userspace serializes ring consumption. Ring v2 is the target for full SMP with sibling threads from one process running and waiting independently on different CPUs.
A runtime reactor can bridge the current process-wide ring for multithreaded
runtimes before Ring v2: one runtime-owned drainer consumes the process CQ,
matches completions by user_data, and wakes waiting threads through
ParkSpace. That bridge is not the full-SMP kernel ABI.
Invariants
- SQ and CQ sizes are powers of two and fixed by the ABI.
- Unknown opcodes fail closed;
FINISHis reserved, not silently accepted. - Reserved fields must be zero for currently implemented opcodes, except
CAP_SQE_THREAD_OWNEDCALL and PARK SQEs may carry the owning thread id incall_id. - Park PARK/UNPARK SQEs must keep unsupported fields zero and must not be dispatched from timer context.
cap_enterrejectsmin_complete > CQ_ENTRIES.- User-buffer validation and copy/read must hold the owning process
AddressSpacemutex for CALL params/results, RECV result buffers, RETURN payloads, transfer descriptors, and deferred same-process completions. - Timer dispatch must not run capabilities that allocate, block on locks, or mutate page tables unless the cap explicitly opts in.
- Per-dispatch SQE processing is bounded by
SQ_ENTRIES. - Transfer descriptors must be aligned, valid, and bounded by
MAX_TRANSFER_DESCRIPTORS. - Promise-pipelined dependency resolution must use sideband
CapTransferResultordinals, never general Cap’n Proto result traversal in the kernel.
Code Map
capos-config/src/ring.rs- shared ring ABI, opcodes, errors, SQE/CQE structs, endpoint message headers, transfer records.kernel/src/cap/ring.rs- kernel dispatcher, SQE validation, CQE posting, cap calls, endpoint CALL/RECV/RETURN, release, transfer framing.kernel/src/arch/x86_64/syscall.rs-cap_entersyscall.kernel/src/sched.rs- timer polling, cap-enter blocking, direct IPC wake.kernel/src/process.rs- ring page allocation and mapping.capos-rt/src/ring.rs- runtime ring client, pending calls, transfer packing, result-cap parsing.capos-rt/src/entry.rs- single-owner runtime ring client token and release queue flushing.capos-config/tests/ring_loom.rs- bounded producer/consumer model.
Validation
cargo test-ring-loomvalidates SQ/CQ producer-consumer behavior, capacity, FIFO, CQ overflow/drop behavior, and corrupted SQ recovery.make runexercises Console CALLs, reserved opcode rejection, ring corruption recovery, NOP, fairness, transfers, and endpoint IPC.make run-measureexercises measurement-only counters, dispatch segment cycle summaries, the NullCap baseline, the ParkBench compact-versus-generic comparison, and the real ParkSpace blocked/resume timing path.cargo test-configcovers shared ring layout and helper invariants.make capos-rt-checkchecks userspace runtime ring code under the bare-metal target.
Open Work
- Implement
CAP_OP_FINISHas part of the system Cap’n Proto transport. - Implement promise pipelining using the reserved
pipeline_depanswer ID andpipeline_fieldresult-cap ordinal mapping. - Define LINK, DRAIN, and MULTISHOT semantics before accepting those flags.
- Add runtime-level ParkSpace wrappers and completion demultiplexing on top of the compact opcodes.
- Add the runtime reactor bridge for multithreaded use of the current process ring, then replace it as the kernel fast path with per-thread Ring v2 completion ownership.
- Add SQPOLL after SMP gives the kernel a spare execution context.
Error Handling
capOS uses three error layers for capability invocation. Keeping the layers separate prevents malformed transport state from looking like a service-domain decision, and prevents ordinary business outcomes from becoming generic kernel exceptions.
Current Model
| Layer | Carrier | Use |
|---|---|---|
| Transport status | Negative CapCqe.result codes | Ring, opcode, lookup, buffer, transfer, and dispatch failures where no safe typed payload boundary exists. |
| Capability exception | Serialized CapException plus CAP_ERR_APPLICATION_EXCEPTION or CAP_ERR_APPLICATION_EXCEPTION_TRUNCATED | Capability-level infrastructure failures after a target capability or accepted endpoint relationship exists. |
| Schema result union | Interface-specific result payload | Expected service or domain outcomes such as not-found, denied-by-policy, conflict, invalid domain input, or accepted/rejected business results. |
Transport failures are intentionally small and mechanical. Examples include a bad SQE layout, an invalid params or result buffer, an unsupported opcode, a malformed transfer descriptor, or a capability lookup that fails before a live target object is identified.
Capability exceptions are for infrastructure failures at a valid capability boundary: target gone, target overloaded, method unimplemented, argument value rejected by the documented capability contract, or a target-side invariant failure. The exception message is diagnostic and must not carry kernel pointers, secret bytes, or unrelated process-private state.
Schema result unions are the normal application surface. A filesystem
notFound, service-level permissionDenied, ordinary conflict, or accepted
conditional rejection belongs in the interface result, not in CapException.
Current Transport Namespace
The ring transport uses signed 32-bit completion results. Non-negative values
are opcode-specific successes. Negative values are defined in
capos-config/src/ring.rs:
| Code | Name | Meaning |
|---|---|---|
-1 | CAP_ERR_INVALID_REQUEST | Malformed request metadata or a non-reserved opcode value. |
-2 | CAP_ERR_INVALID_PARAMS_BUFFER | Params buffer is unmapped, out of range, or unreadable. |
-3 | CAP_ERR_INVALID_RESULT_BUFFER | Result buffer is unmapped, out of range, or unwritable. |
-4 | CAP_ERR_INVOKE_FAILED | Lookup or dispatch failed before a successful typed result was produced. |
-5 | CAP_ERR_UNSUPPORTED_OPCODE | Opcode is reserved but not dispatched by this kernel. |
-6 | CAP_ERR_TRANSFER_NOT_SUPPORTED | Transfer mode or descriptor layout is recognized but unsupported. |
-7 | CAP_ERR_INVALID_TRANSFER_DESCRIPTOR | Transfer descriptor layout is malformed or carries reserved bits. |
-8 | CAP_ERR_TRANSFER_ABORTED | Transfer transaction failed without committing partial capability state. |
-9 | CAP_ERR_APPLICATION_EXCEPTION | A structured CapException was written to the result buffer. |
-10 | CAP_ERR_APPLICATION_EXCEPTION_TRUNCATED | An exception occurred, but no complete detail fit in the result buffer. |
Capability Exceptions
schema/capos.capnp defines ExceptionType and CapException. The current
exception kinds are Failed, Overloaded, Disconnected,
Unimplemented, and the capOS-specific InvalidArgument.
The kernel serializes ordinary capability implementation errors through
kernel/src/cap/ring.rs. capos-rt/src/client.rs decodes application-exception
CQEs into ClientError::Application(ApplicationException). The runtime treats
Disconnected as a broken local handle.
A path should produce CapException only when all of these are true:
- a live target capability was identified, or an endpoint operation is acting on an already accepted call, receive, or return relationship;
- the failure is attributable to capability semantics rather than malformed ring metadata;
- the affected caller supplied a result buffer large enough to receive the serialized exception, otherwise the result is the truncated exception code.
Endpoint RETURN
Endpoint RETURN is asymmetric because the result belongs to the original caller,
not the returning receiver. A server can set
CAP_SQE_RETURN_APPLICATION_EXCEPTION on CAP_OP_RETURN to return a serialized
CapException to the caller. The server’s own RETURN completion reports only
whether the return transport succeeded.
Revoked endpoint RETURN also reports Disconnected to the original caller when
that caller supplied a result buffer. Receiver-side lookup and CQ-space failures
that cannot be tied to the caller’s result buffer remain transport failures.
Code Map
capos-config/src/ring.rs- transport error constants, SQE/CQE layout, and endpoint transport flags.schema/capos.capnp-ExceptionType,CapException, and per-interface result unions.kernel/src/cap/ring.rs- exception serialization, ring dispatch, endpoint RETURN exception handling, andInvalidArgumentsentinel mapping.kernel/src/cap/endpoint.rs- endpoint queue, in-flight call, and revoked endpoint state.capos-rt/src/client.rs- runtime decoding intoClientError.docs/architecture/capability-ring.md- ring ABI and opcode dispatch rules.docs/architecture/ipc-endpoints.md- endpoint CALL/RECV/RETURN transport.
Validation
make run-spawncovers cross-process endpoint RETURN propagation forFailed,Overloaded, andUnimplemented, plus reserved opcode and no-result-buffer exception paths.make run-smokecovers same-process endpoint use and revoked-cap behavior.cargo test-libcovers cap-table stale-slot and transfer rollback behavior that the transport error paths depend on.cargo test-ring-loomcovers ring queue behavior that completion delivery depends on.
Open Work
- Promise pipelining and future multishot/link/drain ring behavior must carry the same three-layer error split.
- Long-lived services should prefer stable result-union variants over generic text errors for ordinary domain outcomes.
- Future external clients need compatibility rules for exception taxonomy evolution once the ABI is treated as cross-version or separately released.
Design Grounding
The archival decision record is Error Handling. Relevant research notes are Cap’n Proto Error Handling and OS Error Handling.
IPC and Endpoints
Endpoints let one process serve capability calls to another process without adding a separate IPC syscall surface. The same ring transport carries ordinary kernel capability calls and cross-process endpoint calls.
Current Behavior
An Endpoint is a kernel capability object with queues for pending client
calls, pending server receives, and in-flight calls awaiting RETURN. A service
that owns the raw endpoint can receive and return. Importers receive a
ClientEndpoint facet that can CALL but cannot RECV or RETURN.
sequenceDiagram
participant Client
participant ClientRing as Client ring
participant Endpoint
participant ServerRing as Server ring
participant Server
Server->>ServerRing: submit RECV on raw endpoint
Client->>ClientRing: submit CALL on client facet
ClientRing->>Endpoint: deliver params and caller result target
Endpoint->>ServerRing: complete RECV with EndpointMessageHeader and params
ServerRing-->>Server: cap_enter returns completion
Server->>ServerRing: submit RETURN with call_id and result
ServerRing->>Endpoint: take in-flight target
Endpoint->>ClientRing: post caller CQE with result and receiver metadata
ClientRing-->>Client: wait returns matching completion
If a CALL arrives before a RECV, the endpoint queues bounded params. If a RECV arrives before a CALL, the endpoint queues the receive request. Delivered calls move into the in-flight queue until the server returns or cleanup cancels them.
Design
Endpoint IPC is capability-oriented. The manifest can export a raw endpoint from one service; importers get a narrowed client facet. This keeps server-only authority out of clients without introducing rights bitmasks.
CALL and RETURN may carry sideband transfer descriptors. Copy transfers insert a new cap into the receiver while preserving the sender. Move transfers reserve the sender slot, insert the destination, then remove the source on commit. RETURN-side transfers append result-cap records after the normal result payload. Cross-session delivery is additionally checked against the cap hold transfer scope: same-session caps fail closed, cross-session-shareable caps may cross, and service-regrant-only caps need a trusted fixed-session regrant path. CALL SQEs may also request field-granular session disclosure. The kernel intersects that request with the invoked cap’s disclosure scope before delivering any subject fields, so a request without scope or scope without a request exposes only the default opaque caller-session metadata.
Legacy receiver metadata is stored on cap-table hold edges and delivered to
servers with endpoint invocation metadata, so one endpoint can distinguish
transitional callers without one object per caller. Some ABI structs still name
this field badge; that name is compatibility state, not the normal
shared-service authority model. Session-bound invocation context is the
replacement model for normal workload paths: every normal process has one
immutable session context, endpoint calls expose privacy-preserving
caller-session metadata by default, and shared services derive user-facing
state from broker-granted capabilities plus service-scoped session references.
See Session Context.
Delegated Client Relabeling Containment
The Gate 0 containment rule is narrow: a process that holds an imported
ClientEndpoint may delegate that same client identity, but it may not mint a
sibling identity by setting another legacy badge during spawn. Endpoint owners
and explicit trusted mint paths remain transitional mechanisms for low-level
tests. Normal shared services use broker-granted roots/facets plus
session-bound invocation context instead of service-object badges.
Normal capos-shell help and smoke expectations must therefore omit arbitrary
badge N launch examples. Omitted shell badge syntax preserves the source
identity instead of selecting badge zero. Legacy badge syntax may remain
reachable only as a debug or hostile-test input, and QEMU coverage for the
Telnet blocker must prove both explicit client @name badge N and low-level
legacy badge-zero relabel encodings from a nonzero delegated client facet fail
closed.
Shell-serviced stdio bridges now bind the active child wait to the first opaque
live caller-session reference seen on the bridge endpoint. A later call from a
different live caller session is answered with an empty result and the child is
terminated; transferred caps are released before either normal transfer
rejection or caller-session rejection returns. Normal StdIO.close is treated
as a clean child close rather than a security rejection.
Future IPC should add notification objects for lightweight signaling and promise pipelining for Cap’n Proto-style dependent calls.
Invariants
- Only raw endpoint holders may RECV or RETURN.
- Imported endpoint caps are
ClientEndpointfacets and must reject RECV and RETURN from userspace. - Delegating an imported client facet must preserve its server-visible object identity. Only endpoint owners or explicit trusted mint paths may create sibling client identities, and normal services should not treat that identity as user/session authority.
- Endpoint queues are bounded by call count, receive count, in-flight count, per-call params, and total queued params.
- Each in-flight call has a kernel-assigned non-zero
call_id. - CALL delivery copies params into kernel-owned queued storage before the caller can resume.
- Move transfer commit must not leave both source and destination live.
- Transfer rollback must preserve source authority if destination insertion or result delivery fails.
- Process exit must cancel queued state involving that pid and wake affected peers when possible.
Code Map
kernel/src/cap/endpoint.rs- endpoint queues, client facet, call IDs, cancellation by pid.kernel/src/cap/ring.rs- endpoint CALL/RECV/RETURN dispatch, result copying, deferred cancellation CQEs.kernel/src/cap/transfer.rs- transfer descriptor loading and transaction preparation.capos-lib/src/cap_table.rs- cap-table transfer primitives and rollback.kernel/src/cap/mod.rs- manifest export resolution and client-facet construction.capos-config/src/ring.rs-EndpointMessageHeader, transfer descriptors, transfer result records, endpoint opcodes.demos/capos-demo-support/src/lib.rs- endpoint, IPC, transfer, and hostile IPC smoke routines.demos/endpoint-roundtrip,demos/ipc-server,demos/ipc-client- QEMU smoke binaries.demos/ipc-zerocopy-producer,demos/ipc-zerocopy-consumer- QEMU smoke for the multi-message shared-buffer zero-copy IPC pattern.
Validation
make run-smokevalidates same-process endpoint RECV/RETURN, cross-process IPC, endpoint exit cleanup, legacy badged calls, transfer success/failure paths, and clean halt.make run-spawnvalidates init-spawned endpoint-roundtrip, server, and client processes.make run-memoryobject-sharedvalidates a one-shot shared-buffer handoff over an endpoint cap transfer.make run-ipc-zerocopyvalidates the multi-message zero-copy IPC pattern at the substrate level: the producer transfers oneMemoryObjectto the consumer and then exchanges four record payloads through the shared mapping while endpoint CALLs carry only sequence numbers and checksums. The demo drives raw SQE/CQE construction throughcapos-demo-supportrather than a typed runtime client and uses an ad-hoc seq+checksum framing because the typedSharedBufferABI, ring-shaped producer/consumer metadata, and notification primitives are still pending; production services (File.readBuf,BlockDevice.readBlocks, NIC RX/TX rings) will reuse the sameMemoryObjectsubstrate through that future surface, not the demo’s framing.cargo test-libcovers cap-table transfer preflight, provisional insertion, commit, rollback, stale generation, and slot exhaustion cases.cargo test-ring-loomcovers ring queue behavior that endpoint IPC depends on for completion delivery.
Open Work
- Add notification objects for signal-style events.
- Add Cap’n Proto promise pipelining after endpoint routing can resolve dependent answers.
- Add a typed
SharedBuffercapability surface (ring-shaped producer/consumer metadata, completion signaling, lifetime/quota rules) on top of the rawMemoryObjectsubstrate exercised bymake run-ipc-zerocopy. - Add epoch-based revocation if broad authority invalidation becomes necessary.
Authority Graph and Resource Accounting for Transfer
This document defines the authority graph and resource-accounting contract
originally tracked as Security Verification Track S.9 in
docs/proposals/security-and-verification-proposal.md. It covers:
- capability transfer (
xfer_cap_count, copy/move, rollback) - ProcessSpawner prerequisites (spawn quotas and result-cap insertion)
Security Verification Track S.9 is complete when this design contract is
concrete enough to guide implementation. The invariants and acceptance
criteria below are implementation gates for capability transfer,
ProcessSpawner, Security Verification Track S.8, and Security Verification
Track S.12 follow-up work, not requirements for declaring the Security
Verification Track S.9 design artifact complete. Current capability-semantics
follow-up items live in docs/backlog/stage-6-capability-semantics.md.
Current Implementation and Target Contract
The current implementation defines ResourceLedger fields in
capos-lib/src/cap_table.rs for capability slots, outstanding calls, scratch
bytes, frame-grant pages, and virtual-reservation pages. Cap-slot and
frame/virtual page reservations are wired into current reservation paths.
Outstanding-call and scratch-byte counters are present ledger fields but are
not yet fully wired into reservation/preflight paths. Endpoint queue quota,
diagnostic log-rate accounting, and CPU token-bucket accounting below are
target contract fields for future implementation work, not current
ResourceLedger members.
1. Authority Graph Model
Authority is modeled as a directed multigraph:
- Nodes:
Process(Pid)Object(ObjectId)(kernel object identity, independent of per-processCapId)
- Edges:
Hold(Pid -> ObjectId)with metadata:cap_id(table-local handle)interface_idbadgetransfer_mode(copy,move,non_transferable)origin(kernel,spawn_grant,ipc_transfer,result_cap)
Security invariant A1: all authority is represented by Hold edges; no
operation can create object authority outside this graph.
Security invariant A2: each process mutates only its own CapTable edges except
through explicit transfer/spawn transactions validated by the kernel.
Security invariant A3: for every live Hold edge there is exactly one
cap_id slot in one process table referencing the object generation.
2. Per-Process Resource Ledger and Quotas
Each process owns a kernel-maintained ResourceLedger. For wired reservation
paths, enforcement is fail-closed at reservation time (before side effects).
The target contract completes enforcement for present-but-unwired fields and
extends the ledger with endpoint queue, diagnostic log, and CPU budget
counters.
ResourceLedger {
// Current ledger fields.
cap_slots_used / cap_slots_max
outstanding_calls_used / outstanding_calls_max
scratch_bytes_used / scratch_bytes_max
frame_grant_pages_used / frame_grant_pages_max
virtual_reservation_pages_used / virtual_reservation_pages_max
// Target/future fields.
endpoint_queue_used / endpoint_queue_max
log_bytes_window_used / log_bytes_per_window (token bucket)
cpu_time_us_window_used / cpu_budget_us_per_window (token bucket)
}
Initial quota profile for Stage 6/5.2 bring-up (tunable by kernel config):
cap_slots_max: 256outstanding_calls_max: 64scratch_bytes_max: 256 KiBframe_grant_pages_max: 4096 pages (16 MiB at 4 KiB pages)virtual_reservation_pages_max: kernel-configured virtual reservation budget- Future target fields:
endpoint_queue_max128 messages,log_bytes_per_window64 KiB/sec with 256 KiB burst, andcpu_budget_us_per_window10,000 us per 100,000 us window.
Security invariant Q1: no counter may exceed its max.
Security invariant Q2: every resource reservation has a matched release on all success, error, timeout, process-exit, and rollback paths.
Security invariant Q3: quota checks for transfer/spawn happen before mutating sender or receiver capability state.
3. Diagnostic Rate Limiting and Aggregation
Repeated invalid ring/cap submissions are aggregated per process and error key.
- Key:
(pid, error_code, opcode, cap_id_bucket) - Buckets:
cap_id_bucket = exact cap idfor stale/invalid cap failurescap_id_bucket = 0for structural ring errors
- Per-key token bucket: allow first
N=4emissions/sec, then suppress. - Suppressed counts are flushed once per second as one summary line:
pid=X invalid submissions suppressed=Y last_err=...
Security invariant D1: invalid submission floods cannot consume unbounded serial bandwidth or scheduler time in log formatting.
Security invariant D2: suppression never hides first-observation diagnostics for
a new (pid,error,opcode,cap bucket) key.
4. Transfer and Rollback Semantics
Transfers (xfer_cap_count > 0) use a kernel transfer transaction
(TransferTxn) scoped to a single SQE dispatch. The current ring ABI does not
provide kernel-owned SQE sequence numbers or a durable transaction table, so
userspace replay of a copy-transfer SQE is repeatable: each replay is treated
as a new copy grant. Move-transfer replay fails closed after the source slot is
removed or reserved by the first successful dispatch.
Future exactly-once replay suppression requires transaction identity scoped to
(sender_pid, call_id, sqe_seq) and a monotonic transfer epoch. Until that
exists, exactly-once claims apply only within one dispatch attempt, not across
malicious rewrites of shared SQ ring indexes.
Sensitive interfaces must choose their transfer mode deliberately:
| Transfer mode | Semantics | Suitable for | Required negative tests |
|---|---|---|---|
copy | Repeatable grant; sender keeps authority and replaying the same copy-transfer SQE can mint another receiver hold. | Stateless or explicitly shareable caps where duplicate receivers are acceptable and audited. | Replay mints only allowed duplicate holds; quota exhaustion fails closed; copy across forbidden session/transfer scope is rejected. |
move | Single authority handoff; sender loses the source hold after successful destination insertion. Replay fails closed after source reservation/removal. | Linear resources, accepted sockets, terminal sessions, one-shot result caps, and authority that should have one active owner. | Replay after success fails; rollback restores sender on partial failure; receiver cannot observe authority before commit. |
non_transferable | No IPC/spawn transfer. | Process-local control caps, raw spawn/network/device authority, private keys, and caps whose authority depends on caller-local state. | IPC/spawn transfer attempts fail closed and leave sender/receiver tables unchanged. |
Copy-transfer replay is therefore acceptable only for caps whose interface contract says repeated receivers are safe. Sensitive caps must be move-only or non-transferable until the interface has an explicit replay threat model and hostile tests.
Phases:
Prepare:- validate SQE transport fields and
xfer_cap_count - validate sender ownership/generation/transferability for each exported cap
- reserve receiver quota (
cap_slots,outstanding_calls, scratch if needed) - pin sender entries in txn state (no sender table mutation yet)
- validate SQE transport fields and
Commit:- insert destination edges exactly once
- for
copy: increment object refcount/export ref - for
move: remove sender slot only after destination insertion succeeds - publish completion/result
Finalize:- release transient reservations
- mark txn terminal (
committedoraborted)
On any error before Commit, rollback is full:
- receiver inserts are not visible
- sender slots/refcounts unchanged
- reservations released
- CQE returns transfer failure (
CAP_ERR_TRANSFER_ABORTED/ subtype)
On error during Commit, kernel executes compensating rollback to preserve
exactly-once visibility: either all inserts are visible with matching sender
state transition, or none are visible.
Security invariant T1: each transfer descriptor is applied at most once within a single SQE dispatch attempt.
Security invariant T2: move transfer is atomic from observer perspective; no state exists where both sender and receiver lose authority due to partial apply.
Security invariant T3: copy-transfer SQE replay is explicitly repeatable until kernel-owned transaction identity exists. Move-transfer replay fails closed after source removal or source reservation.
Security invariant T4: CAP_OP_RELEASE removes one local hold edge only from
the caller table and decrements remote export refs exactly once.
5. Integration with 3.6 Capability Transfer
3.6 implementation must consume this design directly:
CALLandRETURNvalidate all currently-reserved transfer fields fail-closed when unsupported.xfer_cap_countpath is wired throughTransferTxn(no ad hoc direct inserts).- Badge propagation is explicit in transfer descriptors and copied into destination edge metadata.
CAP_OP_RELEASEuses the same authority ledger and refcount bookkeeping.
3.6 acceptance criteria:
- Copy transfer produces one new receiver edge and retains sender edge.
- Move transfer produces one new receiver edge and deletes sender edge atomically.
- Any transfer failure leaves sender and receiver
CapTables unchanged. - Copy replay is an explicit repeatable-grant policy until a kernel-owned transaction identity is added; move replay fails closed after source removal or reservation.
CAP_OP_RELEASEon stale/non-owned cap fails closed without mutating other process tables.
6. Integration with 5.2 ProcessSpawner Prerequisites
5.2 must use the same accounting and transfer machinery:
spawn()preflights child quotas (cap_slots,outstanding_calls,scratch,frame_grant_pages, endpoint queue baseline) before mapping child memory or scheduling.- Parent-provided
CapGrantentries are inserted via the same transfer transaction semantics (copy for initial grants in 5.2.2). - Returned
ProcessHandleis inserted through the standard result-cap insertion path and accounted as a normal cap slot. - Child setup rollback must unwind:
- address space mappings
- ring page
- CapSet page
- kernel stack
- allocated frames
- provisional capability edges/reservations
5.2 acceptance criteria:
- Spawn failure at any step leaves no child-visible process and no leaked ledger usage.
- Successful spawn accounts all child bootstrap resources within quotas.
- Parent and child cap-table accounting remains balanced under repeated spawn/exit cycles.
ProcessHandle.waitand exit cleanup release outstanding-call/scratch/frame usage deterministically.
7. Implementation Notes for Verification Tracks
This design unblocks:
- Security Verification Track S.8 hostile-input tests for quota and invalid-transfer failures.
- Security Verification Track S.12 Kani bounds refresh for ledger and transfer invariants.
- Target 12 in
docs/proposals/security-and-verification-proposal.mdwith explicit allocator hooks and fail-closed exhaustion behavior.
Userspace Runtime
The userspace runtime owns the repeated mechanics that every service needs: bootstrap validation, heap initialization, typed capability lookup, ring submission, completion matching, application exception decoding, and handle lifetime.
Related
- Go VirtualMemory Contract defines the caller-buffer reserve, commit, and decommit methods allocator paths need.
- Programming Languages summarizes current native Rust support and planned language-runtime tracks.
- Memory Management documents the implemented kernel
VirtualMemoryandMemoryObjectbehavior. - Go Runtime is the owning language runtime proposal; LLVM Target records the Go runtime OS hooks that drive this work.
Current Behavior
Runtime-owned _start receives (ring_addr, pid, capset_addr), initializes a
fixed heap, validates the ring address, reads the read-only CapSet page, installs
an emergency Console panic path when available, calls capos_rt_main(runtime),
and exits with the returned code.
The Runtime lends out at most one RuntimeRingClient at a time. The client
wraps the raw ring page, keeps request buffers alive until completions are
matched, handles out-of-order completions, packs copy-transfer descriptors, and
parses result-cap records. Owned runtime handles queue CAP_OP_RELEASE when the
last local reference is dropped; the release queue flushes when a ring client is
borrowed or dropped, or when code calls Runtime::flush_releases() explicitly.
Promise placeholders are currently bookkeeping only; their future SQE
coordinates map AnswerId.raw() to pipeline_dep and a result-cap record index
to pipeline_field.
Design
The runtime separates non-owning bootstrap references from owned local handles.
CapSet entries produce typed Capability<T> values only when the interface ID
matches the requested type, and the same manifest-order CapSet entries remain
available for diagnostic and shell surfaces that need to list or inspect what a
process was actually granted. Result-cap adoption performs the same interface
check before producing OwnedCapability<T>.
Typed clients are thin wrappers over the ring client. They encode Cap’n Proto
params, submit CALL SQEs, wait for a matching CQE, decode transport errors, and
decode kernel-produced CapException payloads into client errors. Endpoint
servers can use submit_endpoint_return_exception() to return a serialized
CapException to the original caller over the same endpoint RETURN path.
The handwritten TimerClient exposes monotonic now reads and sleep calls
over the same completion-matching path.
The handwritten VirtualMemoryClient exposes map, reserve, commit, decommit,
unmap, and protect calls for runtime heap/arena allocation over anonymous user
pages. It has both the ordinary allocation-backed async methods and synchronous
caller-buffer methods for allocator growth paths that cannot allocate while
asking the kernel for more memory. This matches the reserve/commit/decommit
surface specified in
Go VirtualMemory Contract.
The handwritten ThreadControlClient exposes current-process FS-base reads and
updates for runtimes that need to swap a language-managed TLS base after process
startup.
The 7.1.0 threading contract keeps one process ring and the runtime’s
single-owner ring-client invariant for the first in-process threading
implementation. Future multi-threaded runtimes must serialize blocking ring
entry through capos-rt until a runtime reactor or Ring v2 lands. The reactor
bridge uses one runtime-owned CQ drainer plus ParkSpace-backed wait records;
the full-SMP kernel target is per-thread rings, where cap_enter waits on the
current thread’s CQ. After 7.2, the existing ThreadControlClient methods apply
to the current thread’s FS base rather than to a process-wide saved FS base.
ThreadControl.exitThread and the raw exit(code) syscall both terminate the
current thread; the process exits when its last live thread exits.
The 7.2.3 park slice adds a process-local ParkSpace marker type and compact
CAP_OP_PARK / CAP_OP_UNPARK operations. capos-rt should expose
those operations as runtime synchronization primitives in a later slice; the
current thread-lifecycle proof uses raw SQEs so the runtime does not
prematurely claim the park user_data namespace. Blocking park wait is not
an ordinary
RuntimeRingClient call: the wait SQE must be thread-owned for the current
thread, and the runtime must reserve park user_data values,
write the wait SQE under its ring-submission lock, release that lock before
cap_enter, and demultiplex park CQEs into runtime-owned wait slots so a
sibling thread can still submit the wake. The temporary single-thread park
fallback remains only as the pre-thread runtime checkpoint proof.
Future generated clients should preserve this split: transport lifetime and completion matching belong in the runtime, while interface-specific encoding belongs in generated or handwritten client wrappers.
Invariants
ring_addrmust equalRING_VADDR; runtime bootstrap rejects any other address.- The CapSet header magic/version must validate before lookup.
- CapSet handles are non-owning unless explicitly adopted.
- Only one runtime ring client may be live at a time for a process.
- Until Ring v2, multithreaded generic client waits must flow through a runtime reactor/demux path rather than letting multiple threads consume the process CQ directly.
- Park wait must not hold the live runtime ring client while the kernel parks the current thread.
- Request params and result buffers must outlive their matching CQE.
- A result cap can be consumed only once and only with the expected interface ID.
- Promise placeholders must map to sideband result-cap record indexes, not schema field paths.
- Dropping the final owned handle queues exactly one local
CAP_OP_RELEASE;Runtime::flush_releases()forces queued releases and reports rejected kernel release results. - Release flushing treats stale or already-removed caps as non-fatal cleanup.
Code Map
capos-rt/src/entry.rs-_start,Runtime, bootstrap validation, single-owner ring token, release queue flushing.capos-rt/src/alloc.rs- fixed userspace heap initialization.capos-rt/src/capset.rs- typed CapSet lookup and manifest-order iteration wrappers.capos-rt/src/ring.rs- ring client, pending calls, completion matching, copy-transfer packing, result-cap parsing.capos-rt/src/client.rs- Console, TerminalSession, BootPackage, ProcessSpawner, ProcessHandle, VirtualMemory, Timer, ThreadControl, ThreadSpawner, and ThreadHandle clients, and exception decoding.capos-rt/src/lib.rs- typed capability marker types and owned handle reference counting.capos-rt/src/panic.rs- emergency Console output path.capos-rt/src/syscall.rs- raw syscall instructions and public syscall wrappers, including the hostile smoke probe for the removed ambient write syscall.targets/x86_64-unknown-capos.json- userspace target specification.tools/check-userspace-runtime-surface.sh- source check that keeps runtime primitives owned bycapos-rt.init/src/main.rs,capos-rt/src/bin/smoke.rs, andshell/src/main.rs- current runtime users.
Validation
make capos-rt-checkbuilds the runtime smoke binary againsttargets/x86_64-unknown-capos.json, matching the booted userspace target.make init-capos-build,make demos-capos-build,make shell-capos-build, andmake capos-rt-capos-buildexpose focused custom-target build wrappers for the current userspace crates and runtime smoke binary.tools/check-userspace-runtime-surface.shverifiesinit,demos, andshelldo not define_start, panic handlers, global allocators, raw syscall instructions, or entry-point macros outsidecapos-rt.make run-smokevalidates runtime entry, typed Console calls, exception decoding, owned handle release, result-cap parsing through IPC, and clean process exit.make run-spawnvalidatesProcessSpawnerClient,ProcessHandleClient,VirtualMemoryClient,TimerClient,ThreadControlClient,ThreadSpawnerClient,ThreadHandleClient, result-cap adoption, and release behavior under init spawning. Thesingle-thread-runtimechild proves the first runtime-shaped checkpoint over caller-buffer VirtualMemory calls and Timer; thethread-lifecyclechild proves in-process create, self-join rejection, join, detach, last-threadexitThread, and private ParkSpace wait/wake correctness.make run-shellvalidates CapSet iteration, capability inspection, typed application-error decoding, guest session metadata, exact-grant spawning, ProcessHandle waits, and stale-handle release behavior in the focused shell-launch proof manifest.make run-terminalvalidatesTerminalSessionClientwrites, bounded line reads, hidden-echo input handling, and structured cancellation in the focused terminal proof manifest.cd capos-rt && cargo test --lib --target x86_64-unknown-linux-gnucovers host-testable runtime invariants when run explicitly.
Open Work
- Add generated client bindings after the schema surface stabilizes.
- Implement promise/answer transport semantics beyond current placeholders.
- Add typed ParkSpace clients with runtime-owned
user_datademultiplexing. - Define release behavior for queued handles when a process exits before the release queue flushes.
Memory Management
Memory management gives the kernel controlled ownership of physical frames, separates user processes, enforces page permissions, and exposes memory authority only through explicit capabilities.
Related
- Go VirtualMemory Contract
records the reserve/commit/decommit contract that extended the
VirtualMemoryimplementation for Go-style arena allocation. - Userspace Runtime describes the typed
VirtualMemoryClientsurface used by allocator and runtime code. - OOM Handling and Swap and Resource Accounting and Quotas define future memory-pressure and quota policy.
- Memory Authority Model defines the future cross-cutting contract for memory authority classes, residency, mapping consistency, pins, DMA boundaries, swap eligibility, and proof obligations.
Current Behavior
The frame allocator builds a bitmap from the Limine memory map, marks all non-usable frames as used, reserves frame zero, and reserves its own bitmap frames. The heap is initialized separately for kernel allocation.
Paging initialization builds a new kernel PML4, remaps kernel sections with section-specific permissions, copies upper-half mappings with NX applied and user access stripped, switches CR3, then enables page-global support. SMEP/SMAP are enabled after those mappings are active.
Each user AddressSpace owns its lower-half page tables and clones the
kernel’s upper-half mappings. Dropping an address space walks the user half and
frees mapped frames, committed anonymous frames retained behind VM_PROT_NONE,
and page-table frames. VirtualMemory lets a process reserve anonymous
address ranges, commit and decommit physical backing, unmap reservations, and
protect committed pages. Anonymous reservations charge the process virtual
reservation ledger. Committed anonymous pages charge
ResourceLedger::frame_grant_pages.
FrameAllocator allocation methods return a MemoryObject result capability,
not a physical address. The normal result payload carries the result-cap index,
and the CQE transfer-result record carries the local cap id plus
MemoryObject interface id. MemoryObject.info exposes page count and size;
MemoryObject.map maps page-aligned object ranges into the caller address
space, MemoryObject.unmap removes those borrowed mappings, and
MemoryObject.protect updates their page-table flags. Held MemoryObject caps
charge the holder’s frame_grant_pages ledger, and final CAP_OP_RELEASE or
process exit frees the owned frames once no borrowed address-space mapping still
holds the backing alive.
Design
The kernel keeps physical allocation host-testable by placing bitmap logic in
capos-lib and wrapping it with kernel HHDM access in kernel/src/mem/frame.rs.
Page-table manipulation stays in the kernel because it is architecture-specific.
ELF loading and VirtualMemory both use page-table flags to preserve W^X:
non-executable data gets NX, writable mappings are explicit, and userspace
pages must be USER_ACCESSIBLE. The CapSet and ring bootstrap pages occupy
reserved virtual pages; VirtualMemory rejects ranges that overlap either one.
User-buffer validation for process-owned buffers uses the process
AddressSpace mutex. The kernel checks that user pointers stay below the user
address limit, verifies page-table permissions for the requested read/write
access, and copies through the HHDM mapping while holding the same address-space
lock. This keeps validation and use tied to one stable page-table view. The
legacy current-CR3 validator remains only for callers that already provide an
equivalent page-table stability guarantee.
Committed VirtualMemory pages and held MemoryObject caps use the same
per-process frame-grant ledger, with quota checks before frame allocation or
mapping side effects. Anonymous reservation consumes a separate virtual page
quota, so guard ranges and Go-style sysReserve arenas do not spend physical
commit budget. Held MemoryObject caps charge for the backing they keep
reachable, and each live borrowed MemoryObject mapping reserves frame-grant
pages until it is unmapped. This prevents a process from mapping an object,
releasing the cap to drop the cap-slot charge, and keeping the backing pinned
without quota. The address space records borrowed pages separately from sparse
anonymous reservations so teardown and unmap can distinguish anonymous pages
from object-backed pages. Future file/network/DMA resources should reuse that
authority ledger instead of adding one-off counters per cap.
Invariants
- Frame addresses are 4 KiB aligned.
- The frame bitmap’s own frames are never returned as free frames.
- Upper-half kernel mappings are not user-accessible.
- Kernel text is RX, rodata is read-only NX, and data/bss are RW NX.
- User address spaces own only lower-half page-table frames.
- Process frame-grant usage covers committed anonymous VM pages, held
MemoryObjectcaps, and live borrowedMemoryObjectmappings. - Process virtual-reservation usage covers reserved anonymous VM pages whether or not they are committed.
- Committed
VM_PROT_NONEpages retain their frames and data while exposing no present user PTE; reserved uncommitted pages consume no frame-grant quota. - Object-backed user mappings are tracked as borrowed pages and hold the
MemoryObjectbacking alive until unmapped or address-space teardown. MemoryObjectunmap/protect only succeeds for borrowed pages backed by the same object.VirtualMemorycaps are bound to one address space and are not valid cross-process service exports.- CapSet is read-only/no-execute; ring is writable/no-execute.
VirtualMemorycannot reserve, map, commit, decommit, unmap, or protect the ring or CapSet pages.VirtualMemorycommit/decommit/protect/unmap only succeeds for ranges covered by anonymous reservations owned by the cap’s address space.- Capability-ring CALL/RECV/RETURN buffers, transfer descriptors, process and
thread wait completions, and private ParkSpace word reads must validate and
copy/read while holding the target process
AddressSpacelock.
Code Map
capos-lib/src/frame_bitmap.rs- host-testable physical frame bitmap core.capos-lib/src/cap_table.rs- capability holds and per-processResourceLedgerframe-grant accounting.capos-lib/src/frame_ledger.rs- bounded frame-grant helper retained for host tests.kernel/src/mem/frame.rs- Limine memory-map integration and global frame allocator wrapper.kernel/src/mem/heap.rs- kernel heap setup.kernel/src/mem/paging.rs- kernel remap,AddressSpace, page mapping, VM-cap page tracking, user copy helpers.kernel/src/mem/validate.rs- user-address bounds and legacy current-CR3 validation helper.kernel/src/cap/frame_alloc.rs- FrameAllocator capability and cleanup.demos/memoryobject-shared-parent/anddemos/memoryobject-shared-child/- QEMU shared MemoryObject smoke.tools/qemu-memoryobject-shared-smoke.sh- transcript checks for the shared MemoryObject smoke.kernel/src/cap/virtual_memory.rs- VirtualMemory capability.kernel/src/spawn.rs- ELF, stack, and TLS user mappings.kernel/src/arch/x86_64/smap.rs- SMEP/SMAP setup and legacy direct user access guard.
Validation
cargo test-libcovers frame bitmap, frame ledger, ELF parser, and cap-table pure logic.cargo miri-libruns host-testablecapos-libtests under Miri when installed.make kani-libproves the bounded mandatory frame-bitmap, stale-handle, cap-slot/frame-grant accounting, and transfer preflight fail-closed invariants when Kani is installed.make run-smokevalidates ELF mapping, process teardown, TLS, and clean shell-led halt.make run-spawnvalidates MemoryObject-backed FrameAllocator cleanup, VirtualMemory reserve/commit/decommit/VM_PROT_NONE/quota/release smoke, and runtime spawn checks.make run-memoryobject-sharedvalidates a parent allocating and mapping a MemoryObject, transferring it to a child, observing a child write through the same backing pages, unmapping both sides, and halting cleanly.make run-ipc-zerocopyvalidates the multi-message shared point-to-point buffer pattern at the substrate level: a producer transfers oneMemoryObjectto the consumer and then exchanges four record payloads through the shared mapping while endpoint CALLs carry only sequence numbers and checksums. This is a substrate proof, not the production data-plane shape: typedSharedBufferwith explicit producer/consumer ring metadata, notification primitives, and consuming service APIs (File.readBuf,BlockDevice.readBlocks, NIC RX/TX rings) are tracked under Open Work.make run-spawnvalidates ELF load failure rollback and frame exhaustion handling throughProcessSpawner.
Open Work
- Extend frame-grant accounting only if future DMA pinning or service-owned shared-buffer pools need authority beyond held MemoryObject caps and live borrowed mappings.
- Define page-pinning or mapping-identity rules for future shared WaitSet, DMA, and service-owned shared-buffer paths that must keep physical backing stable beyond a single locked copy/read. The owning planning track is Memory Authority Model.
- Add file, block, network, and DMA service APIs that use MemoryObject-backed SharedBuffer caps for zero-copy data paths.
- Add DMA isolation and device memory capability boundaries before userspace drivers.
- Add huge-page handling only with explicit ownership and teardown rules.
Scheduling
Scheduling decides which thread runs, preserves CPU state across preemption and blocking, and integrates capability-ring progress with process-owned execution resources.
Current Behavior
The scheduler stores shared process/thread metadata in
Scheduler::processes: BTreeMap<Pid, Process>. Dispatch-owned runnable state
lives in SchedulerDispatch: a per-CPU run_queues: [VecDeque<ThreadRef>; SCHEDULER_CPUS] array ordered ascending by Thread.virtual_finish_ns,
per-CPU current and handoff_current slots, idle-thread slots, the
direct-IPC target preference, run-queue reservation accounting, and
deferred drop/stack release slots.
Each live thread has at most one queued owner across all per-CPU queues
combined, and every per-CPU queue reserves capacity up to the live
runnable-capable thread count before a new thread is published as
runnable, so later timer, unblock, requeue, and steal-requeue paths do
not allocate. The shared live-reservation count is released when
processes or threads exit or when pre-publication reservation is rolled
back. Reserving each queue to the full live-thread count is required
because the bounded steal path may migrate every live thread into a
single sibling queue between two scheduler passes.
Phase D accepted its Task 6 diagnostic closeout at commit 77caafc0
(2026-05-10 19:39 UTC, docs(scheduler): record phase d thread-scale gate)
and closed in docs commit 1a08ec23 (2026-05-10 21:47 UTC,
docs(scheduler): close phase d). The accepted
state is the WFQ scheduler described here: per-thread weights and
latency classes are mutated only through SchedulingPolicyCap, each
per-CPU runnable queue is ordered by freshly derived
virtual_finish_ns, migration preserves virtual_runtime_ns, and
bounded stealing selects the most-overdue runnable sibling candidate.
The controlled Task 6 benchmark pair on capos-bench recorded capOS
1-to-4 work/total speedups 3.088x / 2.700x versus the previous
single-global-queue baseline 1.566x / 1.538x; the matching Linux
pthread baseline on the same host and physical-core logical CPUs
0,1,2,3 recorded 3.974x / 3.850x. The host harness enforced the
configured 1-to-2 work/total gates; the 1-to-4 row was manually accepted from
recorded diagnostics. Phase E
SchedulingContext is the next scheduler authority phase; EEVDF is a
follow-on ordering-policy evaluation rather than a Phase D blocker.
Phase D Task 3 (2026-05-07) restored the per-CPU runnable queues that
the 2026-05-02 collapse retired and gave them the WFQ ordering Task 2’s
virtual_finish_ns was prepared for. Newly created processes and
threads publish onto the creating scheduler CPU’s per-CPU queue; the
bounded steal path balances the queues when other CPUs run out of local
work. The publish-time placement is intentionally simple in this slice
— “place locally, let steal balance” — and a more sophisticated
caller-aware spread or least-loaded scan is a milestone-gate follow-up,
not a Task 3 acceptance requirement. Wake policy carries
WakePolicy::QueueCpu(u32) for endpoint, timer, park, process-wait,
thread-join, and process-spawn completions so the wake target matches
the queue placement, and DirectTarget keeps its original direct-IPC
handoff role. The transitional CAPOS_SCHED_DISABLE_WFQ=1 /
WakePolicy::QueueAny fallback has been removed before Phase E
SchedulingContext schema work.
wake_idle_scheduler_cpus_locked first probes the placement target
when the policy is QueueCpu, then walks eligible idle scheduler CPUs
and wakes the first that accepts a fresh reschedule IPI, skipping CPUs
that already have a pending IPI so a burst of ready work cross-wakes
more than one neighbor instead of stranding the rest behind one
already-targeted CPU.
Ring SQ Consumer Ownership
Each ring endpoint has kernel-owned SQ-consumer metadata outside the writable
userspace ring page. cap_enter and the bounded timer-side current-thread ring
service both acquire a syscall-mode owner lease before calling
process_ring(). The lease carries a nonzero generation and owner identity;
process_ring() verifies that generation before flushing deferred ring work or
advancing SQ head, and stale owners return StaleSqConsumer without consuming
the head SQE. Duplicate owners fail closed as a retryable busy cap_enter
status.
CQ publication remains independent of SQ ownership. Already accepted completions stay visible through CQ head/tail even after the SQ owner releases, and thread/process teardown releases any live SQ owner before ring unmapping or record drop without clearing accepted CQEs.
Bounded SQPOLL ring mode
Phase F adds a bounded SQPOLL mode for the caller thread’s ring through
CpuIsolationLease with allowedMode = kernelSqpoll and namedRing = callerThread. The transition is explicit: syscall-owned dispatch may request
SQPOLL start while it still owns the SQ, then releases its generation-checked
owner; the poller finalizes into SqpollRunning, may publish
NEED_WAKEUP and enter SqpollSleeping, wakes back to running when a producer
publishes a new SQ tail, and stops or rolls back on lease revoke, cap release,
teardown, or failed start. Timer-side syscall-mode ring service fails closed
while SQPOLL owns the same endpoint, so no second SQ consumer can advance the
SQ head.
The Phase F poller runs from the periodic scheduler service path and from a
bounded current-thread syscall service entry used for SQPOLL producer wakes and
explicit syscall kicks. Both entries borrow the SQPOLL owner lease rather than
acquiring syscall SQ ownership. The current default admits two SQEs per
selected SQPOLL worker, and a worker is not reselected again in the same
periodic service pass or syscall service entry. Poller elapsed time is charged
to the admitted scheduler ledger or scheduling-context target. The wake/sleep
protocol uses a shared ring flag: the poller
publishes NEED_WAKEUP, performs a full ordering barrier, and rechecks SQ
tail before sleeping; producers publish initialized SQEs, store SQ tail with a
barrier, and enter the kernel if NEED_WAKEUP is visible. A cap_enter
producer wake that finds SQPOLL already owns SQ head can run one bounded SQPOLL
batch, return visible CQ availability when the requested threshold is
satisfied, preserve ordinary blocked-current-thread and thread-owned-head
results, and otherwise fail closed as a retryable busy result. Stale owner
generations fail before deferred ring work or SQE start. If teardown requests
stop after a live owner has already accepted a SQE, the poller still publishes
SQ head for that accepted SQE before releasing ownership, preserving accepted
CQEs without leaving work replayable by syscall mode. The focused
make run-scheduler-generic-sqpoll-nohz proof admits this explicit
ring-coupled shape into SQPOLL nohz, drives producer wake and bounded service
progress without depending on a periodic tick, then rolls back on stale
owner/lease revoke. Policy-service automatic nohz, broader
userspace-poller/device-queue admission, and production realtime admission
remain future work.
Per-CPU run queue ordering structure
Each per-CPU VecDeque<ThreadRef> is kept ordered ascending by
Thread.virtual_finish_ns. Enqueue performs an ordered insert via a
linear scan from the front; selection scans the queue by index for
the first destination-Runnable entry (via
pop_first_runnable_local_locked), removes Drop entries it walks
past, and leaves RetryLater entries undisturbed for the next
scheduler pass. Because the queue is ordered ascending, the
first Runnable hit is also the lowest-virtual_finish_ns candidate
the destination CPU can accept (the most overdue against fair share
that this CPU is allowed to run). Linear-scan insert is O(n) per
enqueue;
with SCHEDULER_CPUS = 4 and bounded thread counts in this slice the
constant is small enough to defer a smarter structure (sorted bucket
arrays, intrusive trees) until benchmark evidence shows it dominates
scheduler-lock hold time. Promoting to a smarter structure is a
follow-up under this plan if the Task 6 milestone gate proves the
need.
virtual_finish_ns is recomputed on every enqueue from the thread’s
current virtual_runtime_ns, weight, and latency_class; it is
never carried as committed state across blocking, and migrations
between per-CPU queues recompute it at the destination so the
destination’s view of fair-share progress applies. The derivation rule
per latency class is documented in capos-abi/src/scheduler.rs and
the “Latency-class semantics for Phase D” section of
docs/proposals/scheduler-evolution-proposal.md.
Bounded steal path
When a CPU’s local queue has no immediately runnable entry the
scheduler walks sibling per-CPU queues. For each sibling queue the
scan walks indices ascending and selects that queue’s first entry
that the destination CPU considers Runnable; because each queue is
ordered ascending by virtual_finish_ns, the first Runnable hit is
also the lowest virtual_finish_ns candidate available to the
destination on that source queue. The steal then picks the source
queue whose first-Runnable candidate has the lowest
virtual_finish_ns overall, with ties broken by lower CPU id. The
chosen entry is removed from its current position in the source
queue (not necessarily the head: a RetryLater or single-CPU-owner
thread may sit at the source’s front and stay there), the WFQ tag is
recomputed at the destination, and the entry is inserted at the
destination’s ordered position. The destination queue is reserved to
the full live-thread count, so the steal-requeue is allocation-free.
The scan walks at most SCHEDULER_CPUS * max_queue_len
entries, but in practice each sibling scan stops at the first
Runnable candidate per queue.
RetryLater semantics in the local scan
The local pop scan walks the per-CPU queue by index instead of
popping the front and re-pushing RetryLater candidates. Re-pushing a
RetryLater entry whose virtual_finish_ns has not changed would
ordered-insert it back at the same head position, so a naive
pop-then-requeue loop would re-pop the same RetryLater head every
iteration and starve runnable entries behind it. The index scan
removes Drop entries in place, leaves RetryLater entries undisturbed
for the next scheduler pass to re-evaluate, and returns the first
Runnable candidate it finds. The bounded steal path uses the same
index scan on the destination queue after a steal so a stolen
RetryLater entry does not get re-popped in the same dispatch pass.
Phase E preflight fallback cleanup
The one-bisect-cycle CAPOS_SCHED_DISABLE_WFQ=1 opt-out has been
removed. Enqueues always target the selected per-CPU WFQ queue, and
wake-up sites always carry WakePolicy::QueueCpu(slot) for queued
work. Phase E SchedulingContext work therefore starts from the
accepted Phase D WFQ behavior rather than from a source-level
single-global-queue fallback.
Phase E Task 1: scheduling-context object shape
The first SchedulingContext slice is info-only: schema, config,
runtime, and kernel code expose SchedulingContext.info() and a
bootstrap grant shape, but no dispatcher enforcement, replenishment,
donation/return, depletion notification, realtime island, SQPOLL, or
nohz behavior. SchedulingContextSpec.cpuMask uses the canonical
little-endian bitset defined in schema/capos.capnp: CPU n maps to
bit n % 8 of byte n / 8, with bit 0 as the least-significant bit
of that byte. Empty data means no CPUs are selected rather than all
CPUs. Producers omit trailing zero bytes, so the all-zero set’s
canonical form is empty and any non-empty canonical mask ends with a
nonzero byte.
Phase E Task 2: bind, revoke, and generation identity
The second SchedulingContext slice adds the first bounded authority
lifecycle. SchedulingContext.create()
creates a same-interface result cap for a validated spec, bindCallerThread()
records one caller-thread binding for the current context generation, and
revoke() advances the generation and clears the matching thread metadata
binding. Bootstrap-granted contexts and contexts returned by create() use the
same non-wrapping context-id allocator; the binding identity remains
(contextId, generation), but distinct cap objects no longer share bootstrap
ids. Stale caps report staleGeneration and cannot create, bind, or revoke
scheduler metadata for a new generation; already-revoked contexts report
revoked. Release cleanup clears only a thread metadata binding that matches
the released cap identity.
Phase E: SchedulingContext budget enforcement
make run-scheduling-context is the focused Phase E QEMU proof. It
starts one process with two independently granted bootstrap contexts, verifies
their identities cannot alias, adopts a created result cap, drives bind/revoke
and stale-generation calls, confirms release cleanup by rebinding after the
released cap drops, and now checks the first dispatcher budget behavior.
bindCallerThread() installs a fixed budget ledger in the caller thread’s
scheduler metadata. Runtime charge decrements that ledger at the same
scheduler-lock-contained points that update per-thread runtime/vruntime.
Runnable selection replenishes elapsed periods and treats exhausted bound
contexts as RetryLater until their next period, leaving the queued owner in
place rather than allocating or moving emergency-path state. Stale or revoked
contexts still fail closed before mutating scheduler metadata or accounting.
The current enforcement granularity is the existing periodic scheduler tick:
a running thread may overshoot its budget by the current tick quantum before
the next dispatch charge throttles it. The smoke therefore proves bounded
dispatcher behavior, not nohz/SQPOLL activation or hard realtime admission. It
prints dispatch_effect=budgetEnforced, visible budget charge, replenishment
to full budget after a period, and a throttled wall-clock window.
Phase F: CpuIsolationLease and automatic nohz activation
CpuIsolationLease is a separate authority surface from
SchedulingContext CPU-time budget enforcement. The scaffold records owner
identity, allowed CPU set, allowed isolation mode, live accounting target
reference, housekeeping exclusions, maximum revocation latency, and generation
identity. It rejects stale generations, duplicate or overlapping active leases,
fabricated or stale SchedulingContext accounting targets, malformed CPU masks,
and lease sets that would leave no online scheduler housekeeping CPU outside
the globally admitted active lease CPUs.
The scheduler-side preflight reports a bounded nohz activation/deactivation
decision surface: lease identity, target CPU mask, target runnable entity
count, active housekeeping CPU availability after subtracting all active lease
CPUs, selected housekeeping CPU mask, deferred cleanup, timer/deadline,
network polling, IRQ-affinity, accounting-target, monotonic
clocksource/accounting readiness, one-SQ-consumer, revocation latency,
rollback, and periodic-fallback labels. The accepted QEMU proof uses -smp 4
so an active lease can report ready housekeeping CPUs outside the target CPU,
selected housekeeping placement, and exactly one runnable caller on that
target CPU.
The clockevent/deadline substrate uses a calibrated TSC-backed monotonic
clocksource on normal QEMU/x86_64, with the periodic LAPIC tick disciplining
the TSC epoch so QEMU guest halt windows cannot stall wall-clock progress.
Timer.sleep, finite cap_enter, and park timeouts store absolute monotonic
deadline_ns values, and the LAPIC clockevent backend can program a bounded
one-shot deadline and restore periodic mode.
Automatic nohz activation state machine
When the preflight finds every proof obligation satisfied – a single
runnable entity on the target CPU, a ready housekeeping CPU outside the lease,
no local deferred-cleanup/timer dependency, a valid accounting target, a live
monotonic clocksource, a non-stale one-SQ-consumer when a ring is named, a
bounded revocation latency, and the lease’s allowedCpuMask naming exactly
one scheduler-owned CPU – it performs real per-CPU periodic-tick
suppression for that narrow single-runnable window. The target CPU may be
the CPU running the preflight call (local activation) or a different
scheduler CPU (remote-CPU activation via a reschedule IPI – see Remote-CPU
activation below). The single-runnable shape differs by target: a local
activation requires the caller itself to be that single entity
(exactly-one-runnable-caller); a remote activation requires the target
CPU’s single runnable entity to be some thread pinned there, not the caller
(which runs on a different CPU – exactly-one-runnable-remote-target).
- Admission gates. Two lease shapes can be admitted for tick suppression:
a pure
namedRing = nonecompute lease, and a ring-coupledallowedMode = kernelSqpolllease whose bound ring is being actively driven by a live SQPOLL consumer.- Compute lease (
namedRing = none). Declares no local network/IRQ dependency, so the read-only network-polling and IRQ-affinity admission gates pass. - Ring-coupled SQPOLL lease (
allowedMode = kernelSqpoll,namedRing = callerThread). The lease’s declared kernel-polled work IS the bounded SQPOLL ring poller, which the scheduler keeps progressing throughcap_enter/producer-wake even while the periodic tick is masked. The preflight admits it only when the bound ring is in SQPOLL running/sleeping mode with a non-staleSqpollowner; the one-SQ-consumer label is thenblocked-sqpoll-owner(the worker owns the ring). The preflight ring-state read is a best-effort hint – it never takes the per-ring lock inside the scheduler lock (it usestry_lock, and a contended snapshot does not admit activation). The decisive disqualifier is the IPI/timer re-check below. - A
namedRing = callerThreadlease that is notkernelSqpoll(compute-with-ring) keeps the conservative refusal until network polling and IRQ affinity are routed to a housekeeping CPU, as does any device-owning mode. The kernel still services virtio RX/TX andInterruptwaiters inline from the periodic scheduler path.
- Compute lease (
- Activate. The preflight masks the periodic LAPIC timer on the current
CPU and arms a one-shot deadline at
min(nearest pending timer wakeup, now + max revocation latency). The CPU now runs on a bounded one-shot deadline instead of the periodic tick. The eligible lease generation is registered so revoke/cleanup paths can stale it. - Re-check. On every timer interrupt and on every reschedule IPI the
handler re-checks the activation window before the scheduler picks the next
thread. The reschedule-IPI handler also drains any pending remote-CPU
activation request parked for this CPU (the IPI vector is shared with the
remote-activation path – see Remote-CPU activation below), and the
periodic timer handler drains it too as a backstop.
An unchanged eligible window re-arms the bounded one-shot deadline;
a reschedule IPI (the prompt signal that another CPU woke runnable work onto
this CPU) drives an immediate rollback. The re-check runs in interrupt
context and uses
try_lockto avoid deadlocking against a held scheduler lock. Armed-timer invariant: the masked-periodic one-shot does not auto-rearm, so a timer-interrupt re-check NEVER returns leaving a tickless CPU without an armed timer – on scheduler-lock contention it arms a bounded minimum-delta fallback one-shot (or restores the periodic tick) before returning. A lock-free per-CPUnohz-activebitmask lets the contention path distinguish a tickless CPU (the consumed timer was the nohz one-shot and must be replaced) from a normal CPU (the periodic tick auto-rearms). A reschedule IPI does not consume the one-shot, so its contention skip is safe – the still-armed one-shot bounds the next re-check. - Rollback. Any disqualifying change rolls the CPU back to the periodic
LAPIC tick first, before any further ordinary work: a stale lease
generation (explicit revoke, process exit, service replacement, session
logout), a second runnable entity or stealable sibling work on the target
CPU, a local deferred-cleanup dependency, a direct-IPC target becoming
runnable, a target-CPU mismatch, or a one-shot backend that can no longer
arm a deadline. For a ring-coupled SQPOLL activation the re-check also
carries a
sqpoll-ring-mode-changed-or-owner-staleddisqualifier (the bound ring leaving SQPOLL running/sleeping mode or its owner staling); that re-check runs under the scheduler lock and usestry_lockon the per-ring lock, so a contended ring is treated as disqualifying (fail-closed – restore the periodic tick rather than keep a CPU tickless on an unverifiable ring). That SQPOLL ring-mode branch is defense-in-depth, currently subsumed by lease-generation staling: every reachable SQPOLL-stop path today (stop_sqpoll_for_lease/stop_sqpoll_if_owned) is a revoke/cleanup-path caller that also stales the lease, andstale-lease-generationis checked first – so the lease-generation stale is the load-bearing SQPOLL rollback trigger in practice. The SQPOLL ring-mode branch becomes independently load-bearing, and would then need its own proof, only if a future change introduces a SQPOLL-stop path that keeps the lease live. Runtime accounting stays boundary/counter driven and monotonic, so suppressing the tick never strandsSchedulingContextbudget charging.
Remote-CPU activation
Masking the periodic LAPIC tick and arming the one-shot deadline are per-CPU
operations – only the target CPU can program its own LAPIC timer. When the
preflight runs on CPU A but the lease’s single-CPU allowedCpuMask targets a
different CPU B, the kernel does not refuse: it parks a bounded
remote-activation request in CPU B’s per-CPU slot and sends a
reschedule-style IPI to CPU B. CPU B drains the request from its IPI handler
(and from its periodic timer handler as a backstop), re-runs the full
disqualification check locally under its own scheduler-lock acquisition,
and only then arms its own one-shot deadline. A remote activation is never
trusted blind – the preflight’s eligibility snapshot was taken on a
different CPU and may be stale by the time the IPI is drained, so the target
CPU re-checks before committing. The relevant invariants:
- Bounded request slot, no nesting. The pending-request store is a fixed
[Option<_>; SCHEDULER_CPUS]array – one single-entry slot per CPU, so it can never grow unbounded. If a slot already holds an undrained request, a new preflight fails closed (rejected) rather than queuing behind it. The IPI-context drain never nests the scheduler lock: it takes only the small per-CPU slot mutex, then calls the activation intry_lockmode. - Contention retry. If the IPI-context drain finds the scheduler lock contended, it leaves the request parked and returns; the target CPU’s next periodic timer tick (still live – the tick has not been suppressed) retries the drain. Progress is bounded by the periodic tick the same way the existing local re-check contention path is.
- Fail-closed IPI ordering. A remote rollback
(
rollback_nohz_for_lease) stales the lease generation before clearing the activation record. The drain re-checks the generation before arming, so a rollback that races the drain fails closed (the request is dropped, the periodic tick stays live). If the drain already committed before the rollback cleared the record, the target CPU’s nextnohz_rechecksees thenohz-activebit set with no record and restores its periodic tick. Either ordering converges on the periodic tick. - Compute-only. Remote-CPU activation is limited to
namedRing = nonecompute leases in this slice. A ring-coupled SQPOLL lease whose target differs from its ring owner’s CPU is not an admitted shape; it fails closed.
Generic full-nohz admission for ordinary budgeted compute threads is available
only through an explicit SchedulingContext-targeted compute lease and the same
fail-closed placement gates described above. The SQPOLL nohz state machine now
admits explicitly leased caller-thread rings when the SQPOLL worker is live,
single-consumer, and bounded by producer wake/deadline rollback. Broader
userspace-poller/device-queue admission, automatic CPU-isolation issuance, and
production realtime island admission remain future work; auto_nohz stays
disabled. Timeout-based auto-revoke landed 2026-05-30 15:22 UTC: a CpuIsolationLease
created with leaseLifetimeNs > 0 records an absolute expiry deadline,
auto-revokes through the existing generation-advancing cleanup on first
observation past it (reason=lease-expired), and the nohz activation record
carries the lifetime deadline so a tickless CPU rolls back at the next
timer/IPI recheck (lease-lifetime-expired disqualifier), bounded by
maxRevocationLatencyNs. A leaseLifetimeNs of 0 preserves the prior
revoke/cleanup-only lifecycle. The current
SQPOLL-driven activation is the bounded case: tick suppression for a
ring-coupled kernelSqpoll lease on the CPU running the preflight, rolled
back through lease-generation staling on revoke/cleanup, with the SQPOLL
ring-state re-check as defense-in-depth for any future SQPOLL-stop path that
does not stale the lease.
Lease revocation and cleanup are generation-aware. Explicit revoke, process
exit, service replacement through process termination, and session logout stale
the matching generation so old caps cannot keep isolation eligibility alive,
and rolling the matching lease’s active nohz window back to the periodic tick
is part of the same cleanup path.
make run-scheduler-cpu-isolation-lease is the broad QEMU proof for grant,
info, revoke, cleanup, real nohz activation and fail-closed rollback, bounded
SQPOLL start/sleep/stop, rollback labels, generic full-nohz, and SQPOLL nohz.
make run-scheduler-generic-sqpoll-nohz is the focused SQPOLL proof for
eligible ring admission, producer wake, SQPOLL service, rollback, and stale
owner rejection.
Phase E: endpoint donation and return
Synchronous endpoint delivery now carries a bounded internal donation token
when a caller thread with a bound active SchedulingContext delivers a CALL
to a receiver thread that has no scheduling context of its own. Donation is
strictly passive-server shaped: receivers that already have a scheduling
context keep their own authority, unbound callers donate nothing, and callers
that receive a donation token are blocked from returning to userspace until
the in-flight endpoint call returns or is canceled.
At delivery, the scheduler charges pre-donation caller runtime before moving
the context ledger to the receiver. While the receiver handles the endpoint
message, normal dispatcher runtime charging decrements the donated context.
When endpoint RETURN commits the caller completion, the scheduler first charges
receiver runtime since dispatch, then returns the remaining budget and
next-replenishment state to the caller’s thread metadata and rebinds the
SchedulingContext record to the caller. Return preflight failures leave the
in-flight donation in place, while application-exception RETURN,
invalid-result RETURN errors, delivery failure, return cancellation, endpoint
teardown, process/thread exit, and stale-caller cleanup return or clear the
donation before waking the caller and without allocating new emergency-path
storage. Nested donation of an already donated context is rejected; supporting
stacked donation is deferred until it has an explicit return-token stack
design.
make run-scheduling-context proves the behavior with a same-process endpoint
round trip. The caller binds a fresh context, burns CPU immediately before
CALL, the passive server burns CPU while servicing the endpoint CALL and again
immediately before RETURN, and after RETURN the caller observes the reduced
budget restored. The same smoke covers application-exception RETURN,
oversized-result RETURN under donation, and deterministic rejection of
A-to-B-to-C nested donation. It also submits a delivered donated CALL and then
uses cap_enter(0, 0) while the server delays RETURN, proving the donor cannot
continue outside the donated ledger. A fast-return variant covers the race where
the receiver returns before the caller commits to the donation-blocked scheduler
state. The smoke prints endpoint_donation=ok, endpoint_return=ok,
endpoint_exception_return=ok,
endpoint_invalid_return=ok, endpoint_nested_rejected=ok,
endpoint_donor_block=ok, endpoint_donor_fast=ok,
endpoint_donation_server, endpoint_donation_after,
endpoint_exception_return_after, endpoint_invalid_return_after,
endpoint_nested_after, endpoint_donor_block_elapsed_ns,
endpoint_donor_block_after, endpoint_donor_fast_elapsed_ns, and
endpoint_donor_fast_after.
Phase E: SchedulingContext notifications
Every SchedulingContext now owns fixed notification storage allocated at
context creation or bootstrap. The storage has two coalescing slots:
budgetDepleted and deadlineOrTimeout. Each slot records context
id/generation, a saturating sequence, a saturating coalesced-event count, the
last holder thread, remaining budget, the next replenishment/deadline
timestamp, and whether the holder was using an endpoint-donated context.
Runtime charge records depletion when remaining budget transitions to zero and
records deadline/timeout expiry against the same context generation. Failed
bind attempts do not arm a new budget/deadline window.
SchedulingContext.drainNotifications() returns typed observer results:
ok drains the matching fixed cells, revoked reports the current revoked
generation, and staleGeneration reports an old observer generation without
draining the current record. Explicit revoke() records an explicitRevoke
lifecycle event. These notifications explain already-enforced scheduler state;
they do not donate budget, reorder runnable entities, bypass throttling,
publish result caps, append unbounded queues, allocate on scheduler hard paths,
or imply auto-nohz/SQPOLL/tickless behavior. A pre-armed observer waiter/wakeup
path remains a future extension.
make run-scheduling-context proves the notification slice by repeatedly
draining a depleted context after coalescing, observing deadline expiry,
recording explicit revoke and stale-observer labels, and confirming that
endpoint-donated runtime records notification state on the donated context. The
smoke prints notification_coalescing=ok, deadline_notification=ok,
revoke_notification=explicitRevoke, stale_notification=staleGeneration,
and endpoint_donated_notification=ok.
Phase E: session logout lifecycle hook
UserSession.logout() now notifies the scheduler after the session liveness
cell transitions from live to logged out. That covers explicit
UserSession.logout() calls, including the remote DTO gateway logout command
and connection-teardown path because those paths already call the same kernel
UserSession.logout() method. The hook scans scheduler-owned process/thread
metadata for live processes whose immutable SessionContext shares the logged
out liveness cell, removes each non-donated matching thread binding from the
scheduler ledger, and asks the bound SchedulingContext record to advance its
generation and mark itself revoked. Old ordinary SchedulingContext grants
therefore report stale generation through info() with zero visible remaining
budget and InfoOnlyNoDispatchChange. The focused session-context smoke also
proves stale bindCallerThread() does not rebind, stale create() does not
publish a result cap, stale revoke() does not mutate the current metadata
generation, and stale notification draining reports a stale observer result.
The hook intentionally does not use session code as a second scheduling-context
ledger: session lifecycle code only flips liveness and notifies the scheduler,
and the scheduler owns the scan and binding removal. The scan takes one binding
at a time under the scheduler lock, drops that lock, then calls the
SchedulingContextExitCleanup record hook so it does not invert the existing
SchedulingContext record-lock to scheduler-lock order used by
bindCallerThread().
In-flight endpoint donation uses a conservative counted/skipped logout policy.
If the logged-out session owns a receiver thread that currently holds a
donated context, the logout hook records that the donated binding was skipped
rather than returning donor budget while the endpoint call remains in flight.
The focused session-context smoke proves the donor remains blocked in
cap_enter(0, 0) until the receiver returns, the hook reports
donation_inflight_skipped=1, and endpoint RETURN removes the receiver
binding while restoring only the reduced remaining budget to the donor. This
does not add a new logout-triggered cancellation semantic. Local owner-shell
exit now calls the held UserSession.logout() before clean shell process exit,
so the same scheduler hook observes shell logout with
stale_marked=0 donation_inflight_skipped=0 in the shell smoke. The ordinary
bound-context stale proof remains the focused session-context smoke, because
the normal shell does not hold a bound SchedulingContext. Process and thread
exit cleanup already have their own stale-context coverage and are unchanged.
Realtime islands, SQPOLL, auto-nohz, and CPU placement enforcement remain future Phase F/G work.
Phase D Task 4: migration fairness invariants
Phase D Task 4 (2026-05-08) made three migration-fairness invariants explicit:
virtual_runtime_nstravels with the thread. It lives onThread.cpu_accounting, not on a per-CPU slot, so a migration from CPU A to CPU B preserves the thread’s accumulated weighted-fair share. The accounting field was promoted out ofcfg(measure)in Task 2 and continues to advance throughcharge_runtimeregardless of which CPU charges the quantum.virtual_finish_nsis derived per enqueue, never committed. Every enqueue site – the initial publish inenqueue_ready_thread_on_slot_locked, the post-block requeue inenqueue_unblocked_thread_on_slot_locked, and the steal-insert insteal_from_sibling_queues_locked– routes throughrefresh_virtual_finish_ns_locked, which readsthread.weight,thread.latency_class, andthread.cpu_accounting.virtual_runtime_nsfresh and recomputes the WFQ ordering tag. The field is never carried as committed state across blocking and is never carried with the thread on migration; the destination CPU’s view of weight, latency class, and quantum decides the new tag.- Steal recomputes at the destination. The pop-from-source step in
steal_from_sibling_queues_lockedis followed byrefresh_virtual_finish_ns_lockedagainst the destination slot before the ordered insert, so aSchedulingPolicyCap.setWeightthat landed between source enqueue and steal takes effect at the steal itself.
Migrations counter shape
ThreadCpuAccounting.migrations is cfg(feature = "measure")-gated
and remains a benchmark-only operator-observability counter; it is
not load-bearing for ordering and is not exposed through
SchedulingPolicyCap.snapshot. Phase D Task 4 moved the increment
from the dispatch-time scheduled_measure path to two enqueue-time
arms in kernel/src/sched.rs:
- Placement-time spread (
record_placement_spread_migration_locked) fires frompush_reserved_run_queue_lockedwhen the enqueue target slot differs from the thread’s previously dispatched CPU (ThreadCpuAccounting.last_cpu). A thread that has never been dispatched (last_cpu == None) does not register a migration on first publish; otherwise placement spread is counted exactly once per enqueue. - Steal (
record_steal_migration_locked) fires fromsteal_from_sibling_queues_lockedafter the source-queue removal and before the destination-queue insert. The steal scan skips the destination slot, so the counter increments unconditionally each time the steal arm is reached.
scheduled_measure still maintains last_cpu so the placement-spread
check has the previous CPU available; only the migrations++ moved.
The pre-collapse counter shape is preserved in steady state – a
thread that runs on a different CPU than its previous run still
records exactly one migration – but the increment is now attributed
to the enqueue decision (placement spread or steal) rather than the
dispatch that follows it.
The aggregate process-wide thread_placement counter family in
kernel/src/measure.rs (migrations, migration_to_cpu0..3,
consumed by tools/qemu-thread-scale-harness.sh) is a separate
measurement device. It is incremented from
account_thread_selected_locked at dispatch time and continues to
observe “thread ran on a different CPU than its previously
dispatched CPU” rather than the per-thread Task 4 enqueue-time
shape, so the thread-scale harness regex does not need to change.
The per-thread ThreadCpuAccounting.migrations field and the
aggregate thread_placement counter intentionally measure different
events at different points in the scheduling pipeline; both stay
behind cfg(feature = "measure").
Phase H: per-thread saturation status surface
The Phase H AutoNoHz placement heuristic (a future policy-service
feature) needs to read per-thread saturation observation in the normal
dispatch build, not only under cfg(feature = "measure"). The
non-measure per-thread saturation status surface (2026-05-30)
promoted the inputs it consumes into ordinary ThreadCpuAccounting
state and exports them through SchedulingPolicyCap.snapshot @2:
voluntary_blocksandpreemptionsmoved out ofcfg(feature = "measure"). They are charged at the same sites as before –voluntary_blockswhen a thread blocks itself (cap_enter wait, park, endpoint scheduling-context donation) andpreemptionswhen the timer requeues a still-runnable running thread – so themeasurebuild’s counts are unchanged; only thecfggate was removed. A lowvoluntary_blockscount distinguishes a CPU-saturating thread from an IPC/IO-bound one.runnable_accumulated_nsis a new always-built cumulative counter of runnable-but-not-running time. It is charged at the scheduler-lock-held enqueue/select boundary:push_reserved_run_queue_lockedstamps a monotonicrunnable_since_nswhen a thread is published to a per-CPU run queue without being selected (idempotent across re-publish, so the whole runnable span is counted once), andaccount_thread_scheduledaccumulates the monotonic delta and clears the stamp when the thread is next selected. The stamp/accumulate pair nets to zero for a thread selected at the same monotonic instant it becomes runnable. The clock ismonotonic_ns()only (no wall-clock, no rewind), matchingcharge_runtime’s discipline, and the stamp respects the runnable-ownership rules above (a thread holds a live stamp only between enqueue and selection).
migrations stays measure-gated; it is a placement diagnostic, not a
saturation input. The surface exports raw cumulative counters only –
windowing, smoothing, and the saturation decision are policy-service
choices, never kernel state (see
docs/proposals/tickless-realtime-scheduling-proposal.md). Proof:
make run-thread-fairness reads the extended snapshot on the weighted
workers and asserts the CPU-bound hog reports high runtime_ns with
voluntary_blocks at or near zero while at least one preempted
lower-weight worker reports nonzero preemptions and
runnable_accumulated_ns.
Weight-change-while-enqueued contract
SchedulingPolicyCap.setWeight writes the validated weight directly
to Thread.weight through Process::set_thread_weight and does not
clear Thread.virtual_finish_ns. A weight change observed while the
thread is blocked, running, or already queued takes effect on the
next dequeue and re-enqueue because every enqueue site refreshes
virtual_finish_ns from current weight/latency_class/
virtual_runtime_ns. The kernel proves the contract two ways:
- By construction.
Process::refresh_thread_virtual_finish_nsreads each input field fresh on every call; there is no cached derivation between enqueues. The function bears a doc-comment asserting the contract. - By
debug_assert!. Inside the same function, a debug assertion verifies that the recomputedvirtual_finish_nsis at or beyond the currentvirtual_runtime_ns– a future deadline, never a past one. The assertion catches any future regression where the formula could underflow or where a stale cache could drift below the current vruntime.
The focused QEMU smoke that drives setWeight and verifies the
post-block dispatch picks up the new weight landed under Phase D
Task 5: make run-thread-fairness-weight-change (manifest
system-thread-fairness-weight-change.cue, demo
demos/thread-fairness/). Two competing child threads run a
fixed wallclock window: a baseline worker stays at
DEFAULT_WEIGHT, while a heavy worker self-calls
SchedulingPolicyCap.setWeight(weight=128) and then blocks on
Timer.sleep so it leaves the run queue before the contention
window opens. Each worker snapshots its scheduler state at wake
and at window end via SchedulingPolicyCap.snapshot, and the
parent verifies three independent properties: (1) the heavy
snapshot reads weight == 128 and the baseline snapshot reads
weight == DEFAULT_WEIGHT; (2) the observed runtime_ns ratio
matches the weight ratio inside a configured tolerance; (3) the
heavy worker’s virtual_runtime_ns advances at roughly half
the rate of its runtime_ns (vruntime/runtime ~= 0.5 for
weight=128, ~= 1.0 for DEFAULT_WEIGHT). A scheduler that
re-enqueued or dispatched the heavy worker using a stale
virtual_finish_ns derived from DEFAULT_WEIGHT would not
show the weight-proportional CPU share, and a scheduler that
held a stale weight inside charge_runtime would yield heavy
vruntime/runtime ~= 1.0 instead of ~= 0.5; the smoke trips on
either regression. The capability is bound to
CapCallContext::caller_thread (Phase D Task 2 decision), so
same-thread self-mutation is the only authorized shape for this
proof; cross-thread weight authority remains a Phase H
privileged scheduler-policy service concern.
The thread-scale benchmark was repaired before accepting the milestone. The old
1 MiB/spinning-parent shape was not a valid four-core reference because the
matching Linux pthread baseline also failed at four workers. The accepted
benchmark shape uses a blocking parent join, 262,144 blocks (16 MiB), and
work_rounds=64. The formal accepted-evidence pair is the capos-bench
2026-05-02 21:38 UTC 5-run pair pinned to physical-core logical CPUs
0,1,2,3 against main commit 374f8556: capOS work 1.883x and total
1.787x clear the configured 1.6x gates, while the matching Linux
pthread baseline records 1.988x/1.987x. Its 1-to-4 row became the
diagnostic that justified Phase D’s fair-share enqueue policy: capOS
1.566x/1.538x versus Linux 3.963x/3.858x, a clear bottleneck
in the then-current single-global-queue scheduler. Phase D’s WFQ evidence on
2026-05-10 manually accepted the recorded 1-to-4 diagnostic with capOS
3.088x/2.700x and matching Linux 3.974x/3.850x on the same host/CPU
pin set. The harness still enforced only the configured 1-to-2 work/total
speedup gates. Historical pre-collapse 1-to-2
(1.828x/1.687x) and the post-collapse 3-run diagnostic on
capos-bench 2026-05-02 10:42 UTC (1.890x/1.792x,
1.504x/1.436x) remain in docs/benchmarks.md for reference.
Four-worker capOS scaling was a follow-up rather than a completed claim
under the pre-collapse model: the unsuppressed diagnostic recorded 1-to-4
work/total speedups 3.029x/2.386x, while suppressing scheduler switch
logs recorded 3.272x/2.303x; remaining guest-measure evidence pointed at global
Scheduler lock contention plus exit/join/block/schedule overhead, and normal
scheduler-owned execution is still capped at temporary CPU slots 0-3.
Each process currently owns one or more Thread records; each thread owns its
saved CPU context, kernel stack, FS base, block state, and – since Phase D
Task 2 – the WFQ ordering inputs weight: u16, latency_class: LatencyClass,
and virtual_finish_ns: u64. The Phase D constants in
capos-abi/src/scheduler.rs set the defaults weight = DEFAULT_WEIGHT and
latency_class = LatencyClass::Normal, so unmodified workloads observe no
behavior change versus the pre-Phase-D scheduler. virtual_finish_ns is
recomputed on every enqueue (Task 2 ships the derivation; Task 3 will consume
it for ordered insertion) and is not meaningful while the thread is blocked.
Phase D Task 2 split the per-thread CPU accounting record so the WFQ-load-
bearing fields are available in the normal qemu build:
runtime_ns, virtual_runtime_ns, and last_started_ns are unconditional;
context_switches, preemptions, voluntary_blocks, migrations,
last_cpu, and the *_runtime_stable_observed and blocked/exited
bookkeeping stay behind the measure feature because they are pure
operator-observability counters that do not participate in dispatch ordering
and need a separate operator snapshot path. runtime_ns advances 1:1 with
elapsed CPU time, while virtual_runtime_ns advances by
elapsed_ns * REFERENCE_WEIGHT / weight so per-thread weight changes the
cumulative WFQ share rather than only the enqueue tag. The runtime-charge
path is invoked when a current thread stops running through timer preemption,
blocking cap_enter or park, thread/process exit, or direct switch/handoff
paths that select another current thread; the wrapping helpers in
kernel/src/sched.rs route through Process::charge_thread_runtime /
Process::account_thread_scheduled unconditionally now.
The SchedulingPolicyCap cap surface mutates these per-thread fields through
the caller-thread fallback binding selected in Phase D Task 2: every
method (setWeight, setLatencyClass, snapshot) routes to
CapCallContext::caller_thread, so a holder can only mutate or observe its
own running thread. Cross-thread or cross-process authority is reserved for
the Phase H privileged scheduler policy service. The
SchedulingPolicyCap.snapshot reply intentionally exposes only the four
fields promoted out of the measure feature gate;
context_switches/preemptions/voluntary_blocks/migrations are
benchmark-only and a future operator-observability slice may add them
through a separate cap. The BSP scheduler tick normally arrives through the
local APIC timer on vector 48 with LAPIC EOI after calibrating the LAPIC initial
count against PIT channel 2; if LAPIC setup or calibration is unavailable, the
kernel falls back to the legacy PIT/PIC IRQ0 path on vector 32. On each
user-mode timer tick (kernel-mode ticks bypass the scheduler entirely
through kernel_timer_interrupt_handler, as described under Design),
the kernel wakes timed-out or satisfied cap_enter and park waiters,
processes the current thread’s ring endpoint in timer mode, saves the
current thread context, picks the next ready thread from the single
global run queue (the earlier per-CPU local-first / steal scan was
retired with the queue collapse), switches CR3 when needed, updates
the current CPU’s kernel-entry stack through the per-CPU hook,
restores FS base, mirrors the next ThreadRef into the current
PerCpu, and returns to the next user context.
When APs are online and their LAPIC timers start, scheduler CPU slots 0-3 can
temporarily own scheduler/user execution. The earlier AP-owner proof kept the
BSP in kernel idle; the current same-process scaling slice allows sibling
threads with distinct ring endpoints to run on different scheduler CPUs while
processes that hold broad launch/authority caps or live endpoint objects
remain pinned to the legacy single-owner CPU. Additional APs beyond CPU 3 stay
in kernel idle until a later scheduler-owner policy replaces the temporary CPU
mask. The runnable queues are a per-CPU array of VecDeque<ThreadRef> shared
by the scheduler-owned CPUs under the global scheduler lock and ordered
ascending by virtual_finish_ns; process/thread metadata remains shared under
that lock. A bounded steal path migrates the most overdue sibling
candidate (each sibling queue’s first entry that the destination CPU
considers Runnable) when a CPU’s local queue has no runnable entry.
Syscall entry initializes kernel GS with swapgs, saves the user RSP through
the GS-relative PerCpu.user_rsp slot, and switches to the GS-relative
PerCpu.kernel_rsp slot. Normal syscall returns swap back before sysretq.
Blocking cap_enter, process exit, and ThreadControl.exitThread paths that
leave through scheduler iretq restore use restore_context_after_syscall so
GS ownership is returned to userspace before the next user context resumes.
Timer.sleep records a bounded scheduler waiter keyed by caller ThreadRef,
user data, and an absolute monotonic deadline_ns. Due sleeps validate the
thread generation, post an empty completion directly to the caller’s CQ, and
then flow through the same blocked cap_enter wake scan as other completions.
Each process has a separate sleep waiter quota, so one Timer holder cannot fill
the global sleep queue by itself.
ThreadControl.setFsBase validates runtime-provided FS bases as user-canonical
addresses, updates the caller thread’s saved FS base, and writes the CPU FS
base immediately when the caller is the running thread. There is no
process-global FS base; context switch treats FS base as per-thread state.
The initial thread still uses the compatibility ring at RING_VADDR, while
each spawned child thread receives a kernel-chosen ring mapping in the process
ring arena. Run queues, per-CPU current, direct IPC handoff, Timer sleep
waiters, process/terminal waiters, endpoint caller/receiver records, and
deferred cancellation CQEs store generation-checked ThreadRef values and
route completions to the target thread’s ring endpoint. Process-owned thread
and kernel-stack ledger limits are enforced by ThreadSpawner.create before
additional thread records become runnable. The frozen contract is in
In-Process Threading. Park wait uses a separate
Blocked(Park { ... }) reason and park timeout/wake completions use reserved
CQE credits before marking generation-checked waiter threads runnable. The
authority and ABI contract is in Park Authority.
cap_enter(min_complete, timeout_ns) processes pending SQEs immediately. If
the requested completion count is not available and the timeout permits
blocking, the current thread enters Blocked(CapEnter { ... }) and the syscall
entry path switches to another runnable thread.
The LAPIC user-timer path enters sched::schedule() unconditionally on
every tick. An earlier slice carried a bounded user-mode continuation
fast path with a per-CPU one-skip budget and a release/acquire
slow-path-required summary; that path has been retired (see
docs/backlog/scheduler-evolution.md “Cleanup: Retire Benchmark-Driven
Scaffolding Before Phase D”). The fast path saved at most one scheduler
entry every other tick on an uncontended single-CPU-effective scheduler
while paying for shadow-state publication on every slow-path exit, so
the simpler always-schedule shape is preferred until a future Phase D
or Phase F slice ships an evidence pair where the fast path measurably
reduces scheduler-lock hold time on a contended SMP run.
When endpoint delivery satisfies a blocked server RECV, the scheduler can set a
direct IPC target. The next scheduling decision runs that server before ordinary
round-robin work when it is ready and its ThreadRef generation still matches
the captured direct target. When the direct slot is unavailable, endpoint
completions fall back to the queued path with WakePolicy::QueueCpu(slot)
targeting the current CPU’s per-CPU queue, so the wake scan probes the placed
CPU first.
Design
The implementation keeps ring dispatch outside the global scheduler lock. Timer dispatch extracts ring/cap/scratch handles, releases the scheduler lock, processes bounded SQEs, then reacquires the scheduler lock to choose the next thread. This prevents Cap’n Proto decode, serial output, and capability method bodies from running under the global scheduler lock.
There is no longer a slow-path-required summary or a per-CPU skip
budget for the user-mode timer path. Every user-mode LAPIC timer tick
enters sched::schedule(), which services run-queue entries, direct
IPC targets, deferred process termination/drop and thread-stack
cleanup, Timer sleep waiters, and blocked threads with timer-backed
cap_enter or Park timeouts under the scheduler lock. Those timeout paths
compare absolute monotonic deadlines, but periodic ticks still decide when the
checks run. Ring SQEs and ordinary cap waiters run on the same per-tick
cadence. Kernel-mode timer ticks (e.g., on AP cores parked in the kernel idle
loop) still go through kernel_timer_interrupt_handler, which sends EOI
without entering the scheduler. The shared advance_bsp_tick helper still
increments the compatibility TICK_COUNT only on CPU 0; normal runtime
accounting and timeout comparisons use monotonic_ns() instead. Future per-CPU
fair-share slices may reintroduce a continuation path under explicit Phase D
or Phase F authority; until then the always-schedule shape keeps the
scheduler’s authority over thread metadata and runnable ownership
single-source.
The runnable queues keep a single-owner contract behind the global
scheduler lock. A live generation-checked ThreadRef may have at most
one runnable dispatch owner across per-CPU current/handoff_current
slots, the per-CPU run queues, and the single direct_ipc_target
preference slot. Blocked waiters, sleep waiters, park waiters, endpoint
state, process waiters, and join waiters are not runnable owners; they
may make a thread ready only after liveness and generation checks
succeed.
Migration between per-CPU queues is represented as a scheduler-lock-
contained transfer, not as a second published owner. The source owner
is removed or popped first and the ThreadRef is then inserted in the
destination queue at the position determined by a freshly recomputed
virtual_finish_ns, or selected as the next running thread.
virtual_runtime_ns travels with the thread; virtual_finish_ns is
recomputed at every enqueue and never carried as committed state, so
weight or class mutations applied while the thread was blocked take
effect on the next dequeue and re-enqueue. Retry paths requeue the
candidate after dropping duplicate queued copies. Direct IPC keeps its
preference slot only while the target remains live and runnable; if
the direct target cannot run immediately, it falls back through the
normal queued-owner path on the current CPU’s per-CPU queue.
Idle-to-runnable wake targeting reuses the same ownership boundary. A
thread that becomes ready through endpoint completion, timer sleep,
park wake, process wait, or thread join is pushed to the placement
target’s per-CPU run queue, and wake_idle_scheduler_cpus_locked first
probes the placement target when the policy is QueueCpu, then walks
eligible idle scheduler CPUs to wake the first that accepts a fresh
reschedule IPI; CPUs that already have a pending IPI (or that fail
LAPIC delivery) are skipped without breaking the scan, so a burst of
ready work cross-wakes more than one neighbor instead of stranding the
rest behind one already-targeted CPU. Direct IPC uses the same path.
Measurement builds expose aggregate and per-phase counters for wake
scans, eligible idle CPUs, targeted CPUs, IPIs sent, already-pending
IPI skips, not-ready target skips, missing LAPIC targets, and send
failures.
Each per-CPU run queue is reserved up to the live runnable-capable thread count before publication; the shared live reservation count is released on process/thread exit or pre-publication rollback. Reserving each queue to the full live-thread count is required because the bounded steal path may migrate every live thread into a single sibling queue between two scheduler passes. Timer preemption, unblock, direct- IPC fallback, requeue, and steal-requeue paths therefore must not allocate while the thread is already live.
Process and thread exit cleanup proves the removal side of that ownership contract at the cleanup site. After removing queued owners and clearing a matching direct IPC target, the scheduler lock remains held while the kernel scans every per-CPU runnable queue and the direct target slot; any stale exiting process or thread reference is a kernel assertion failure. The focused spawn smoke asserts the corresponding serial proof markers on exercised process and thread exit paths.
The Phase C migration order is constrained by hardware state, not only by
scheduler data structures. The first gate moved syscall entry/exit off
BSP-symbol-relative PerCpu fields and onto KernelGsBase/swapgs on user
syscall paths, including blocking cap_enter, exit, and
ThreadControl.exitThread paths that leave through iretq rather than the
normal sysretq epilogue. The second gate added xAPIC initialization, a
PIT-calibrated BSP LAPIC timer tick, LAPIC EOI routing, AP LAPIC
initialization, a LAPIC spurious-vector handler, and an IPI vector plus bounded
vector-49-only fixed IPI send primitive. The third gate added address-space
resident CPU masks, per-CPU pending full-TLB flush generations, completion
waits, and a vector-49 TLB shootdown handler for user page-table map,
unmap, and protect. The fourth gate split current-thread tracking into
per-CPU slots, registers AP PerCpu records for current-thread and syscall
stack mirrors, updates AP TSS.RSP0 on context switches, and hands the single
scheduler-owner role to AP cpu=1 when it is online with a programmed LAPIC
timer.
The LAPIC slice replaces the BSP-oriented PIT/PIC scheduler tick on supported
QEMU and hardware paths. kernel/src/arch/x86_64/idt.rs keeps vector 32 for the
PIT/PIC fallback, reserves vector 48 for LAPIC timer delivery plus vector 49 for
cross-CPU requests, and installs vector 255 for LAPIC spurious interrupts.
pic.rs can remap and mask all legacy IRQs once LAPIC ticks are active, and
context.rs sends LAPIC EOI or PIC EOI according to the active timer source.
The IPI vector now handles TLB shootdown requests and bounded reschedule
requests for AP idle-to-runnable handoff.
The TLB slice wraps user page-table mutations that can affect an address space
resident on another CPU. AddressSpace::map, AddressSpace::unmap, and
AddressSpace::protect still perform the local x86_64 mapper flush, then
call the architecture shootdown helper with the address space’s resident CPU
mask. The helper records pending full-TLB flush generations for online resident
CPUs other than the caller, sends vector-49 IPIs, and returns a completion token.
Capability handlers drop the address-space guard and enqueue completion work;
cap_enter and timer polling drain that queue after ring dispatch releases the
cap-table and scratch locks. This keeps a remote syscall that is contending on
the same process locks from blocking maskable IPI delivery forever. Capability
handlers reserve fixed-size deferred queue slots before page-table mutation, so
full queues fail closed as capability overload errors instead of surfacing after
rollback, unmap, or protect has already changed state. Drains flush the current
CPU before waiting so a CPU that is itself in the target mask cannot wait on its
own pending generation. Target CPUs drain the generation in the IPI handler, at
syscall entry, or before returning to userspace from syscall, timer, and
scheduler restore paths.
Generation counters avoid losing overlapping shootdowns while a target CPU is
already draining a prior request. This relies on kernel user-buffer access
continuing through address-space-locked HHDM copy/read helpers rather than raw
user virtual addresses while a delayed flush generation exists. Callers include
VirtualMemoryCap dispatch through parse_map, parse_unmap, and
parse_protect, plus MemoryObjectCap::{map,unmap,protect} in
kernel/src/cap/frame_alloc.rs. Scheduler CR3 handoff now marks the selected
address space resident on the current CPU, including AP cpu=1 during the AP
scheduler-owner proof.
Idle paths
There are two distinct idle paths, and both run genuine CPL0 (kernel-mode) idle. There is no user-mode idle process: when no real work is runnable a CPU runs the kernel idle code at CPL0 on the kernel PML4. The two paths differ only in how the CPU got there.
The cooperative CPL0 kernel-mode idle path is the boot/AP path. start
(BSP), start_ap (APs), and the start_current_cpu loop call
next_start_context; when that returns no real runnable work they fall into
idle_current_cpu_once, which hlts at CPL0 on the per-CPU kernel stack with
interrupts enabled (no CpuContext, no restore_context — the same way
start_current_cpu itself runs). A kernel-CPL timer tick or reschedule IPI
taken during that hlt runs the kernel-mode handler
(kernel_timer_interrupt_handler / handle_reschedule_ipi, both of which call
nohz_recheck), so the nohz one-shot deadline is preserved and re-armed across
the hlt; control then returns to the loop, which re-checks for work.
idle_current_cpu_once increments the KERNEL_IDLE_HLT_ENTRIES counter and
emits a bounded
cpu-isolation: kernel-idle hlt cpu=… idle_path=cooperative-cpl0 … nohz_active=… timer_source=…
log line so this path is observable from the kernel log; the
run-scheduler-cpu-isolation-lease smoke asserts it is reached. Once any
dispatch path restore_contexts into a real thread, the start_current_cpu
frame is abandoned.
The steady-state CPL0 idle-thread path is reached from the four
interrupt/syscall-return dispatch call sites — schedule() (timer),
capos_block_current_syscall, exit_current, and exit_current_thread. When
choose_next_locked falls through to this CPU’s idle thread, each site builds
the dispatch tuple from the per-CPU CPL0 idle-thread context. The dispatch
call sites hand a CpuContext to assembly that restore_contexts (or, for the
timer path, return a context pointer plus a CR3 the timer handler loads), so
they need a schedulable context when no real work is runnable; the CPL0 idle
context is that context.
CPL0 idle-thread context infrastructure. arch::smp::init_idle_kernel_stacks
allocates one dedicated CPL0 idle kernel stack per scheduler CPU slot from
fresh contiguous frame ranges, so they do not overlap the boot kernel stacks,
the per-thread kernel stacks, or the IST slots. CpuContext::new_cpl0_idle
builds a kernel-shaped context (kernel-code/kernel-data selectors,
rip = kernel_idle_entry, rsp into the idle kernel stack). sched::sched_init,
called from kmain, constructs and stores one CpuContext per CPU slot in
CPL0_IDLE_CONTEXTS and then calls register_idle_process_locked to seed the
slot-0 synthetic idle Process record before the scheduler runs (this
keeps the BSP idle process’s low PID and the init-process PID ordering stable);
the remaining per-CPU slots are registered lazily by
current_cpu_idle_thread_locked the first time their CPU reaches idle.
sched_init panics on OOM, as does the lazy path: the CPL0 idle contexts and
the synthetic idle records are scheduler idle infrastructure and there is no
fallback idle path, so a failure to build them is unrecoverable. The idle
kernel stack is sized as a full per-thread kernel stack
(PROCESS_THREAD_KERNEL_STACK_PAGES), not an IST slot, because
kernel_idle_entry runs the deep service_periodic_work() call chain on it
(see periodic-service parity below).
Synthetic idle process records. The idle thread is never a runnable
user-mode process. The synthetic idle Process (Process::new_idle) maps no
user code, no user stack, and no cap ring, and carries an empty cap table. It
exists only so the idle ThreadRef resolves through sched.processes and the
scheduler’s ThreadRef-centric bookkeeping — set_thread_state,
account_thread_selected_locked, current-thread tracking, and the
is_idle_thread guard predicate used pervasively across the scheduler — keeps
working unchanged. Its address_space is a bare page-table root with nothing
user-mapped; it is required by the Process struct but is never loaded as
CR3. Every idle dispatch site routes the CPU onto the kernel PML4 via the
CPL0 idle context, so the synthetic idle AddressSpace is never made resident
and never participates in resident_cpu_mask or TLB-shootdown idle-residency
handling.
Dispatch-tuple rewire. After choose_next_locked returns, when the chosen
thread is idle_threads[current_cpu_slot()], each dispatch site builds the
dispatch tuple from the CPL0 context pointer, the dedicated idle kernel stack
top, the kernel PML4 CR3, and the current FS base (no FS-base change).
sched_init builds one CPL0 idle context per scheduler CPU slot or panics, so
cpl0_idle_context(slot) is infallible at every dispatch site. The
schedule() timer path does not route through a dedicated CR3-loading
restore helper: the existing timer_interrupt_handler already loads the
tuple’s CR3 with write_cr3 before the privilege-agnostic five-element
iretq. The three syscall-path sites (capos_block_current_syscall,
exit_current, exit_current_thread) keep their
restore_context_after_syscall restore tail: they are entered via
syscall_entry (which already executed swapgs), so the exit swapgs is
required to leave the CPL0 idle thread running with the user GS base — the
same GS-base state the timer path’s CPL0 idle thread runs with. Each site emits
a distinct marker: sched: dispatch idle cpu=N idle_path=cpl0-dispatch-timer
(timer), …cpl0-dispatch-block (blocking syscall), and …cpl0-dispatch-exit
(both exit_current and exit_current_thread). debug_assert!s guard the
CPL0 dispatch tuple: context cs/ss are the kernel selectors and their RPL
bits are 0.
CPL0 idle periodic-service parity. schedule()’s timer Phase 2 runs
periodic service work on every tick — deferred process drops, pending
terminations, wake_cap_waiters, service_sqpoll_workers(),
drain_pending_endpoint_cancellations(), terminal_session::poll_input(),
virtio::poll_scheduler(), and the network / pipe / interrupt poll_waiters()
calls. A CPL0 idle thread’s timer ticks are kernel-mode and go through
kernel_timer_interrupt_handler, which never enters schedule() — so without
explicit parity handling that servicing would be stranded whenever a CPU is
parked on the CPL0 idle thread. That work is factored into a single
service_periodic_work() function with one lock discipline: the scheduler lock
is taken only for the bounded deferred-drop / thread-stack-release /
wake_cap_waiters / pending-termination extraction, then dropped before
drop_pending_process / finish_terminated_process and the lock-free poll
block. schedule() calls it after ring dispatch; kernel_idle_entry is its
own cooperative loop that, each iteration, runs service_periodic_work(), then
next_start_context(false) to re-dispatch a real runnable thread the moment
one appears (allow_idle = false so it never re-selects the idle thread), then
idle_current_cpu_once() to hlt. The re-dispatch is required: without it a
kernel-mode timer tick taken during the idle hlt returns through
kernel_timer_interrupt_handler, which does not re-enter schedule(), so the
CPU would be stranded. service_periodic_work() and next_start_context() run
with interrupts disabled in that loop — the CPL0 idle context is built
IF=1 so the periodic tick can preempt the hlt, so the loop must cli
before the deep service call; otherwise a CPL0 timer tick taken during
service_periodic_work() nests a kernel_timer_interrupt_handler frame onto
the idle kernel stack (same-privilege interrupts do not switch stacks).
idle_current_cpu_once re-enables interrupts only across its enable_and_hlt
and disables them again before returning. There is no double-service: a CPU
running a real thread gets the service block via schedule(), a CPU on the
CPL0 idle thread gets it via the kernel_idle_entry loop, and a given tick on
a given CPU is CPL3 (schedule()) xor CPL0-idle (the loop). nohz cadence stays
honest because the loop iterates at the timer/IPI cadence — when the periodic
tick is suppressed the re-armed one-shot still wakes the hlt, so
service_periodic_work() still runs.
iretq CPL0 restoration invariant and CPL0 idle-thread prerequisites
This subsection records the load-bearing x86-64 architectural invariant that any future CPL0 idle-thread context migration must satisfy, along with the prerequisites the implementation will need to meet.
Authoritative reference: Intel 64 and IA-32 Architectures Software
Developer’s Manual (SDM), Volume 2A, IRET/IRETQ instruction reference,
“Operation” pseudocode (the IF OperandSize = 64 / 64-bit-mode path), and
Volume 3A, Section 6.14.3 “Returning from an Exception or Interrupt
Procedure.” The description below applies to IRETQ in 64-bit long mode;
the legacy 32-bit IRET paths behave differently and are called out
explicitly where it matters.
iretq frame layout and the 64-bit unconditional five-element pop.
iretq in 64-bit long mode unconditionally pops five 64-bit (8-byte)
values from the top of the current kernel stack, in order: RIP, CS,
RFLAGS, RSP, SS. This is true regardless of whether the privilege
level changes — both a CPL0→CPL3 return and a CPL0→CPL0 return consume the
same five-element frame and load RSP:SS from it. AMD deliberately removed
the legacy conditional stack switch for long mode: the “skip SS:ESP on a
same-privilege return” behavior exists only in the legacy 32-bit IRET
operand-size paths, never in IRETQ.
- CPL0 → CPL3 (privilege change, ring exit): The target
CShas RPL=3, which differs from the current CPL=0. The CPU installsRIP,CS, andRFLAGSfrom the frame, then loadsRSPandSSfrom the same frame and transfers to the user-space instruction atRIPon the user stack. - CPL0 → CPL0 (same-privilege, no ring change): The target
CShas RPL=0, matching the current CPL=0.iretqstill pops all five elements: it installsRIP,CS, andRFLAGS, and also loadsRSPandSSfrom the frame, exactly as in the CPL3 case. There is no same-privilege short-circuit in 64-bit mode. The practical consequence for a CPL0 restore is the opposite of the legacy intuition: the frame’srspandssfields are load-bearing and must carry a valid kernel stack pointer and a valid RPL=0 stack selector, because the CPU will load them.
Current code. restore_context (kernel/src/arch/x86_64/context.rs
lines 311–328) sets RSP to the supplied CpuContext pointer, pops all
fifteen caller-saved and callee-saved GPRs (lines 315–327), and executes
iretq (line 328). The CpuContext struct (context.rs lines 133–155)
places rip, cs, rflags, rsp, and ss at the high end of the struct
(lines 150–154), matching the hardware interrupt-frame layout that the CPU
pushes when it enters the timer interrupt handler. The comment at line 149
(“Pushed by CPU on interrupt from Ring 3”) reflects how every CpuContext is
populated today, but the five-element iretq frame itself is not
CPL3-specific — iretq consumes the same five elements for any target CPL.
User-thread contexts. Every user-thread CpuContext is built by
Thread::new_user (kernel/src/process.rs), which sets
cs = sel.user_code.0 as u64 (RPL=3, value 0x23) and
ss = sel.user_data.0 as u64 (RPL=3, value 0x1B). Every iretq issued by
restore_context or restore_context_after_syscall into a user thread is
therefore a CPL0→CPL3 privilege change into a fully user-shaped context.
CPL0 idle contexts coexist with user contexts. The blocker for a CPL0
target is not iretq frame arithmetic: iretq pops the same five elements
for a CPL0 target as for a CPL3 target, so a frame carrying kernel selectors
and a valid kernel rsp iretqs correctly. The real requirements are in the
surrounding dispatch plumbing, all of which the CPL0 idle path satisfies:
- CR3. The dispatch call sites set
CR3to the kernel PML4 for the CPL0 idle path, not to any userAddressSpacepage table. The synthetic idleProcess’sAddressSpaceis never loaded as CR3. swapgs/ GS-base. A CPL0 idle context was never entered through thesyscallpath. Theschedule()timer path reaches it through the timer handler’s own CR3 load and the privilege-agnosticiretqtail (noswapgsin that path at all). The three syscall-path sites (capos_block_current_syscall,exit_current,exit_current_thread) keep theirrestore_context_after_syscalltail: those sites were entered viasyscall_entry(which alreadyswapgsed), so the exitswapgsis required to undo it — leaving the CPL0 idle thread running with the user GS base, the same state the timer path produces.- Kernel-code and kernel-data selectors. A CPL0
CpuContextusescs = sel.kernel_code.0 as u64(RPL=0, value0x08) andss = sel.kernel_data.0 as u64(RPL=0, value0x10). Becauseiretqloadsssunconditionally in 64-bit mode,ssmust be a valid RPL=0 stack selector; the GDT data-selector privilege checks require an RPL=0ssto be paired with an RPL=0cs, so the whole context (cs,ss,rsp, CR3, GS base) is kernel-shaped together. - Idle kernel stack. Each CPL0 idle thread has its own dedicated kernel
stack (
arch::smp::init_idle_kernel_stacks) that does not overlap any IST slot, any per-thread kernel stack, or the BSP/AP boot stacks. Becauseiretqloadsrspfrom the frame, the context’srsppoints into this dedicated stack. It is sized as a full per-thread kernel stack becausekernel_idle_entryruns the deepservice_periodic_work()call chain on it. - No user
AddressSpaceresidency. The synthetic idleProcess’sAddressSpaceis never made resident and never participates inresident_cpu_mask, so TLB shootdown never stalls waiting for an idle CPU. - No blocking, no exit. The idle thread never calls
cap_enter, parks, blocks on any waiter, or exits. TheInvariantssection entry “The idle thread must never block incap_enteror exit” carries forward unchanged.
CpuContext::new_cpl0_idle builds the kernel-shaped context,
sched::kernel_idle_entry is the entry point, and sched::sched_init wires
the per-CPU CPL0 idle contexts and seeds the slot-0 synthetic idle process
record (the remaining slots’ records are registered lazily by
current_cpu_idle_thread_locked). All four dispatch call sites — schedule(),
capos_block_current_syscall, exit_current, exit_current_thread — route
idle dispatch onto the CPL0 idle context: the timer path returns the CPL0
context pointer plus the kernel PML4 CR3 in its dispatch tuple and relies on
the existing timer_interrupt_handler CR3-load; the three syscall-path sites
keep their restore_context_after_syscall tail so the syscall-entry swapgs
is undone. The CPL0 contexts are kernel-shaped across cs, ss, rsp, and
CR3 together.
Measurement Policy
Design grounding for this policy: this document’s scheduler invariants,
docs/backlog/scheduler-evolution.md,
docs/proposals/scheduler-evolution-proposal.md,
docs/research/future-scheduler-architecture.md,
docs/research/out-of-kernel-scheduling.md,
docs/research/nohz-sqpoll-realtime.md, and
docs/research/completion-ring-threading.md. In particular,
docs/research/future-scheduler-architecture.md keeps the always-on versus
benchmark-only scheduler telemetry split as an open scheduler question, and the
current answer is intentionally conservative.
The current kernel/src/measure.rs counters are benchmark instrumentation, not
normal operator observability. They stay behind the measure feature and
CAPOS_THREAD_SCALE_GUEST_MEASURE=1 because they add atomics, cycle-counter
reads, phase bookkeeping, and in some cases sampled user RIP values to hot
scheduler, timer, TLB, ring, and serial paths. Normal QEMU and dispatch builds
must not depend on those counters being present.
The per-thread runtime-accounting ledger is split. The WFQ load-bearing core
fields, runtime_ns, virtual_runtime_ns, and last_started_ns, are
unconditional normal-build state on ThreadCpuAccounting: WFQ ordering,
SchedulingPolicyCap.snapshot, and SchedulingContext budget charging depend
on them outside cfg(feature = "measure"). The diagnostic fields
(context_switches, preemptions, voluntary_blocks, migrations,
last_cpu, blocked/exited stability probes, placement buckets, and per-phase
attribution counters) stay behind the measure feature. Permanent operator
observability is still separate work: it should expose low-rate, non-symbolic
snapshots derived from the unconditional ledger plus event counters such as
runnable queue depths or high-water marks, reschedule IPI sent/failed/pending
counts, TLB shootdown request/failure counts, and scheduler policy admission
or denial counts. Those counters must not allocate, log, read raw user PCs, or
perform cycle-timing in timer, unblock, direct-IPC fallback, requeue, or
steal-requeue paths.
Benchmark-only attribution stays in measure: per-phase thread-scale
checkpoints, guest cycle timings for ring/capnp/method/scheduler segments,
scheduler-lock wait and hold cycles, scheduler-lock site attribution, serial
byte attribution, timer-mode breakdown, CR3/TLB event totals, thread-placement
selection/migration buckets, raw user-PC samples,
logging-suppression A/B evidence, and workload/cacheline diagnostics. The
publish-placement publish/caller-aware buckets were retired with the per-CPU
run-queue collapse. Phase D shipped the fair-share enqueue policy but did not
reintroduce those placement counters.
A future branch may promote a specific event count only by adding the
normal-build storage/API and proving the same emergency-path constraints; it
should not simply remove the current cfg(feature = "measure") boundaries from
the benchmark module.
The publish-placement publish/caller-aware buckets are still retired;
Phase D Task 3 brought back per-CPU placement semantics but does not
re-emit the publish counters. Re-instate them through a separate
operator-observability slice that proves the same emergency-path
constraints, not by removing the existing cfg(feature = "measure")
boundary on the historical buckets.
Tickless idle is enabled only for true idle. A scheduler-owned CPU may mask the
periodic LAPIC tick when it is running the CPL0 idle context, has no runnable
non-idle work, has no active CpuIsolationLease nohz record, has no local
deferred cleanup, has no cap-enter polling dependency, and the one-shot
clockevent plus non-tick-derived monotonic clocksource are available. The
replacement one-shot is bounded by the nearest Timer/ParkSpace deadline or
a 100 ms idle housekeeping floor, and the scheduler restores
periodic mode before non-idle dispatch, reschedule-IPI wake, or rollback.
Cap-enter polling waiters, including the current terminal shell path, and
ready threads paused in a SchedulingContext retry window keep the periodic
tick until those dependencies move behind explicit deadlines or housekeeping
placement.
Generic full-nohz for ordinary budgeted compute threads carries the clockevent/deadline substrate into the CPU-isolation state machine and suppresses ticks only after network polling, IRQ affinity, accounting, deadline, lifetime, and rollback obligations pass. SQPOLL nohz applies the same substrate to explicitly leased caller-thread rings once the SQPOLL worker is live and the single-consumer, owner-lease, wake, and rollback gates pass. Automatic policy issuance and broader SQPOLL userspace-poller/device-queue admission remain separate later CPU-isolation features; see Tickless and Realtime Scheduling and NO_HZ, SQPOLL, and Realtime Scheduling.
Exit switches to the kernel PML4 before tearing down the exiting address space,
releases capability authority, completes process waiters, defers final process
teardown until the scheduler is running on another kernel stack, and then
releases remaining thread kernel stacks through the scheduler-owned
OffStackToken path before the Process value is dropped.
Invariants
- The idle thread must never block in
cap_enteror exit. - Ring dispatch must not hold the scheduler lock.
- Timer dispatch copies current-process user buffers through that process’s
locked
AddressSpace; it must not rely on a raw current-CR3 validate/use window. - Blocked
cap_enterwaiters wake when enough CQEs are available or their finite timeout expires. - Timer sleep waiters must be bounded per process, tied to the caller
ThreadRefgeneration, and removed when the caller process exits. - Runtime-controlled FS bases must stay in user canonical space.
- Direct IPC handoff is a scheduling preference, not a bypass of process liveness, generation, or state checks.
- The scheduler must update TSS.RSP0 and the per-CPU syscall kernel RSP
through
percpu::set_kernel_entry_stackon each switch. - Each
PerCpu.current_threadmirrors that CPU’s scheduler current slot; the scheduler lock remains the authority for current-thread and queue ownership even though dispatch/runnable state is now separate from shared process and thread metadata. - Each live
ThreadRefmay appear in the per-CPU runnable queues at most once across all queues, and every per-CPU queue’s capacity must be reserved up to the live runnable-capable thread count before a new process or thread becomes runnable. - A live generation-checked
ThreadRefmust have at most one runnable dispatch owner across per-CPUcurrent/handoff_currentslots, the per-CPU runnable queues, and the direct IPC target. - Queue migration (including the bounded steal path) must be a
scheduler-lock-contained remove-before-publish transfer; no path may
publish the same
ThreadReftwice into any queue or leave a stale direct target after exit. Migration must recomputevirtual_finish_nsat the destination and never carry the source’s WFQ tag as committed state. - Each per-CPU run queue must remain ordered ascending by
virtual_finish_nsafter every enqueue, requeue, or steal-requeue. Local selection scans the queue by index for the first destination-Runnable entry; RetryLater entries are left in place for the next scheduler pass. The bounded steal path scans each sibling queue’s indices ascending for that queue’s first Runnable-for- destination entry — because each queue is ordered ascending, the first Runnable hit per queue is the lowestvirtual_finish_nscandidate the destination can accept on that source — then picks the source queue whose first-Runnable candidate has the lowestvirtual_finish_nsglobally, with ties broken by lower CPU id. The chosen entry is removed from its actual position on the source queue (not necessarily the head). - Process and thread exit cleanup must assert, before releasing the scheduler lock, that the exiting process or thread has no remaining entry in any per-CPU runnable queue and no remaining direct IPC target slot.
- Timer, unblock, direct-IPC fallback, requeue, and steal-requeue paths must use reserved run-queue capacity and avoid allocation.
- Runtime accounting must use the normal monotonic clocksource, not benchmark-only cycle counters, and must charge only running intervals.
- FS base is saved and restored across context switches for TLS.
- Thread records remain generation-checked
ThreadRefidentities; exited records are retained only while a live handle, pending join, or unjoined status can still observe them. - The final teardown of an exiting process must not release thread kernel
stacks until another kernel stack is active, and the implicit
Thread::Droppath must not free kernel-stack frames. - A scheduler CPU must never run the same generation-checked
ThreadReftwice at once; same-process siblings may run on different scheduler CPUs only when their completions route through distinct per-thread ring endpoints. - Park waiters must be keyed by generation-checked
ThreadRefvalues, reserve one waiter CQE credit, and must not allocate in wait, wake, timeout, or process-exit cleanup paths.
Code Map
kernel/src/sched.rs- shared process table plusSchedulerDispatchownership of the per-CPU runnable queues (ordered ascending byvirtual_finish_ns), per-CPU current/handoff slots, idle-thread slots, direct IPC target, run-queue reservation accounting, pending drops, and pending stack releases; also blocking, wakeups, Timer sleep waiters, the bounded steal path, and exit.kernel/src/arch/x86_64/context.rs- CPU context layout, timer entry/restore, tick counter.kernel/src/arch/x86_64/idt.rs- timer and IPI interrupt handler wiring.kernel/src/arch/x86_64/lapic.rs- xAPIC MMIO setup, PIT-calibrated LAPIC timer, LAPIC EOI, spurious-vector handling, and fixed-IPI send primitive.kernel/src/arch/x86_64/tlb.rs- serialized vector-49 TLB shootdown request, pending flush generations, completion token, and interrupt/user-return drain path.kernel/src/arch/x86_64/pic.rsandkernel/src/arch/x86_64/pit.rs- legacy PIC remap and PIT fallback setup.kernel/src/arch/x86_64/gdt.rs- BSP/AP TSS and kernel stack storage.kernel/src/arch/x86_64/syscall.rs- blocking syscall transition forcap_enter.kernel/src/arch/x86_64/percpu.rs- per-CPU syscall stack registry, TSS.RSP0 update hook, and current thread storage.kernel/src/arch/x86_64/tls.rs- FS base save/restore.kernel/src/process.rs- process state, kernel stacks, the synthetic idle process record, and per-thread CPU accounting storage/accessors.
Validation
make run-smokevalidates timer preemption, ring fairness, direct IPC handoff, blockedcap_enterwakeups, process exit, and clean halt.make run-spawnvalidates process wait blocking and child exit completion throughProcessHandle.wait, Timer monotonic now/sleep completion throughtimer-smoke, per-process sleep quota isolation throughtimer-flood, and thread/park lifecycle behavior throughthread-lifecycle.make run-measurevalidates the post-thread park blocked/resume timing path and process exit while a park waiter is parked.cargo build --features qemuverifies QEMU-only scheduler and halt paths.- QEMU smoke output for IPC includes direct handoff diagnostics when the server is woken from a blocked RECV.
Open Work
- Prove SQPOLL/poller progress that does not depend on periodic scheduler ticks before automatic nohz activation. Then implement tickless idle only for no-runnable-work CPU idle. Keep runnable contention on periodic preemption until the activation proof closes the remaining network polling, IRQ affinity, and housekeeping dependencies.
- Keep SMP behind per-CPU scheduler state and review of any path that needs
page pinning beyond the
AddressSpace-locked copy/read contract. - Implement the remaining SMP Phase C slices: split shared scheduler metadata, replace the temporary scheduler-owner mask, and collect accepted benchmark evidence.
- Add priority or policy scheduling only after the current authority and IPC semantics remain stable.
- Add service restart policy outside the static boot graph.
Programming Languages
capOS currently supports native Rust programs that are written for the capOS userspace runtime. Other languages are design tracks, not implemented platform support. The main rule is simple: a language runtime may expose familiar APIs, but authority still comes from the process CapSet and typed capability calls.
Current Support
| Language or runtime | Status | Path |
|---|---|---|
| Rust, capOS-native | Implemented baseline | #![no_std], alloc, capos-rt, static ELF, x86_64-unknown-capos. Phase D best-effort fair scheduling closed at commit 77caafc0 (2026-05-10 19:39 UTC): per-thread weighted vruntime, per-CPU WFQ run queues, bounded steal/migration, and SchedulingPolicyCap weight/latency-class authority. |
Rust std | Not implemented | Future Rust standard-library or adapter work over capabilities |
| C | Phase 0 in tree (libcapos C-substrate v0 + libcapos-posix v0) | libcapos.a exposes capos-rt syscalls, ring CALL, CapSet lookup, a heap shim, typed Console.writeLine, Timer.now, EntropySource.fill, VirtualMemory wrappers, and native ProcessSpawner.createPipe / Pipe wrappers through extern "C"; make run-c-hello proves baseline C wrappers and make run-c-pipe proves a C binary can create a Pipe, write/read a marker, close the writer, and observe EOF without using the POSIX adapter. libcapos_posix.a adds the POSIX adapter v0 surface above libcapos: per-process static fd table (32 fds), TLS errno via __errno_location(), historical UDP socket/sendto/recvfrom/close wrappers over the retired qemu-only kernel UdpSocket cap, clock_gettime(CLOCK_MONOTONIC, ...), gettimeofday(&tv, NULL), time, nanosleep, and sleep over the kernel Timer cap, fail-closed signal stubs, pipe/read/write/dup/dup2 over the kernel Pipe cap, and fork/execve/waitpid/_exit/posix_inherit_stdio plus a direct posix_spawn successor via the recording-shim Move-grant path through ProcessSpawner.createPipe / ProcessSpawner.spawn. See the POSIX adapter row for shipped smokes; the old DNS smoke is retired until resolver networking is rebuilt on the userspace stack. |
| C++ | Future experiment | Depends on C startup, ABI choices, allocator, exceptions/RTTI policy, and a useful freestanding subset |
| Go | Future design | Custom GOOS=capos per Go Runtime proposal; a separate Phase W.8 path (docs/proposals/wasi-host-adapter-proposal.md Task 9) targets a TinyGo / upstream Go GOOS=wasip1 CUE evaluator binary that runs inside the WASI host adapter against a future ScriptPackage cap |
| Python | Future design | Native CPython or MicroPython through a POSIX-style adapter; WASI/Emscripten for sandboxed or compute-only use |
| Lua | Phase 1 in tree (L.3 deterministic memory release) | demos/lua-smoke/ runs a hand-written Lua-subset interpreter that exercises three capability-aware host bindings: console:write_line, timer:now, and L.3 memory:{alloc,write,read,size,release} over capos-rt::VirtualMemoryClient (kernel-mapped address never crosses to Lua; every byte access is bounds-checked host-side; release unmaps the exact rounded region and marks the userdata dead). PUC Lua dialect compatibility is deferred to the future C/libcapos port. See Lua Scripting proposal. |
| JavaScript / TypeScript | Future design | QuickJS-style native runner or WASI-hosted engine; not a browser JS shell |
| WASI / WebAssembly | Phase W.5 landed 2026-05-17 05:42 UTC (Phase W.4 closed 2026-05-07 20:09 UTC; Phase W.3 closed 2026-05-07 18:25 UTC; Phase W.2 closed 2026-05-07 10:53 UTC) | Host imports backed by capabilities; useful for sandboxed code and portable tools. W.1 vendored upstream wasmi (v1.0.9) at vendor/wasmi-no_std/wasmi-1.0.9/ and shipped the capos-wasm/ standalone crate that exposes a Runtime value (wasmi Engine + Store<HostState>). W.2 sub-slice 1 added the wasm-host userspace binary in capos-wasm/src/bin/wasm-host.rs, the system-wasm-host.cue focused-proof manifest, and make run-wasm-host, which still asserts the empty-instantiation regression. W.2 sub-slice 2 grew the same binary with the Preview 1 import resolver in capos-wasm/src/wasi/preview1.rs: 46 wasi_snapshot_preview1 imports land on the wasmi linker; clock_time_get(CLOCKID_MONOTONIC) is backed by the manifest-granted Timer cap; proc_exit exits via capos_rt::syscall::exit; fd_write(1, …) / fd_write(2, …) route through the manifest-granted Console cap with a fixed 4 KiB iov-total scratch ceiling and a 1 KiB per-call chunk that matches the kernel Console cap’s MAX_SERIAL_CAP_WRITE_BYTES; everything else (including random_get, which Phase W.4 promotes against EntropySource) returns ERRNO_NOSYS. A 114-byte hand-encoded probe module imports random_get, calls it once, stores the returned errno in an exported global, and the host refuses to print the [wasm-host] preview1 imports linked: clock_time_get, fd_write, proc_exit, args/environ empty; nosys=52 proof line unless that errno equals ERRNO_NOSYS. W.2 sub-slice 3 added demos/wasi-hello-rust/ (a one-liner println! Rust crate built for the upstream wasm32-wasip1 target), system-wasi-hello-rust.cue (now grants console, timer, and the optional boot (BootPackage) cap), tools/qemu-wasi-hello-rust-smoke.sh, and make run-wasi-hello-rust. The wasm-host binary keeps running the sub-slice 1 + 2 regression first; when the manifest grants boot, it also reads the manifest blob through BootPackage, decodes binaries[] via raw capnp readers (new capos_wasm::payload module), instantiates the wasi-payload wasm, explicitly invokes the _start export (wasmi’s instantiate_and_start runs the WebAssembly start section, NOT WASI’s _start), and lets the payload’s println! reach the kernel Console cap through Preview 1 fd_write. capos-rt grew narrow re-exports (capos_capnp and default_reader_options) so capos-wasm keeps a single direct path-dep on capos-rt and the vendored wasmi tree. The slice also kept the W.2 sub-slice 1 userspace-image budget bump (USER_STACK_BASE 0x100_0000) for wasmi’s ~3 MiB BSS. W.2 sub-slice 4 closed Phase W.2 by adding demos/wasi-hello-c/ (a single printf("Hello, wasi from capOS C\n") C main() built directly with system clang-18 against the Ubuntu wasi-libc + libclang-rt-18-dev-wasm32 apt packages: clang --target=wasm32-wasi --sysroot=/usr -O2 -Wall -Wextra produces a ~46 KiB wasm32-wasi module), system-wasi-hello-c.cue, tools/qemu-wasi-hello-c-smoke.sh, and make wasi-hello-c-build / make run-wasi-hello-c. C runs on capOS without any libcapos/POSIX work in tree because the wasm-host payload-load path landed in sub-slice 3 carries the C .wasm payload through the same wasm-host binary unchanged. Phase W.3 backed args_get / args_sizes_get with the manifest-supplied initConfig.init.wasiArgs text grant: the wasm-host walks the field through raw capnp readers in capos_wasm::payload::read_wasi_args, validates against WASI_ARGS_MAX_COUNT = 32 / WASI_ARGS_MAX_ARG_BYTES = 4096 / WASI_ARGS_MAX_TOTAL_BYTES = 8192 (rejecting interior NUL bytes), packs the bytes into a per-instance HostState argv buffer, and reflects them through Preview 1 to the wasm guest. A 2026-05-13 bounded environment grant mirrors that path for initConfig.init.wasiEnv: the wasm-host walks capos_wasm::payload::read_wasi_env, validates against WASI_ENV_MAX_COUNT = 32 / WASI_ENV_MAX_ENTRY_BYTES = 4096 / WASI_ENV_MAX_TOTAL_BYTES = 8192 (rejecting interior NUL bytes), packs KEY=value entries into a per-instance environment buffer, and reflects them through Preview 1 environ_get / environ_sizes_get; absent grants remain empty. The W.2 sub-slice 2 “args/environ empty” proof line stays byte-identical because the regression module passes empty argv and no environment. The new demos/wasi-cli-args/ Rust smoke (println! of argv[1]), system-wasi-cli-args.cue, tools/qemu-wasi-cli-args-smoke.sh, and make wasi-cli-args-build / make run-wasi-cli-args close the per-instance argv plumbing; demos/wasi-env/, system-wasi-env.cue, tools/qemu-wasi-env-smoke.sh, and make wasi-env-build / make run-wasi-env prove one granted environment value reaches a Rust wasm32-wasip1 payload. Schema/schema/capos.capnp is unchanged because initConfig is already a CueValue and unknown sub-fields under initConfig.init are ignored by the existing manifest decoder. Phase W.4 wires Preview 1 random_get through the kernel EntropySource cap. The wasm-host (capos-wasm/src/bin/wasm-host.rs) looks up an optional per-instance EntropySource cap from the CapSet under the well-known name random and installs the typed EntropySourceClient on HostState AFTER the W.2 sub-slice 2 probe regression has run, keeping the closed-fail nosys=52 proof line byte-identical. Preview 1 random_get (capos-wasm/src/wasi/preview1.rs) drains arbitrary wasm-supplied byte ranges through EntropySourceClient::fill_wait, chunked at the kernel cap’s MAX_ENTROPY_FILL_BYTES = 64 ceiling and capped per Preview 1 invocation at RANDOM_GET_MAX_BYTES = 65_536. RDRAND-unavailable / truncated kernel responses surface as ERRNO_IO; oversized requests as ERRNO_INVAL; out-of-bounds wasm pointer writes as ERRNO_FAULT. Manifests without the grant keep returning ERRNO_NOSYS from the closed-fail refusal branch which never enters the kernel, so an instance without an EntropySource grant cannot leak entropy. Wall-clock support stays deferred until capOS has a typed WallClock/RealTimeClock cap; clock_time_get(CLOCKID_REALTIME) keeps the W.2 sentinel ERRNO_NOSYS. The new demos/wasi-random/ Rust smoke (raw Preview 1 random_get binding reading N=64 bytes), system-wasi-random.cue (granted), system-wasi-random-ungranted.cue (ungranted), tools/qemu-wasi-random-smoke.sh, tools/qemu-wasi-random-ungranted-smoke.sh, make wasi-random-build, make run-wasi-random, and make run-wasi-random-ungranted close Phase W.4. A 2026-05-13 compatibility-import smoke adds demos/wasi-stdio-fd/, system-wasi-stdio-fd.cue, tools/qemu-wasi-stdio-fd-smoke.sh, and make run-wasi-stdio-fd; it directly imports clock_res_get(MONOTONIC), sched_yield, fd_fdstat_get(1/2), and fd_seek(1/2) and requires every promoted import to return a non-ERRNO_NOSYS result without granting filesystem, socket, or stdin authority. A 2026-05-13 harness-hardening smoke adds demos/wasi-preview1-refusals/, system-wasi-preview1-refusals.cue, tools/qemu-wasi-preview1-refusals-smoke.sh, and make run-wasi-preview1-refusals; it directly imports path_open, fd_prestat_get, fd_read, sock_send, and sock_recv and asserts the documented fail-closed errno when no Namespace/File/Store/socket authority exists. Phase W.5 (2026-05-17 05:42 UTC) wires the Preview 1 preopened-directory filesystem against the kernel Directory / File cap interface: the wasm-host looks up an optional per-instance Directory cap from the CapSet under the well-known name root and installs it as a single Preview 1 preopen at fd 3 named /preopen-0. capos-wasm/src/wasi/fs.rs implements path_open, fd_read, fd_write, fd_seek, fd_close, fd_filestat_get, fd_prestat_get, and fd_prestat_dir_name over DirectoryClient / FileClient; the resolver mirrors POSIX P1.4 Slice 4’s libcapos-posix/src/path.rs – intermediate segments walk Directory.sub, the leaf mints either an existing or freshly created File via `Directory.open(flags=CREATE |
| POSIX-shaped software | Partial implementation | Compatibility adapter over explicit file, directory, socket, stdio, timer, process, and namespace caps. See POSIX Adapter proposal and plan. P1.1, P1.2, and P1.3 are closed; the former direct DNS smoke is retired with the qemu-only kernel UdpSocket owner, while make run-posix-pipe-smoke, make run-posix-spawn-smoke, and make run-posix-stdio-smoke cover pipe/fork-for-exec, direct posix_spawn, and Console-backed stdio surfaces. P1.4 file/directory fd work closed at commit f97d9833 (2026-05-23 06:23 UTC): make run-posix-file proves open(), write(), lseek(), read(), opendir(), readdir(), and closedir() through a live C process over the RAM-backed root Directory cap. Closed P1.4 successors now include printf/string (make run-posix-printf), identity stubs (make run-posix-identity), and signal/time stubs (make run-posix-signal-time). Remaining P1.4 work is dash vendoring/patching, the multi-translation-unit C build, and make run-posix-shell-smoke; long-form decomposition lives in POSIX Adapter Dash Port. |
Native Rust Today
The implemented path is Rust without the standard library. Programs use
core and may use alloc types such as Vec, String, Box, and
BTreeMap because capos-rt installs a userspace allocator. They do not get
std::fs, std::net, std::thread, println!, environment variables,
process arguments, or a libc syscall table.
capos-rt owns the repeated runtime machinery:
- the
_startentry point andcapos_rt_mainhandoff; - fixed heap initialization;
- panic output through an emergency Console path when available;
- raw
exitandcap_entersyscall wrappers; - CapSet lookup and interface-id checks;
- a single-owner ring client;
- typed clients for implemented kernel and service capabilities;
- result-cap adoption and queued local release.
Native programs should keep ordinary Rust business logic in normal modules and push OS interaction to typed capOS clients. That keeps pure logic host-testable while making authority visible at capability lookup and child-spawn sites.
Why std Is Different
Rust std is not just “more Rust.” It is an operating-system binding. It
expects an implementation of filesystem, networking, threads, time, standard
I/O, process, environment, synchronization, and platform error APIs. On Linux
those calls are ambient: a process can ask the kernel to open a path or create
a socket and the kernel consults global process credentials.
capOS does not have that ambient authority model. A future Rust std path must
choose how each std feature gets authority:
std area | capOS authority source |
|---|---|
std::io::{stdin, stdout, stderr} | StdIO, Console, or TerminalSession caps |
std::fs | scoped Directory, File, Store, or Namespace caps |
std::net | socket or listener caps minted by a network service |
std::thread | ThreadSpawner, ThreadControl, ThreadHandle, and ParkSpace support |
std::time | Timer and future wall-clock caps |
| process spawn and wait | ProcessSpawner, RestrictedLauncher, and ProcessHandle caps |
std::env and current directory | synthetic runtime state backed by manifest or namespace caps |
That mapping can be implemented as a capOS std backend, a Rust compatibility
crate, or a POSIX-style adapter. The project has not selected one shared ABI
for all language runtimes.
Compatibility Terms
Use these terms instead of the vague phrase “compatibility layer”:
- Native runtime adapter: language-specific runtime glue that talks to
capOS capabilities directly.
capos-rtis the implemented Rust example;GOOS=caposwould be the Go example. - Capability-native bindings: generated or handwritten bindings that expose Cap’n Proto interfaces as language-level APIs without POSIX names.
- POSIX compatibility adapter: a libc or library surface that translates
open,read,write,socket,poll,clock_gettime, and similar APIs into operations on granted capabilities. - WASI host adapter: a WebAssembly host implementation whose imports are backed by granted capOS capabilities.
The adapter may make code look familiar, but it cannot create authority. A process without a namespace cap still cannot open a file. A process without a network cap still cannot create a socket. A process without a launcher or spawner cap still cannot create children.
Language Tracks
Rust
Rust is the only implemented userspace language. The current target is
targets/x86_64-unknown-capos.json, which exposes target_os = "capos" while
keeping the booted userspace baseline no_std, static, and panic = "abort".
init, demos, shell, and the capos-rt smoke binary build through this
custom target.
Open work before broader Rust support:
- generated clients after the schema surface stabilizes;
- runtime ParkSpace clients and multi-threaded ring demultiplexing;
- a decision on Rust
stdover native capabilities versus a POSIX adapter; - package/build conventions for out-of-tree capOS Rust programs.
C and C++
C support is in tree as a Phase 0 substrate. The libcapos/ crate compiles to
libcapos.a, a thin Rust staticlib that exposes the capos-rt syscall, ring
CALL, CapSet lookup, typed Console.writeLine, Timer.now,
EntropySource.fill, VirtualMemory wrappers, native
ProcessSpawner.createPipe / Pipe wrappers, and the global allocator under
an extern "C" ABI. C binaries link statically against the archive and run on
the same userspace ELF layout as Rust demos; make run-c-hello boots a C
main() that calls the baseline wrappers, and make run-c-pipe boots a C
main() that creates a Pipe, round-trips a marker, closes the writer, and
observes EOF. The substrate is intentionally narrow – no
errno, no fd table, no POSIX surface – so the separate
libcapos-posix layer can own those decisions without churning the
substrate. The same archive is what later runtimes such as CPython,
MicroPython, Lua, and QuickJS will link against.
libcapos-posix/ builds libcapos_posix.a on top of libcapos.a and
ships the v0 POSIX surface: a 32-fd static table, __errno_location()
TLS, UDP socket/sendto/recvfrom/close over the kernel
UdpSocket cap, clock_gettime(CLOCK_MONOTONIC, ...) and
gettimeofday(&tv, NULL) over the kernel Timer cap,
pipe/read/write/dup/dup2 over the kernel Pipe cap, file/directory
fd operations (open, lseek, opendir, readdir, closedir) over the
RAM-backed root Directory cap, and a recording-shim
fork/execve/waitpid/_exit/posix_inherit_stdio path plus direct
posix_spawn with posix_spawn_file_actions support, all routed through
ProcessSpawner.createPipe / ProcessSpawner.spawn when spawning is needed.
The shipped smokes are make run-posix-pipe-smoke,
make run-posix-spawn-smoke, make run-posix-stdio-smoke, and
make run-posix-file. The former make run-posix-dns-smoke target is retired
with the qemu-only kernel UdpSocket owner. The remaining v0
phase is the dash port (Phase P1.4) over the kernel RAM-backed
File/Directory/Store/Namespace caps from Storage Phase 3 slices 1-3.
See docs/backlog/posix-adapter-dash-port.md for the long-form
decomposition.
C++ should wait until the C substrate exists and the project decides its C++ ABI policy: exceptions, RTTI, TLS, allocation, unwind behavior, and standard library scope. A freestanding container/arena subset is plausible earlier than full hosted C++.
Go
Go is a dedicated future design because its runtime is close to a userspace
operating system. A native GOOS=capos port needs virtual memory reservation
and commitment, TLS setup, OS-thread creation, park/wake, monotonic time,
debug output, process exit, and eventually network polling.
The current kernel/runtime substrate already proves useful pieces:
VirtualMemory, Timer, ThreadControl, ThreadSpawner, ThreadHandle,
and private ParkSpace wait/wake exist at the capOS level. The missing work is
the Go runtime port and the runtime-side integration contract, not a new
ambient syscall namespace.
Go through WASI may be sufficient for CPU-bound tools such as CUE evaluation;
that path is tracked as Phase W.8 in
WASI Host Adapter (TinyGo
or upstream Go GOOS=wasip1 against a future ScriptPackage cap). Native
GOOS=capos remains the path for Go network services and full runtime
behavior.
Python
Python is not currently supported on booted capOS. The practical paths are:
- Native CPython through a POSIX compatibility adapter. This depends on the C/libc substrate plus file, stdio, timer, networking, and process adapters. It is the likely path for trusted system scripts, configuration tooling, and Python programs that need capOS networking or storage.
- MicroPython through the same native C substrate. This is a smaller early scripting option with less runtime surface than CPython.
- WASI or Emscripten-hosted Python. This is useful for sandboxed or compute-oriented Python. It still runs a Python interpreter; WebAssembly is the sandbox/host ABI, not a way to avoid Python runtime work.
Current upstream CPython support is relevant but not sufficient by itself:
PEP 11 lists
wasm32-unknown-wasip1 as a Tier 2 CPython platform and
wasm32-unknown-emscripten as Tier 3, while
PEP 776 records Emscripten support for
Python 3.14. Those targets help the WASM path. They do not provide native capOS
file, socket, thread, or capability bindings.
Lua
Lua is a capability-scoped scripting runner. The target is not a POSIX Lua
shell. A capos-lua process should receive an exact CapSet, load curated
standard libraries, expose capabilities as unforgeable host userdata, deny raw
CapIds, and flush owned handles at script exit.
Phase 0 lives in demos/lua-smoke/ as a hand-written Lua-subset interpreter
written entirely on top of capos-rt. It exists to validate the long-term
capability-aware host API design (typed userdata, obj:method(args) dispatch
through a host registry, no raw SQE or method-id leak into Lua, errors
surfaced as Lua runtime errors) without committing capOS to a particular Lua
dialect. The interpreter accepts a strict subset (local, if/elseif/
else, numeric for, while, integer/float arithmetic, string concat,
comparison, obj:method(args) calls); tables, closures, coroutines,
metatables, and the Lua standard library are not implemented.
Upstream PUC Lua is a small C implementation, so the dialect-compatible path
waits on the C/libcapos substrate. The Phase 0 interpreter is not a promise
of PUC Lua compatibility and the smoke binary is explicitly labelled
runtime = "capos-lua-subset" rather than lua-5.x. When the C/libcapos
port lands, the embedded interpreter is replaced or kept as a research-grade
sandbox; the host binding shape stays.
JavaScript and TypeScript
JavaScript support means running an engine as an ordinary capOS process. A small QuickJS-style runner is the plausible early native path once C support exists. V8 or SpiderMonkey are much larger C++ runtime ports and should be treated as later experiments. TypeScript would normally compile before execution; capOS should not make a TypeScript compiler part of the kernel or base runtime.
WASI and Browser WebAssembly
WASI support is a host-runtime track: the host imports become capability
calls. The full design is in the
WASI Host Adapter proposal, and
the implementation decomposition is in
WASI Host Adapter. The
proposal selects wasmi for the v0 phases (no_std + alloc userspace
runtime, fuel metering, externref support) and frames wasmtime / WAMR as
the W.7+ migration targets. Each WASI import is backed by a typed capOS
capability the host adapter already holds; ungranted authority is refused,
not synthesised. WASI is a good fit for code that is already designed
around explicit imports and sandboxed execution. It is not a replacement
for native runtime ports when the language expects OS threads, signals,
sockets, memory mapping, or a large POSIX surface.
The browser/WebAssembly proposal is separate. It explores running capOS concepts in a browser using worker-per-process isolation and SharedArrayBuffer-backed rings. It is a teaching and demo target, not current native userspace language support.
Proposal Map
- Userspace Runtime documents the current
capos-rtimplementation. - Userspace Binaries owns the native binary, language, POSIX-adapter, and WASI roadmap.
- Go Runtime owns the native Go plan.
- Go VirtualMemory Contract freezes the allocator-facing memory contract needed by Go-style runtimes.
- Lua Scripting owns the Lua runner design.
- WASI Host Adapter owns the WebAssembly host adapter design; the implementation decomposition lives in WASI Host Adapter.
- POSIX Adapter owns the POSIX-compatibility roadmap above the libcapos C-substrate; the implementation decomposition lives in POSIX Adapter.
- Shell distinguishes native shell behavior from POSIX shell compatibility.
- Storage and Naming defines the
Directory,File,Store, andNamespacesurfaces that future POSIX and language runtimes will consume. - Browser/WASM owns the browser-hosted WebAssembly experiment.
- LLVM Target records target-triple, Rust
std, Go, C, TLS, and ABI grounding.
Validation
Current language-runtime validation is Rust-only:
tools/check-userspace-runtime-surface.shverifies thatcapos-rtowns_start, panic handling, allocator setup, raw syscalls, and entry macros.make capos-rt-check,make init-capos-build,make demos-capos-build,make shell-capos-build, andmake capos-rt-capos-buildbuild the booted userspace artifacts against the capOS custom target.make run-smoke,make run-spawn,make run-shell, andmake run-terminalexercise the runtime surface through QEMU.
No page should claim support for Python, Go, Lua, C, C++, JavaScript, WASI, or
Rust std until there is a booted artifact and a validation target for that
runtime.
Trust Boundaries
This page gives reviewers one place to find the hostile-input boundaries, trusted inputs, and current isolation assumptions that matter for capOS security review.
Current Boundaries
Ring 0 to Ring 3
- Trust rule: the kernel trusts no userspace register, pointer, SQE, CapSet, or result-buffer field.
- Implemented: syscall arguments, user-buffer ranges, page permissions,
opcodes, and capability-table lookups are validated before privileged use in
kernel/src/arch/x86_64/syscall.rs,kernel/src/mem/paging.rs,kernel/src/mem/validate.rs, andkernel/src/cap/ring.rs. - Validation: Panic Surface Inventory
and
REVIEW.mdat the repository root.
Capability Table to Kernel Object
- Trust rule: a process acts only through a live table-local
CapIdwith matching generation and interface. - Implemented:
capos-lib/src/cap_table.rsowns generation-tagged slots; kernel capability dispatch goes throughCapObject::call. - Validation:
cargo test-libplus QEMU ring and IPC smokes recorded in done task records.
Capability Ring Shared Memory
- Trust rule: userspace owns SQ writes, but the kernel owns validation, dispatch, completion, and failure semantics.
- Implemented: SQ/CQ headers and entries live in
capos-config/src/ring.rs; kernel dispatch bounds indexes, opcodes, transfer descriptors, and CQ posting, and copies CALL/RECV/RETURN buffers while holding the owning processAddressSpacelock. - Validation:
cargo test-ring-loomplus QEMU ring corruption, reserved opcode, fairness, IPC, and transfer smokes.
Endpoint IPC and Transfer
- Trust rule: IPC cannot create or destroy authority except through explicit copy, move, release, or spawn transactions. Delegating an imported client facet must preserve service-visible identity, and shared-service handlers should derive caller identity from live caller-session metadata.
- Implemented:
kernel/src/cap/endpoint.rs,kernel/src/cap/transfer.rs, andcapos-lib/src/cap_table.rsimplement queued calls,RECV/RETURN, copy/move transfer, legacy receiver metadata propagation, and rollback. Legacy badge metadata is a debug/test surface only. Normal chat, stdio bridge, and shared demo handlers use live caller-session metadata instead of caller-selected badge identity. - Validation: Authority Accounting and any open transfer review-finding task records.
Manifest and Boot Package
- Trust rule: boot manifest bytes and embedded binaries are untrusted until parsed and validated. Only BootPackage holders can request chunked manifest bytes; ordinary services receive no default boot-package authority.
- Implemented:
tools/mkmanifestvalidates the embeddedinitConfiggraph before serialization. Kernel bootstrap validates the kernel-owned manifest boundary before loading one init process; init BootPackage validation resolves service graph references before spawning children.kernel/src/cap/boot_package.rs,capos-lib/src/elf.rs, and load paths still enforce manifest-read, ELF, load-range, CapSet, and interface bounds. - Validation:
cargo test-config,cargo test-mkmanifest,cargo test-lib, manifest and ELF fuzz targets, andmake run.
Process Spawn Inputs
- Trust rule: parent-supplied spawn params, ELF bytes, grants, legacy badges, and result-cap insertion must fail closed. Endpoint kernel grants must not share owner caps with the parent. Delegated client facets must not be relabeled into another service identity.
- Implemented:
ProcessSpawnervalidates ELF load, grants, frame exhaustion, parent cap-slot exhaustion, child-local endpoint creation, and parent-only client result facets. DelegatedClientEndpointgrants preserve source identity; explicit relabel encodings fail closed except for owner or trusted parent endpoint-result caps. - Validation: spawn QEMU smoke evidence and review-finding task records.
Console Authentication and Setup
- Trust rule: console input, account selectors, password verifiers, setup tokens, and passkey challenge state are hostile or sensitive until a login/session component validates them.
- Implemented:
Consoleremains output-only. The first interactive boundary is session-scopedTerminalSessionwith boundedreadLine, visible/hidden echo, structured cancellation, one move-only foreground holder, caller-session-checked output, and stale-input scrubbing on cancel or owner teardown.CredentialStoreverifies one manifest-supplied Argon2id operator credential and one bounded volatile RAM-overlay password created by first-boot setup.capos-shelldrivesloginandsetup; there is no separateConsoleLoginservice. - Validation: Boot to Shell,
boot-to-shell gates in
../../docs/tasks/README.md,make run-terminal,make run-login, andmake run-login-setup. - Open/future: durable multi-account credential storage, multiple verifier records, rotate/disable state, broader anti-enumeration audit policy, and bounded single-use setup-token/challenge state.
Session Authority and Audit
- Trust rule: authenticated sessions receive only broker-issued narrow caps. Audit output, service logs, terminal output, and failed-auth diagnostics must not disclose secrets or verifier material.
- Implemented:
SessionManagermints entropy-backedUserSessionmetadata for operator, explicitly seeded guest, and anonymous profiles. Endpoint caller-session references are HMAC-SHA256 values scoped by an entropy-backed boot key and non-reused endpoint service-scope id.AuthorityBrokervalidates session/profile matches before minting bundles.RestrictedLauncherreturns shell-scoped launch paths instead of BootPackage or broad ProcessSpawner authority. - Validation: User Identity and Policy,
Boot to Shell,
make run-login,make run-login-setup, and future auth/session hostile input tests. - Open/future: audit records, stable service-audit identity across endpoint
replacement, opaque record IDs, mutable session liveness cells,
UserSession.logout, owner-shell/gateway close propagation, narrow renewal paths, broader policy evaluation, and web-terminal origin/RP-ID validation.
SSH Remote Shell Ingress
- Trust rule: SSH network input, keys, usernames, channel requests, PTY state, environment requests, and disconnects are hostile until the gateway validates protocol state, authenticates the user, and receives a broker-issued shell bundle.
- Implemented: current proofs cover schema stubs, one development-only
host-key fixture, manifest-seeded
AuthorizedKeyStore, public-key session minting, unsupported-feature policy, restricted shell launch, and bounded terminal-host wiring over host-local plain TCP. The proposal keeps SSH transport authority inSshGateway; the spawned shell receives only anSshTerminalFactory-producedTerminalSessionplus the normal scoped session bundle throughRestrictedShellLauncher. Fixture host-key signing, authorized-key mapping, public-key session minting, SSH policy rejection, and restricted launcher inputs fail closed for malformed or unsupported cases. - Validation: SSH Shell Gateway,
Runtime, Networking, and Shell,
make run-ssh-host-key,make run-ssh-authorized-key,make run-ssh-public-key-session,make run-ssh-public-key-auth,make run-ssh-feature-policy, andmake run-restricted-shell-launcher(the plain-TCPrun-ssh-gateway-terminal-hostterminal-host proof is retired with the kernel socket owner). - Open/future: full SSH transport transcript and channel binding, password-auth verifier/backoff wiring, production host-key storage, broader account storage, and production remote-shell hardening.
Identity Metadata and Account Records
- Trust rule: users, principals, accounts, sessions, roles, and profile names are policy metadata, not kernel subjects, ambient authority, or substitutes for held capabilities.
- Implemented: sessions receive capabilities only after authentication and broker policy evaluation; a principal or account record does not run or call the kernel.
- Validation: Local Users, Storage, and Policy and User Identity and Policy.
- Open/future: durable local account-store behavior, profile persistence, rollback checks, and quota enforcement.
Host Tools and Filesystem
- Trust rule: manifest/config input must not escape intended source directories or invoke unconstrained host commands.
- Implemented:
tools/mkmanifestvalidates references and path containment, rejects unpinned CUE compilers, and Makefile targets route CUE and Cap’n Proto through pinned tool paths. - Validation: Trusted Build Inputs,
make generated-code-check, andmake dependency-policy-check.
Generated Code and Schema
- Trust rule: schema, generated bindings, and no_std patches are trusted build inputs.
- Implemented:
schema/capos.capnp, build scripts,tools/generated/capos_capnp.rs, andtools/check-generated-capnp.shmake generated-code drift review-visible. - Validation: Trusted Build Inputs
and
make generated-code-check.
Device DMA and MMIO
- Trust rule: current userspace receives no raw DMA buffer, device physical address, virtqueue pointer, BAR mapping, or device interrupt handle.
- Implemented: the QEMU virtio-net path is allowed only through kernel-owned bounce buffers and kernel-owned MSI-X source records routed through the kernel device interrupt dispatch table, bounded device MSI vector pool, and kernel-owned route lifecycle checks.
- Validation: DMA Isolation
and
make run-net. - Open/future: typed
DMAPool,DeviceMmio, andInterruptcapabilities for userspace-driver transition.
Panic and Emergency Paths
- Trust rule: hostile input should produce controlled errors, not panic, allocate unexpectedly, or expose stale state.
- Implemented: ring dispatch is mostly controlled-error; remaining panic surfaces are classified by reachability and tracked as hardening work.
- Validation: Panic Surface Inventory
and
REVIEW.md.
Security Invariants
- All authority is represented by capability-table hold edges; no syscall or host tool path should bypass the capability graph.
- Identity metadata is not authority. Principals identify audit and policy subjects, accounts store durable policy inputs, profiles select bundle and quota templates, and sessions receive caps only through explicit broker minting.
- The interface is the permission: method authority is expressed by the typed Cap’n Proto interface or by a narrower wrapper capability, not by ambient process identity.
- Kernel operations at hostile boundaries validate structure, bounds, ownership, generation, interface ID, and resource availability before mutating privileged state.
- Failed transfer, spawn, manifest, and DMA setup paths must leave ledgers, cap tables, frame ownership, and in-flight call state unchanged or explicitly rolled back.
- Trusted build inputs must be pinned or drift-review-visible before their output becomes part of the boot image or generated source baseline.
- Authentication/session code must treat credential records, setup tokens, passkey challenges, session IDs, and audit logs as security boundaries, not ordinary console text.
Open Work
- Unify fragmented resource ledgers into the authority-accounting model so reviewers can audit quotas without following parallel counters.
- Harden open panic-surface entries that become more exposed as spawn, lifecycle, SMP, or userspace drivers expand hostile input reachability.
- Keep DMA in kernel-owned bounce-buffer mode until the
DMAPool,DeviceMmio, andInterrupttransition gates have code and QEMU proof. - Do not expand production authentication or remote-shell surfaces without hostile-input tests for bounded terminal input, credential failure paths, challenge expiry/replay, audit redaction, and narrow shell cap bundles.
Verification Workflow
This page maps capOS claims to the commands, QEMU smokes, fuzz targets, proof tools, and review documents that currently support them.
Local Command Set
Use the repo aliases and Makefile targets instead of bare host commands. The
workspace default Cargo target is x86_64-unknown-none, so host tests rely on
aliases that set the host target explicitly.
| Scope | Command | What it checks |
|---|---|---|
| Formatting | make fmt-check | Rust formatting across kernel, shared crates, standalone userspace crates, and demos. |
| Config and manifest logic | cargo test-config | Cap’n Proto manifest encode/decode, CUE value handling, CapSet layout, and config validation. |
| Ring concurrency model | cargo test-ring-loom | Bounded SQ/CQ producer-consumer invariants and corrupted-SQ recovery behavior. |
| Deferred-completion concurrency model | make model-dma-deferred-completion-loom | Bounded Loom over the kernel DeferredCompletionQueue reservation budget and the multi-CPU TLB shootdown generation re-read (kernel/src/arch/x86_64/tlb.rs); budget never exceeded, no completion dropped/double-popped, no retire ahead of a covering flush. |
| DMA authority lifecycle model | make model-dma-tla | Pinned TLC bounded check of models/dma/dma_authority.tla: allocate->map->publish->complete->revoke->scrub->reuse ordering plus generation-keyed stale completion, record-before-PTE-install split, drive-pin/quarantine, and queue-enable epoch-fence interleavings. Fails closed on any invariant violation, deadlock, or analyzer error. |
| DMA assurance aggregate | make dma-assurance-model-check | Local aggregate over the DMA Alloy, TLA+, Loom, and Kani gates. Requires installed cargo-kani; GitHub CI splits the same evidence across the DMA Assurance Models and Kani Proofs jobs. |
| Shared library logic | cargo test-lib | ELF parser, frame bitmap, frame ledger, capability table, and property-test coverage. |
| Manifest tool | cargo test-mkmanifest | Host-side manifest conversion and validation behavior. |
| Userspace runtime | tools/check-userspace-runtime-surface.sh; make capos-rt-check; make init-capos-build demos-capos-build shell-capos-build | Runtime primitive ownership, custom-target boot build path, entry ABI, typed clients, ring helpers, and no_std constraints. |
| Kernel build | cargo build --features qemu | Kernel build with the QEMU exit feature enabled. |
| Generated code | make generated-code-check | Cap’n Proto compiler path/version, schema binding output equality, no_std patch anchors, Adventure and Paperclips content freshness, locked generator dependencies, and checked-in generated-output drift. |
| Dependency policy | make dependency-policy-check | cargo-deny and cargo-audit policy across root and standalone Cargo lockfiles, plus npm lockfile validation and audit checks for the docs toolchain. |
| Mandatory Kani gate | make kani-lib | Bounded capos-lib harness set for frame allocation, stale-handle rejection, frame-grant and cap-slot fail-closed accounting, and transfer-origin fail-closed behavior. |
| DMA-authority core Kani gate | make kani-dma-authority | Bounded Kani over the extracted pure DMA-authority core (capos_lib::dma_authority): a recycled slot’s generation strictly increases and never aliases a live handle, a stale-generation completion is rejected without mutating completion/free/reuse state, and a buffer cannot be re-exposed until its in-flight completion is observed. Faithful model of the kernel/src/device_dma.rs authority arithmetic; the kernel call-through is a tracked follow-up. |
| Full image build | make | Kernel, userspace demos, runtime smoke binaries, manifest, Limine artifacts, and ISO packaging. |
| Default interactive boot | make run | Operator-facing default init-owned boot path from layered system.cue: standalone init starts the foreground shell, resident demo services, and the remote-session CapSet gateway, forwards only the remote CapSet endpoint on loopback, and keeps console/debug output logged separately. |
| Default QEMU smoke | make run-smoke | Scripted focused shell-led boot from system-smoke.cue: kernel boot-launches capos-shell as init, grants the shell bootstrap cap bundle, then proves anonymous-session bootstrap, login failed-auth redaction, successful password auth, broker upgrade to operator bundle, terminal isolation, and clean halt. |
| Focused spawn QEMU smoke | make run-spawn | Narrower init-owned ProcessSpawner graph: kernel boot-launches only standalone init with Console, BootPackage, and ProcessSpawner; init validates BootPackage metadata, spawns endpoint/IPC/VirtualMemory/Timer/FrameAllocator children, waits for them, exercises hostile spawn checks, and halts cleanly. |
| Shell, terminal, and local-auth smokes | make run-shell; make run-terminal; make run-credential; make run-login; make run-login-setup | Anonymous shell behavior, TerminalSession line input and cancellation, CredentialStore verifier behavior, username-aware password login, broker-issued operator bundle upgrade, volatile first-boot setup credential creation, terminal isolation, and stale-handle release. |
| Focused service smokes | make run-chat; make run-adventure; make run-paperclips; make run-revocable-read; make run-memoryobject-shared; make run-ringtap-failing-call | Resident-service demos, the clean-room Paperclips terminal demo, revocation behavior, MemoryObject sharing, and debug-tap viewer behavior. |
| Networking smoke | make run-net; make qemu-net-harness | QEMU virtio-net attachment, kernel PCI/device-discovery path, descriptor-accounting guard evidence, ARP, and ICMP. TCP/UDP socket proof lives under the Phase C userspace network-stack gates. |
| SSH gateway proof smokes | make run-ssh-host-key; make run-ssh-authorized-key; make run-ssh-public-key-session; make run-ssh-public-key-auth; make run-ssh-feature-policy; make run-restricted-shell-launcher | Development host-key fixture validation, authorized-key mapping, public-key session minting, public-key authentication failure privacy, unsupported SSH feature policy, and restricted shell launch authority. The bounded host-local socket-to-TerminalSession wiring proof is retired with the kernel socket owner. |
Do not claim full verification unless the relevant command actually ran in the
current change. For doc-only changes, use an appropriately narrower check such
as mdbook build.
Review Workflow
- Identify the changed trust boundary or state that the change is docs-only.
- Read
REVIEW.md(at the repository root) for the applicable security, unsafe, memory, performance, capability, and emergency-path checklist. - Read the relevant review-finding task records under
docs/tasks/before judging correctness so known open findings are not treated as solved behavior. - For system-design work, list the concrete design and research files read; reviewers should reject vague grounding such as “docs” or “research”.
- Run the smallest command set that exercises the changed behavior, then add QEMU proof for user-visible kernel or runtime behavior.
- Record unresolved non-critical findings as task records under
docs/tasks/ordocs/tasks/on-hold/with concrete remediation context before treating the task as reviewed.
Evidence by Claim
| Claim type | Required evidence |
|---|---|
| Parser or manifest validation | Host tests for valid and malformed input; fuzz target when arbitrary bytes can reach the parser. |
| Kernel/user pointer safety | QEMU hostile-pointer smoke plus code review of address, length, permissions, and validation-to-use windows. |
| Ring or IPC transport behavior | Host model/property tests where possible, plus QEMU process output proving success and failure paths. |
| Userspace runtime primitive ownership | tools/check-userspace-runtime-surface.sh plus review of capos-rt/src/entry.rs, alloc.rs, panic.rs, and syscall.rs. |
| Capability transfer or release | Rollback tests for copy/move/release failure, cap-slot exhaustion, stale caps, and process-exit cleanup. A release-only proof shows local cleanup only; any claim that peers, children, sessions, or delegated holders lose authority needs a separate explicit revoke, session-expiry, object-epoch, or service-specific invalidation proof. |
| Resource accounting | Tests that prove quota rejection, matched release on success and failure, and process-exit cleanup. |
| Generated code, schema, or generated content changes | make generated-code-check and a checked-in baseline diff generated by the pinned compiler or pinned CUE/generator path. |
| Dependency or toolchain changes | Dependency-class review plus make dependency-policy-check; update Trusted Build Inputs when trust assumptions change. |
| Device or DMA work | make run-net or a targeted QEMU smoke; no userspace-driver transition without the gates in DMA Isolation. |
| Panic-surface hardening | Updated Panic Surface Inventory when reachability or classification changes. |
| Authentication and session work | Host tests for TerminalSession line-input bounds, secret-mode echo suppression, cancellation behavior, exclusive terminal handoff, non-inheritance without an explicit grant, verifier encoding, entropy-unavailable fail-closed behavior, bootstrap-plus-RAM-overlay credential handling, volatile credential/disable-state disclosure, bounded single-use setup-token/challenge first-consume/expiry/replay semantics, generic failure/backoff policy, and audit redaction with opaque record IDs plus pre-auth serial-safe failure events; QEMU proof for setup/login, failed auth, successful capos-shell launch through TerminalSession/CredentialStore/SessionManager/AuthorityBroker, lack of terminal access for an ungranted child, absence of broad BootPackage/raw ProcessSpawner caps in the shell, and fail-closed behavior when the secure-randomness path is unavailable. |
Fuzzing and Proof Tracks
The current fuzz corpus lives under fuzz/ and covers manifest Cap’n Proto
input, exported JSON conversion for mkmanifest, arbitrary ELF parser input,
Telnet IAC filtering, terminal line discipline, and ring SQE wire validation.
Run fuzzers when a change alters those parsers, schema shape, terminal/network
byte-stream handling, SQE validation, or related validation rules.
Kani coverage is intentionally narrow and lives in capos-lib, where pure
logic can be bounded without hardware state. Add or refresh Kani harnesses for
ledger, cap-table, bitmap, and parser invariants when those invariants become
part of a security claim. The required local/CI gate is make kani-lib. The
extracted DMA-authority core (capos_lib::dma_authority) has its own bounded
gate, make kani-dma-authority, which proves ownership-generation bump on
recycle, stale-handle rejection without mutation, and no-re-expose before
completion — a faithful model of the kernel/src/device_dma.rs arithmetic whose
kernel call-through is a tracked follow-up.
Loom coverage belongs in shared ring logic. Extend cargo test-ring-loom when
SQ/CQ ownership, ordering, corruption recovery, or wake semantics change.
DMA assurance model files live under models/dma/ and are bounded checked
evidence for device and cloud-backend claims. The Alloy relational authority graph
(models/dma/dma_authority.als) is now an analyzer-checked gate:
make model-dma-alloy runs the pinned Alloy Analyzer 6.2.0 headless at scope
for 4 and fails on any counterexample (free-page reachability, same-domain
IOVA uniqueness, and the ownership-generation stale-handle gate), with the
checked verdict table recorded in models/dma/README.md. The focused Loom gate
for the DeferredCompletionQueue reservation budget and the multi-CPU
generation re-read is also checked (make model-dma-deferred-completion-loom,
pinned loom 0.7.2). The TLA+ lifecycle model (models/dma/dma_authority.tla) is
now a model-checked gate as well: make model-dma-tla runs the pinned TLC
2.19 (tla2tools 1.7.4) over the bounded configuration (2 devices / 2 domains /
2 pages / 2 iovas, generations 0..1) and fails closed on any invariant
violation, deadlock, or analyzer error, with the checked result recorded in
models/dma/README.md. It covers the lifecycle ordering plus the landed
generation-keyed stale completion, record-before-PTE-install split, drive-pin/
quarantine, and queue-enable epoch-fence interleavings. The extracted pure
DMA-authority core is checked by Kani as well (make kani-dma-authority, pinned
kani-verifier 0.67.0): ownership-generation bump on recycle, stale-handle
rejection without mutation, and no-re-expose before completion over
capos_lib::dma_authority.
The DMA checked gates are wired into CI. The GitHub dma-assurance-models job
runs make model-dma-alloy, make model-dma-tla, and
make model-dma-deferred-completion-loom; the kani-proofs job runs
make kani-dma-authority after the mandatory make kani-lib gate. The local
make dma-assurance-model-check aggregate runs all four when cargo-kani is
installed. Do not claim Verus evidence – or any Alloy/TLC/Loom/Kani
DMA-authority result beyond what these targets actually check – unless the
exact command, checker version, configuration, model bounds, and output are
recorded in the task closeout.
For DMA work, map claims through
DMA Assurance Model:
TLA+ for lifecycle ordering and races, Alloy for authority topology, Kani for
pure Rust validators/accounting, and Loom for atomic or queue interleavings such
as DeferredCompletionQueue. The model supplements the required QEMU or cloud
evidence; it does not replace hardware-facing smokes.
Documentation Sources
REVIEW.md(at the repository root): rules for security, unsafe code, capability invariants, resource accounting, and emergency paths.docs/tasks/: open remediation backlog, review-finding task records, and latest verification task records.- Trusted Build Inputs: trusted compiler, generated-code, dependency, bootloader, manifest, and host-tool inputs.
- Panic Surface Inventory: classified panic-like surfaces and commands used to generate the inventory.
- Authority Accounting: authority graph, quota, transfer, rollback, and ProcessSpawner accounting invariants.
Security Verification Track Registry
The S.x labels used across this manual are registry identifiers for the
Security Verification Track. They are not product stages. When a section
mentions one of these labels, read it as shorthand for the track name below.
S.1— CI bootstrap. Status: Landed.S.2— Miri and proptest on capos-lib. Status: Landed.S.3— Manifest and mkmanifest fuzzing. Status: Landed.S.4— Ring Loom harness. Status: Landed.S.5— Kani on capos-lib. Status: Initial bounded gate landed.S.6— Security review docs stay aligned. Status: Ongoing.S.7— Stage-6-aware security refresh. Status: Planned/ongoing.S.8— Untrusted-service hardening gate. Status: Planned.S.9— Authority graph and resource accounting. Status: Design landed.S.10— Supply-chain and generated-code trusted computing base. Status: Partially landed.S.11— Device and DMA isolation gate. Status: Design accepted; implementation gates open.S.12— Kani harness bounds refresh. Status: Planned.S.13— ELF parser arbitrary-input coverage. Status: Landed.S.14— Telnet IAC filter fuzz coverage. Status: Landed.S.15— Telnet differential round-trip and line-discipline extraction. Status: Landed.S.16— Ring SQE wire-validation extraction and fuzz target. Status: Landed.S.17— Sanitizers on host tests. Status: Planned.
Subtracks Used In This Manual
S.10.0underS.10— Trusted build input inventory.S.10.2underS.10— Generated-code drift check.S.10.3underS.10— Dependency policy and no_std review gate.S.11.1underS.11— DMA capability invariants.S.11.2underS.11— Userspace-driver ownership-transition gate.
The S.11.2.0 through S.11.2.9 labels in the DMA chapter are local checklist rows for the userspace-driver transition gate. They are acceptance criteria under S.11.2, not separate project tracks.
Trusted Build Inputs
This inventory covers the build inputs currently trusted by the capOS boot
image, generated bindings, host tooling, and verification paths. It started as
the Security Verification Track S.10.0 inventory, records the Security
Verification Track S.10.2 generated-code drift check, and now also records the
Security Verification Track S.10.3 dependency policy plus the shared no_std
generated-code patch helper. The consolidated long-horizon supply-chain risk
view – floating Rust nightly, repo-pinned qemu-system-x86_64 /
xorriso digests (CI now apt-installs qemu-system-x86=1:8.2.2+ds-0ubuntu1.16,
xorriso=1:1.5.6-1.1ubuntu3, and ovmf=2024.02-2ubuntu0.8 so package identity
is captured; the OVMF firmware blob is now repo-pinned by SHA-256
(OVMF_CODE_SHA256, landed at commit f1c8c8fb, merged at ca5a1fea) and
the ovmf-verify Makefile gate fails the build on drift, but
download-and-verify of the qemu-system-x86_64 / xorriso tool blobs
remains a future step), PR-blocking CI environment provenance comparison, and
the remaining immutable-runner-image / repo-managed tool-digest gap – is
tracked as R13 in
docs/design-risks-register.md; the gap text below stays consistent with that
entry.
Summary
| Input | Current source | Pinning status | Drift-review status |
|---|---|---|---|
| Limine bootloader binaries | Makefile:5-10, Makefile:34-49 | Git commit and selected binary SHA-256 values are pinned. | make limine-verify fails if the checked-out commit or copied bootloader artifacts drift. |
| Rust toolchain | rust-toolchain.toml:1-4, .github/workflows/ci.yml | Date-pinned nightly-2026-04-20 channel with target triples and the rust-src component required by custom-target -Zbuild-std userspace builds. The CI host-baseline, dma-assurance-models, and qemu-smoke jobs explicitly request the same dated nightly. The Kani job remains pinned separately to nightly-2025-11-21 paired with the Kani-compatible bundle installed by cargo kani setup. | The dated channel resolves to rustc 1.97.0-nightly (e22c616e4 2026-04-19) (the 2026-04-20 manifest carries the previous day’s rustc commit). Bumps are review-visible as rust-toolchain.toml and workflow diffs; the advance procedure is recorded in the Rust Toolchain section below. |
| Workspace cargo dependencies | Cargo.toml, crate Cargo.toml files, Cargo.lock | Lockfile pins exact crate versions and checksums for the root workspace. Manifest requirements remain semver ranges. | make dependency-policy-check runs cargo deny check plus cargo audit against the root workspace and lockfile in CI. |
Standalone cargo dependencies (covered by make dependency-policy-check) | init/Cargo.lock, demos/Cargo.lock, demos/wasi-hello-rust/Cargo.lock, demos/wasi-cli-args/Cargo.lock, demos/wasi-env/Cargo.lock, demos/wasi-fs/Cargo.lock, demos/wasi-random/Cargo.lock, demos/wasi-preview1-refusals/Cargo.lock, demos/wasi-stdio-fd/Cargo.lock, tools/adventure-content-gen/Cargo.lock, tools/paperclips-content-gen/Cargo.lock, tools/mkmanifest/Cargo.lock, tools/remote-session-client/Cargo.lock, tools/ringtap-viewer/Cargo.lock, capos-rt/Cargo.lock, capos-service/Cargo.lock, shell/Cargo.lock, libcapos/Cargo.lock, libcapos-posix/Cargo.lock, capos-wasm/Cargo.lock, fuzz/Cargo.lock | Each standalone workspace has its own lockfile. The Makefile DEPENDENCY_POLICY_MANIFESTS / DEPENDENCY_POLICY_LOCKFILES lists drive the gate. | make dependency-policy-check runs the shared deny/audit baseline against every standalone manifest and lockfile listed above (root workspace Cargo.lock plus the 21 standalone lockfiles in this row). Cross-workspace version drift remains review-visible and intentional where lockfiles differ. |
| Standalone cargo dependencies (not yet under policy gates) | tools/remote-session-client/src-tauri/Cargo.lock, vendor/wasmi-no_std/wasmi-1.0.9/Cargo.lock | Two checked-in lockfiles fall outside DEPENDENCY_POLICY_LOCKFILES. tools/remote-session-client/src-tauri/Cargo.lock is the Tauri scaffold lockfile; make remote-session-tauri only exposes deterministic policy and check modes and reviewed dev mode – distributable package and desktop automation modes are blocked. vendor/wasmi-no_std/wasmi-1.0.9/Cargo.lock is part of the vendored upstream snapshot covered separately by the wasmi =1.0.9 path-dependency pin in capos-wasm/Cargo.toml. | Both lockfiles are review-visible through ordinary diffs but are not run through cargo deny check / cargo audit today. Promoting either into DEPENDENCY_POLICY_LOCKFILES is gated on the matching authority decision (Tauri scaffold scope decision; wasmi refresh procedure in vendor/wasmi-no_std/VENDORED_FROM.md). |
| Cap’n Proto compiler | Makefile:12-80, tools/capnp-build/src/lib.rs, capos-config/build.rs, tools/check-generated-capnp.sh, tools/mkmanifest/src/lib.rs, tools/mkmanifest/src/main.rs | Official capnproto-c++-1.2.0.tar.gz source tarball URL, version, and SHA-256 are pinned in Makefile; make capnp-ensure builds $(CAPOS_TOOLS_ROOT)/capnp/1.2.0/bin/capnp under the per-user tool cache so linked worktrees reuse it. The build rule patches the distributed CLI version placeholder to the pinned version before compiling. | The shared build helper defaults to the pinned path and rejects CAPOS_CAPNP when it points elsewhere. Make targets export the pinned path and CI persists it through $GITHUB_ENV. make generated-code-check verifies both the exact compiler path and Cap'n Proto version 1.2.0 before regenerating bindings through Cargo. mkmanifest cue-to-capnp also rejects missing or non-canonical CAPOS_CAPNP, checks Cap'n Proto version 1.2.0, and delegates schema-aware JSON-to-binary conversion to that pinned compiler. |
| Cap’n Proto Rust runtime/codegen crates | capos-config/Cargo.toml, kernel/Cargo.toml, tools/capnp-build/Cargo.toml, Cargo.lock | Cargo manifests use exact capnp = "=0.25.4" and capnpc = "=0.25.3" requirements where declared; lockfiles pin exact crate versions and checksums. | Security Verification Track S.10.3 now requires dependency-class and no_std review before these changes are accepted. |
| Kani verifier toolchain | .github/workflows/ci.yml, Makefile, tools/run-kani-proofs.sh, tools/cloudbuild-kani.yaml, .gcloudignore | GitHub CI pins kani-verifier 0.67.0; cargo kani setup installs the matching Kani bundle plus nightly-2025-11-21-x86_64-unknown-linux-gnu into the user-local Kani/rustup paths. Local make kani-lib and make kani-dma-authority expect a compatible cargo-kani install. The high-memory make kani-lib-full path uses Google Cloud Build image digest rust@sha256:adab7941580c74513aa3347f2d2a1f975498280743d29ec62978ba12e3540d3a on E2_HIGHCPU_32, installs rustup from https://sh.rustup.rs, sources /usr/local/cargo/env, initializes minimal git metadata for build tooling that expects a repository, then pins nightly-2025-11-21 plus cargo-kani 0.67.0. | The CI kani-proofs job installs kani-verifier 0.67.0, runs cargo kani setup, and executes the bounded make kani-lib harness list plus the DMA-authority make kani-dma-authority harness group. The Cloud Build config installs the same Kani version and runs make kani-lib-full; it depends on explicit source staging and logs in maintainer-private GCS buckets configured in tools/cloudbuild-kani.yaml, .gcloudignore secret exclusions, and account/project IAM for Cloud Build submission and the selected runtime service account. Version, image, worker, bucket, IAM, rustup bootstrap, synthetic git metadata, or setup-path changes are review-visible in the workflow, Cloud Build config, runner script, and this inventory. |
| Alloy Analyzer (DMA assurance model checker) | Makefile ALLOY_VERSION/ALLOY_TARBALL_URL/ALLOY_TARBALL_SHA256, tools/run-dma-alloy-model.sh, models/dma/dma_authority.als | Self-contained linux/amd64 Alloy Analyzer 6.2.0 app image (bundled Temurin JRE + native SAT solvers) pinned by SHA-256; make alloy-ensure downloads and verifies it into $(CAPOS_TOOLS_ROOT)/alloy/6.2.0/ (the jar is not vendored). This slice owns the Alloy pin shared with the scheduler lease model track. | make model-dma-alloy verifies the tarball SHA-256, checks the launcher reports version 6.2.0, and runs the relational authority-graph checks/witnesses headless at scope for 4, failing on any counterexample or analyzer error. GitHub CI runs it in the dma-assurance-models job. |
| TLC model checker (DMA assurance lifecycle model) | Makefile TLA_TOOLS_VERSION/TLA_TOOLS_JAR_URL/TLA_TOOLS_JAR_SHA256/TLA_JRE_URL/TLA_JRE_SHA256, tools/run-dma-tla-model.sh, models/dma/dma_authority.tla | tla2tools.jar 1.7.4 (TLC 2.19) pinned by SHA-256 plus a SHA-256-pinned Temurin JRE 17.0.19+10 (the bare jar needs a JVM, unlike the self-contained Alloy app image); make tla-ensure downloads and verifies both into $(CAPOS_TOOLS_ROOT)/tla/ (neither is vendored). This slice owns the TLC pin shared by the scheduler/IRQ TLA+ model tracks. | make model-dma-tla re-verifies the jar SHA-256 and the pinned JRE version, then runs TLC over the bounded .cfg (2 devices / 2 domains / 2 pages / 2 iovas, generations 0..1), failing closed on any invariant violation, deadlock, or analyzer error (exit code and the “No error” marker are both asserted). GitHub CI runs it in the dma-assurance-models job. |
| Generated capnp bindings | capos-config/src/lib.rs:10-12, tools/generated/capos_capnp.rs, tools/check-generated-capnp.sh | Generated into Cargo OUT_DIR; the expected patched output is checked in under tools/generated/. | make generated-code-check regenerates the canonical capos-config output and fails if that output differs from the checked-in baseline or if kernel-generated output reappears. |
| no_std patching of generated bindings | tools/capnp-build/src/lib.rs, capos-config/build.rs, tools/check-generated-capnp.sh | One shared build-support crate asserts the patch anchor and injects the no_std imports after generation. capos-config/build.rs calls that helper as the single schema binding owner. | make generated-code-check verifies the patched output contains the expected no_std imports and matches the checked-in baseline. |
| Generated adventure content | demos/adventure-content/content/prototype.cue, tools/adventure-content-gen/, demos/adventure-content/src/generated.rs, tools/check-generated-adventure-content.sh | Prototype mission content is authored in checked-in CUE and generated by a standalone locked Cargo host tool into a checked-in no_std Rust content blob. The checker requires the pinned CUE path under $(CAPOS_TOOLS_ROOT) and cue version v0.16.0. | make generated-code-check runs generated-adventure-content-check, which exports the CUE source as JSON, runs tools/adventure-content-gen with cargo run --locked, formats the generated output, and fails on drift from the checked-in baseline. |
| Generated Paperclips content | demos/paperclips-content/content/paperclips.cue, schema/paperclips-content.capnp, tools/paperclips-content-gen/, demos/paperclips-content/src/generated.rs, tools/check-generated-paperclips-content.sh | Paperclips game content is authored in checked-in CUE, schema-validated through the typed PaperclipsContent Cap’n Proto root, and generated by a standalone locked Cargo host tool into checked-in typed Cap’n Proto bytes embedded by a no_std Rust wrapper. The checker requires the pinned CUE path under $(CAPOS_TOOLS_ROOT), cue version v0.16.0, and the pinned Cap’n Proto compiler path/version used for schema-aware conversion. | make generated-code-check runs generated-paperclips-content-check, which exports the CUE source as JSON, converts it through mkmanifest cue-to-capnp against schema/paperclips-content.capnp, runs tools/paperclips-content-gen with cargo run --locked, formats the generated output, and fails on drift from the checked-in generated content. |
| Userspace custom target | targets/x86_64-unknown-capos.json, .cargo/config.toml, Makefile, system*.cue | Source-controlled target specification plus Cargo aliases, Makefile build wrappers, and manifest paths for booted init, demos, shell, and capos-rt runtime builds. The target JSON uses Rust nightly custom-target support and builds core,alloc from rust-src. | make init-capos-build demos-capos-build shell-capos-build capos-rt-capos-build verifies the userspace crates against target_os = "capos"; QEMU smokes embed target/x86_64-unknown-capos/release userspace artifacts. |
| Userspace runtime surface check | tools/check-userspace-runtime-surface.sh | Source-controlled script that treats capos-rt as the only owner of _start, panic, allocator, raw syscall, and entry-point macro definitions. | Run directly when runtime or userspace entry code changes; it is not a QEMU transcript assertion and does not live inline in Makefile. |
| Linker script build scripts | kernel/build.rs, init/build.rs, demos/*/build.rs, capos-rt/build.rs, capos-wasm/build.rs | Source-controlled scripts and linker scripts. capos-rt/build.rs emits the runtime linker script for both the legacy target_os = "none" userspace build path and the booted custom target_os = "capos" path. capos-wasm/build.rs mirrors the same pattern for the wasm-host bin (Phase W.2 onward) and uses cargo:rustc-link-arg-bins so the linker script applies only to the bin and not the lib. | Build rerun boundaries are explicit; generated link args are not independently audited. |
| CUE manifest compiler | Makefile CUE_TARBALL_URL/CUE_TARBALL_SHA256, tools/mkmanifest/src/main.rs, tools/mkmanifest/src/lib.rs, .github/workflows/ci.yml | make cue-ensure downloads the official cue_v0.16.0_linux_amd64.tar.gz release binary, verifies its SHA-256, extracts cue into $(CAPOS_TOOLS_ROOT)/cue/0.16.0/bin/cue, and checks the reported version – the same download-and-verify pattern used for Typst and uv. CAPOS_TOOLS_ROOT defaults to $HOME/.capos-tools (per-user shared cache); operators may override it explicitly. This replaces the prior go install cuelang.org/go/cmd/cue, which compiled from source under a floating Go toolchain rather than verifying a pinned binary by hash. | Make exports CAPOS_CUE and CAPOS_TOOLS_ROOT to tools/mkmanifest, and CI records that exact path through $GITHUB_ENV before both the host-baseline cargo test-mkmanifest gate and QEMU smoke. mkmanifest::expected_cue_path derives the same per-user path, rejects missing or non-canonical CAPOS_CUE, and checks cue version v0.16.0 before export. The same path and version checks now gate both boot-manifest compilation and mkmanifest cue-to-capnp data-message conversion. |
| Default boot manifest defaults package | cue/defaults/defaults.cue, cue.mod/module.cue, system.cue, tools/mkmanifest/src/lib.rs | cue/defaults/defaults.cue declares package defaults and exports #DefaultSystem, the shared scaffold for the default boot manifest. cue.mod/module.cue pins module: "capos.local" with language v0.16.0. system.cue imports the defaults via capos.local/cue/defaults, declares package capos, and mkmanifest --package capos system.cue manifest.bin exports the unified package. | make invokes mkmanifest with --package capos only when MANIFEST_SOURCE is system.cue; focused-proof system-*.cue manifests stay in single-file mode. The defaults package is a manifest-rule prerequisite, so edits trigger rebuilds. |
| Operator overlay surface | system.local.cue.example, system.local.cue (gitignored), .gitignore | The repo-root overlay file is system.local.cue (package capos); system.local.cue.example is the committed worked-example template. CUE’s package mode unifies it with system.cue automatically. | Operators copy the example, edit, and rebuild — system.local.cue is a wildcard-resolved manifest-rule prerequisite. The overlay is gitignored explicitly to avoid accidental commits of host-specific keys or principals. |
| Host-user manifest tag | Makefile, system.cue _user @tag(user) / _displayName @tag(displayName), tools/mkmanifest/src/lib.rs cue_export_args / cue_tags_from_env_values, target/.cue-tags.<manifest> | make run sets CAPOS_CUE_USER=$(USER). mkmanifest reads that structured account variable, derives displayName from the same account’s first GECOS/comment field in /etc/passwd when CAPOS_CUE_DISPLAY_NAME is unset, and falls back to the account name when the passwd comment is unavailable. It also reads generic CAPOS_CUE_TAGS (and --tag key=value CLI repeats) and forwards each entry to cue export --inject; structured CAPOS_CUE_USER / CAPOS_CUE_DISPLAY_NAME override duplicate generic keys. The target/.cue-tags.<manifest-bin> sentinel records the active tag state via a FORCE-prereq rule that touches the file only when content differs, so a tag change invalidates the cached manifest.bin; the recipe reads exported environment values at shell runtime rather than splicing tag text into shell syntax. | The injected user value reaches the manifest via system.cue’s _user: string | *"operator" @tag(user) and surfaces as the default local operator seed account name; displayName reaches the seed account display name. Untagged system.cue keeps the operator account-name/display-name defaults, while focused demo and smoke manifests pin their own demo fixtures. |
| mdBook documentation tools | Makefile, book.toml | GitHub release assets for mdBook v0.5.0 and mdbook-mermaid v0.17.0 are pinned by version and SHA-256 under $(CAPOS_TOOLS_ROOT), which defaults to $HOME/.capos-tools. mdbook-mermaid supplies the pinned mermaid.min.js browser bundle used by both mdBook HTML rendering and docs-PDF Mermaid rasterization. | make docs and make cloudflare-pages-build verify the tarball checksums and executable versions, refresh the Mermaid assets, and build target/docs-site. |
| Typst typesetter (paper and docs PDF builds) | Makefile TYPST_VERSION, papers/schema-as-abi/main.typ, docs/manual.typ | GitHub release asset for Typst v0.14.2 is pinned by version and SHA-256 under $(CAPOS_TOOLS_ROOT)/typst/0.14.2, mirroring the mdBook pinning pattern. typst-ensure verifies the tarball checksum and the binary’s reported version before paper and docs-PDF targets invoke it. Bundled New Computer Modern font keeps builds reproducible across hosts. | make paper rebuilds target/papers/schema-as-abi/main.pdf using the pinned Typst binary; make cloudflare-pages-build additionally publishes the PDF as target/docs-site/papers/schema-as-abi.pdf. make docs also uses Typst to compile the generated system manual PDF from docs/manual.typ plus per-page converted Markdown body content. Generated PDFs are not checked in; source main.typ, references.bib, docs/manual.typ, and documentation inputs are checked in. |
| Documentation PDF converter | Makefile UV_VERSION / MD2TYPST_VERSION, .node-version, package.json, package-lock.json, tools/md2typst-constraints.txt, tools/docs-bundle.js, tools/build-typst-manual.js, tools/mermaid-puppeteer-config.json, docs/manual.typ, docs/manual-overrides/*.typ | GitHub release asset for uv 0.11.8 (uv-x86_64-unknown-linux-gnu.tar.gz) is pinned by version and SHA-256 under $(CAPOS_TOOLS_ROOT)/uv/0.11.8. uv-ensure verifies the tarball checksum and the binary’s reported version before PDF generation. uv tool run --constraints tools/md2typst-constraints.txt --from md2typst==0.3.3 md2typst pins the Markdown-to-Typst converter and its Python dependency set. Node version 22.16.0 is declared by .node-version and package.json; the current Makefile invokes node and npm from PATH, so host Node selection remains an operator/CI environment responsibility. package-lock.json pins @mermaid-js/mermaid-cli and its Puppeteer dependency tree; make mermaid-cli-ensure runs npm ci --ignore-scripts with PUPPETEER_SKIP_DOWNLOAD=1, so Puppeteer’s install script cannot fetch a browser during dependency installation. Mermaid rasterization uses the explicit MERMAID_BROWSER_BIN Chromium/Chrome executable, passes it to Puppeteer as PUPPETEER_EXECUTABLE_PATH, and renders PDF diagrams at MERMAID_PDF_SCALE=3 by default; tools/mermaid-puppeteer-config.json disables the browser sandbox for local and gVisor build containers. tools/docs-bundle.js reads the explicit manual page list from docs/manual.typ, generates target/docs-bundle/manual.md plus one Markdown file per manual page, and docs/manual.typ owns the PDF title page, contents, page order, page styling, and override placeholders. | make docs-pdf converts each generated manual Markdown page to Typst with md2typst, normalizes anchors and links with tools/build-typst-manual.js, uses any matching checked-in docs/manual-overrides/<page-id>.typ instead of the generated page, rasterizes Mermaid diagrams through the explicit browser executable, and compiles target/docs-bundle/manual.pdf with pinned Typst. make docs copies that generated PDF to target/docs-site/manual.pdf for Cloudflare Pages publication. Generated Markdown, Typst body pages, and PDF files are ignored build artifacts, not tracked source. |
| QEMU and firmware | Makefile:85-96, tools/build-provenance.sh, .github/workflows/ci.yml qemu-smoke | The qemu-smoke CI job installs qemu-system-x86=1:8.2.2+ds-0ubuntu1.16 (amd64, noble-updates/main or noble-security/main, Ubuntu 24.04) and ovmf=2024.02-2ubuntu0.8 (amd64, noble-updates/main, Ubuntu 24.04). OVMF delivers /usr/share/ovmf/OVMF.fd – the first entry in the Makefile’s OVMF_CODE_CANDIDATES list, so the wildcard discovery resolves to that path on the pinned runner. The Makefile now also pins the selected OVMF firmware blob by SHA-256 (OVMF_CODE_SHA256) and gates the ISO and cloud-disk rules on ovmf-verify, which fails on hash drift and emits a NOTICE skip when no OVMF candidate is installed. Local boot verification still uses the host-installed qemu-system-x86_64. | make build-provenance records the current QEMU version, selected executable path, package identity when discoverable, OVMF selected path or explicit absence, OVMF package identity when discoverable, and OVMF firmware hash when the configured firmware path exists. QEMU and OVMF are identified on the CI runner by package name, exact version, architecture, normalized apt source pocket, and selected path; the QEMU binary identity is captured via dpkg-query/apt-cache policy by make build-provenance per run and the OVMF firmware-blob SHA-256 is captured the same way. make ovmf-verify fails the build when the on-host OVMF firmware blob does not match the pinned OVMF_CODE_SHA256. |
| ISO and host filesystem tools | Makefile:317-341, tools/build-provenance.sh, .github/workflows/ci.yml qemu-smoke | The qemu-smoke CI job installs xorriso=1:1.5.6-1.1ubuntu3 (amd64, noble/main, Ubuntu 24.04), make=4.3-4.1build2 (amd64, noble/main, Ubuntu 24.04), and git=1:2.43.0-1ubuntu7.3 (amd64, noble-updates/main or noble-security/main, Ubuntu 24.04). Local builds still use host-installed xorriso, sha256sum, git, make, and shell utilities. | make build-provenance records selected executable paths and package identities when discoverable for xorriso, sha256sum, make, git, and related local build tools, plus final ISO hashes. xorriso, make, and git are identified on the CI runner by package name, exact version, architecture, normalized apt source pocket, and selected path; the per-run identity is captured via dpkg-query/apt-cache policy by make build-provenance. The remaining host tools, including sha256sum, shell, build-essential, and curl, remain host-provided or package-observed rather than repo-digest-pinned. |
| Boot manifest and embedded binaries | system.cue:1-144, tools/mkmanifest/src/lib.rs:339-379, tools/mkmanifest/src/main.rs, tools/build-provenance.sh, tools/compare-build-provenance.py, Makefile:168-169, Makefile:332-341 | Source manifest is checked in; embedded ELF payloads are build artifacts or inline manifest bytes. | Manifest validation checks references and path containment. make build-provenance now writes a local provenance record with runner OS/kernel/architecture identity, Rust toolchain details, selected host-tool paths and package identities when discoverable, hashes for the selected manifest, ISO, kernel, OVMF firmware when present, and every embedded binary reported by mkmanifest --print-binaries, including file-backed and inline payloads. make build-provenance-compare compares two retained records for material drift while ignoring generated timestamp and allowed local target/ or .capos-tools/ path-root movement. |
| Vendored upstream snapshots | vendor/wasmi-no_std/, vendor/dns-c-wahern/, vendor/fatfs-no_std/, vendor/rustls-webpki/, vendor/webpki-roots/, vendor/embedded-tls/, each with a VENDORED_FROM.md | Each vendored tree is a static, pinned snapshot recorded by version/tag, commit SHA when available, commit date when available, vendoring date, and license. vendor/wasmi-no_std/wasmi-1.0.9/ pins wasmi v1.0.9 (commit 61ba65e6563d8b2f5b699b018349d3330b28b9f3, Apache-2.0 OR MIT) consumed by capos-wasm/; vendor/dns-c-wahern/src/ pins William Ahern’s dns.c rel-20160808 (commit 4ec718a77633c5a02fb77883387d1e7604750251, MIT). vendor/rustls-webpki/rustls-webpki-0.103.13/ pins rustls-webpki 0.103.13 (artifact SHA-256 61c429a8…f756e, commit 2879b2ce…728e86, ISC) and vendor/webpki-roots/webpki-roots-1.0.7/ pins webpki-roots 1.0.7 (artifact SHA-256 52f5ee44…2eb9d, commit be948464…221688, CDLA-Permissive-2.0); both are the certificates/TLS Phase-1 verifier deps consumed by the capos-tls/ Phase-1 verifier crate. vendor/embedded-tls/embedded-tls-0.19.0/ pins the embedded-tls 0.19.0 crates.io package (embedded VCS commit 865e1fd983c583228e3bbeb9f4996f1abc454ca3, Apache-2.0) consumed only by the local TLS client handshake smoke. No source patches (one integration-only empty-[workspace] marker per crate); each path dep carries an exact version = "=X.Y.Z" pin so cargo-deny’s wildcards gate stays happy. | The wasmi snapshot is exercised by make capos-wasm-build and the WASI smokes plus make dependency-policy-check (cargo-deny + cargo-audit against capos-wasm/Cargo.lock). The rustls-webpki / webpki-roots snapshots are exercised by capos-tls/ under cargo build / cargo build --features qemu (bare-metal x86_64-unknown-none) and by make dependency-policy-check against the root Cargo.lock. The embedded-tls snapshot is exercised by make run-cloud-tls-client-handshake, the focused capOS demo build, and make dependency-policy-check against demos/Cargo.lock. The dns.c snapshot is not yet on the v0 build path; demos/posix-dns-resolver/ compiles only main.c with a commented-out dns.h include. Refreshes follow the procedure recorded in each VENDORED_FROM.md. No vendor/dash/ source-build is present. |
| Build downloads | Makefile, Cargo lockfiles, rust-toolchain.toml | Limine, CUE, and documentation tool tarballs are explicitly fetched and SHA-256-verified; Cargo and rustup downloads are implicit when caches/toolchains are absent. The build no longer uses a Go toolchain: CUE is now a hash-verified release binary rather than a go install compile, so the actions/setup-go CI step and the floating go-version pin were removed. | Limine artifacts, the CUE release binary, and documentation tool tarballs are verified by SHA-256. Cargo downloads rely on upstream tooling and lockfiles, with no separate repo policy beyond the lockfile checksums. Rustup downloads are now gated by the dated nightly-2026-04-20 channel pin (see Rust Toolchain section); only the dist tarballs themselves are not yet mirrored. |
| GitHub Actions identities and runner OS | .github/workflows/ci.yml | Every third-party Action is pinned by 40-character commit SHA with a trailing # v<X.Y.Z> comment marker. The runner OS is pinned to ubuntu-24.04 rather than the floating ubuntu-latest label. | Pin bumps are review-visible as workflow diffs and the trailing version comment makes the intended release auditable. See the GitHub Actions Runner and Workflow Pinning section below for the current pin table and the bump procedure. |
Security Verification Track S.10.3 Dependency Policy
Dependency changes are accepted only if they satisfy this policy and are recorded in the owning task checklist.
Dependency classes
Use these classes when reviewing a dependency change:
- Kernel-critical no_std: crates used directly by
kernel,capos-lib,capos-config, andcapos-abi. - Userspace-runtime no_std: crates used by
init,demos, andcapos-rt. - Host/build: crates used by
tools/*,build.rshelpers, and generated output pipelines. - Test/fuzz/dev: crates gated by
dev-dependenciesortarget-specificfor fuzz/proptests/smoke support.
Required pre-merge criteria
For any added dependency (or bump in any class):
- Manifest and features are explicit. Dependency entries must include
explicit feature choices; avoid
default-features = trueunless justified. - No_std compatibility is proven for no_std classes. Kernel-critical and
userspace-runtime dependencies must compile in a
#![no_std]mode withallocwhere expected.cargo build -p <crate> --target x86_64-unknown-nonemust succeed for every kernel/no_std crate affected. - Security policy checks run and pass. CI-equivalent checks for the
touched workspace are required through
make dependency-policy-check, which runscargo deny checkon every Cargo manifest andcargo auditon every lockfile. - Dependency class change is justified in review. PR text must include target class, ownership rationale, transitive graph impact, and why the crate is not a transitive replacement for an already-allowed dependency.
- Lockfile behavior is explicit. Update only intended lockfiles and record intentional cross-workspace drift in this document if workspace purpose differs.
No_std add/edit checklist
- Reject crates that require
std, OS I/O, or unsupported platform APIs in the dependency path intended for kernel classes. - Reject dependencies that re-export broad platform facades or large unsafe surface unless there is a replacement with smaller scope and better audit visibility.
- Record a license and supply-chain review result (via policy checks) before merge.
- Confirm no
unsafecontract escapes are added without a review surface note in the relevant module.
Standing requirements
- Add Security Verification Track S.10.3 checks to the target branch plan item for any kernel/no_std crate dependency change and document the exact pass command set.
- Keep lockfile deltas review-visible in normal PR flow; lockfile pinning is the minimum bar, not the gate.
- Keep transitive drift in sync with the trust class: class-wide divergence across lockfiles requires explicit justification.
Remaining gaps after Security Verification Track S.10.3 policy
- Mirror the resolved dated nightly dist tarballs (and their SHA-256 checksums) into the per-user tool cache as a further hardening step, so bumping the pin does not depend on rustup retaining its historical manifests. The dated pin closes the floating-channel gap; tarball mirroring would close the historical-availability gap.
- Decide whether the local
make kani-libworkflow should grow a repo-managed installer/bootstrap helper or continue to rely on separately provisioned user-localcargo-kaniplus the Kani bundle/toolchain setup path. - CI now publishes
target/build-provenance.txtas a named artifact on everyqemu-smokerun (seeactions/upload-artifactstep in.github/workflows/ci.yml) and, onpull_requestevents, downloads the most recent successful main-branch artifact viaactions/download-artifactand runsmake build-provenance-compare BUILD_PROVENANCE_COMPARE_POLICY=ci-environmentagainst it as a blocking PR gate. Missing base provenance is a CI failure, not a silent skip; artifact retention is therefore part of the gate.
Build Provenance Retention And Comparison Policy
Status 2026-06-07 06:35 UTC: this policy applies to local and CI proof
artifacts produced by make build-provenance. The qemu-smoke CI job now
publishes the candidate record as a named artifact on every run and, on
pull_request events, runs make build-provenance-compare against the most
recent successful main-branch artifact with
BUILD_PROVENANCE_COMPARE_POLICY=ci-environment. That CI policy is
PR-blocking for runner, tool, Rust, OVMF package, and OVMF hash drift while
allowing expected base-vs-head source commit, ISO/kernel/manifest hash, and
embedded-payload hash differences. This remains a reproducibility evidence
policy, not a claim that production images are third-party reproducible before
the unresolved pinning gates below are closed.
Package-pin bumps for qemu-system-x86, xorriso, make, git, or ovmf
are the only planned baseline-refresh case where the PR comparison can fail on
purpose: the candidate provenance records the new reviewed package identity,
while the base-branch artifact still records the old one. That failure is not a
green PR exception and is not a workflow bypass. The bump can land only through
a reviewed local-main integration or maintainer push path after the branch’s
qemu-smoke build, make run-smoke, and make build-provenance steps pass
with the new pins, and the compare diff contains only the reviewed
package-identity changes introduced by the same branch. After the bump lands,
the next successful main-branch qemu-smoke push artifact becomes the refreshed
base provenance; unrelated PRs must wait for that artifact before their
blocking environment comparisons can pass.
For every externally cited QEMU proof, release candidate, paper artifact, or public performance/security claim, retain the following as one immutable evidence bundle:
target/build-provenance.txtfrom the exact checked commit, manifest, and recorded worktree state;- the kernel, manifest, ISO, OVMF firmware if used, and embedded-binary hashes recorded in that provenance file;
- the exact command set and QEMU transcript or host-test log used as evidence;
- the source commit hash, clean-tree assertion or retained
git diffplus untracked-file inventory, and any non-default Make variables such asMANIFEST_SOURCE,CAPOS_CUE_TAGS,QEMU_NET,MERMAID_BROWSER_BIN, orCAPOS_TOOLS_ROOT; - the
system.local.cueoverlay state for default-manifest builds: record explicit absence, or retain the file content plus SHA-256 and size. Because the overlay is gitignored and unified intosystem.cue, a commit hash alone is not enough to reconstruct a default-manifest build that used it; - the runner identity: either a pinned CI/container image digest, or the host package identities for Rust, QEMU, xorriso, make, git, OVMF firmware package, and the operating-system image when the runner is not pinned.
Retention requirements:
- Keep evidence bundles for any tagged release, published paper result, public benchmark, or public security claim for at least the lifetime of that claim.
- Keep pre-merge task evidence until the reviewed branch has merged and the next full relevant verification has superseded it.
- Keep failed evidence when it explains a known regression, review finding, or release blocker; otherwise failed local scratch logs may be discarded.
- Do not rely on
target/as the retention store.target/artifacts are local build output; retained evidence must be copied to the release, CI, or paper artifact store that owns the claim.
Comparison requirements:
- Run local comparisons with
make build-provenance-compare BASE_PROVENANCE=... CANDIDATE_PROVENANCE=...ortools/compare-build-provenance.py BASE CANDIDATE. The command exits zero only when records differ by generated timestamp and allowed local path roots such as worktreetarget/or.capos-tools/, while all hashes, versions, package identities, and runner identities match. - Run PR base-vs-head environment comparisons with
make build-provenance-compare BUILD_PROVENANCE_COMPARE_POLICY=ci-environment BASE_PROVENANCE=... CANDIDATE_PROVENANCE=.... This policy compares the default manifest source, host target, runner identity, GitHub-hosted image identity when present, Rust toolchain, selected executable identities, tool versions, OVMF selection, and OVMF hash, but ignores expected source commit, kernel/manifest/ISO hash, and embedded-binary hash changes between the base branch and PR head. - For package-pin bump branches, treat a
ci-environmentcomparison failure as acceptable review evidence only when every reported difference is an intended package-identity change forqemu-system-x86,xorriso,make,git, orovmffrom the same branch. Any runner image, Rust toolchain, OVMF firmware hash, tool-version, or unrelated package drift remains blocking. - Compare two provenance records by commit, clean or retained-diff state,
system.local.cueabsence/hash/content policy, manifest source, manifest binary hash, kernel hash, ISO hash, embedded-binary table, OVMF hash or explicit absence, host-tool versions, package identities, and operating-system image identity. - A byte-identical ISO requires all recorded hashes to match. Equal source commits with different Rust, QEMU, xorriso, OVMF, or host package identities are compatible proof reruns, not reproducible-production evidence.
- If a comparison differs only in paths under
.capos-toolsor worktree-localtarget/directories while all hashes and versions match, treat the result as the same proof environment. - If a comparison differs in worktree state, overlay state, package identity, operating-system image identity, host-tool version, OVMF hash, Rust compiler commit/date, embedded-binary hash, or ISO hash, record the difference in the owning review or release note before citing the result.
Minimum runner identity for production-hardening branches:
- Rust must be a date-pinned nightly or stronger hash-pinned toolchain, not the
floating
nightlychannel. - QEMU, xorriso, make, and git must come from a pinned runner image digest or a documented package set with package name, version, architecture, repository, and distribution release.
- OVMF firmware must be either repo-pinned by digest or identified by package name, version, architecture, repository, distribution release, selected path, and SHA-256.
- Any runner image used for production reproducibility claims must be cited by immutable digest. Mutable tags are acceptable only for local proof evidence.
Production hardening must treat the following as unresolved supply-chain gates, not as cosmetic reproducibility work:
immutable runner image digest or repo-managed tool digests for qemu/xorriso/make/git
The Rust nightly date pin (currently nightly-2026-04-20) closes the
floating-channel gate; tarball mirroring is tracked as a further hardening
step in the Remaining gaps section above. The qemu-smoke job now installs
qemu-system-x86=1:8.2.2+ds-0ubuntu1.16 (amd64,
noble-updates/main or noble-security/main, Ubuntu 24.04),
xorriso=1:1.5.6-1.1ubuntu3 (amd64, noble/main, Ubuntu 24.04),
make=4.3-4.1build2 (amd64, noble/main, Ubuntu 24.04),
git=1:2.43.0-1ubuntu7.3 (amd64, noble-updates/main or
noble-security/main, Ubuntu 24.04), and ovmf=2024.02-2ubuntu0.8 (amd64,
noble-updates/main, Ubuntu 24.04) so the QEMU, ISO writer, make, git, and
OVMF firmware identities are all captured for UEFI smoke builds: package name,
exact version, architecture, normalized apt source pocket, and the per-run
identity captured via dpkg-query/apt-cache policy by
make build-provenance. OVMF additionally records the selected path
(/usr/share/ovmf/OVMF.fd) and per-run SHA-256, and the Makefile now pins the
selected firmware blob by SHA-256 through the ovmf-verify gate wired into
the ISO and cloud-disk rules. Repo-pinned digests (download-and-verify rather
than apt-installed) for qemu-system-x86, xorriso, make, and git, or an
immutable runner image digest that contains them, remain future hardening
tracked in docs/design-risks-register.md (R13).
xorriso has no version in noble-updates; the pin uses the only available
noble/main version, which is what every Ubuntu 24.04 host resolves to.
Until those gates land, generated ISO/manifest/payload artifacts plus
target/build-provenance.txt are suitable for local and CI proof evidence, but
not for claims that a third party can reproduce an identical production boot
image from source alone.
Bootloader and ISO Inputs
The Makefile now pins Limine at commit
aad3edd370955449717a334f0289dee10e2c5f01 and verifies these copied artifacts:
| Artifact | Checksum reference |
|---|---|
$(LIMINE_DIR)/limine-bios.sys | LIMINE_BIOS_SYS_SHA256 in Makefile |
$(LIMINE_DIR)/limine-bios-cd.bin | LIMINE_BIOS_CD_SHA256 in Makefile |
$(LIMINE_DIR)/limine-uefi-cd.bin | LIMINE_UEFI_CD_SHA256 in Makefile |
$(LIMINE_DIR)/BOOTX64.EFI | LIMINE_BOOTX64_EFI_SHA256 in Makefile |
$(LIMINE_DIR) resolves to
$(CAPOS_TOOLS_ROOT)/limine/<LIMINE_COMMIT> (default
$HOME/.capos-tools/limine/<commit> unless CAPOS_TOOLS_ROOT is
overridden), shared with the rest of the per-user pinned tool cache.
make limine-ensure clones https://github.com/limine-bootloader/limine.git
only when $(LIMINE_DIR)/.git is absent, fetches the pinned commit if needed,
checks it out detached, and runs make inside the Limine tree (the
limine-ensure recipe). make limine-verify then checks the repository HEAD
and artifact checksums (the limine-verify recipe). The ISO copies the
kernel, generated manifest.bin, Limine config, and verified Limine
artifacts into iso_root/, runs xorriso, then runs limine bios-install
(the $(ISO) recipe).
Remaining reproducibility gap: Limine source is pinned, but the Limine build host compiler and environment are not pinned or recorded.
Rust Toolchain
rust-toolchain.toml specifies:
channel = "nightly-2026-04-20"targets = ["x86_64-unknown-none", "aarch64-unknown-none", "wasm32-wasip1"]components = ["rust-src"]
The wasm32-wasip1 target is needed for the WASI Preview 1 demo payloads
(demos/wasi-hello-rust/, demos/wasi-cli-args/, demos/wasi-random/) built
by make wasi-hello-rust-build, make wasi-cli-args-build, and
make wasi-random-build; the wasm-host binary itself is built for the booted
x86_64-unknown-capos userspace target instead.
The pinned dated channel resolves to:
rustc 1.97.0-nightly (e22c616e4 2026-04-19)- host target
x86_64-unknown-linux-gnu
The 2026-04-20 manifest packages the rustc commit cut on 2026-04-19; that is the
upstream dist naming convention, not a drift. Rustup will continue to install
the same dist tarball for nightly-2026-04-20 as long as upstream retains it.
The Makefile derives HOST_TARGET from rustc -vV (Makefile:12) and uses
that for tools/mkmanifest (Makefile:28-29). Cargo aliases in
.cargo/config.toml:4-48 hard-code x86_64-unknown-linux-gnu for host tests.
The custom userspace target aliases in .cargo/config.toml use
targets/x86_64-unknown-capos.json plus -Zjson-target-spec and
-Zbuild-std=core,alloc, so rust-src is a required toolchain component.
The CI host-baseline and qemu-smoke jobs install the same
nightly-2026-04-20 toolchain so CI matches the local
rust-toolchain.toml resolution. The kani-proofs job stays on
nightly-2025-11-21 because Kani requires its own paired nightly bundle
installed by cargo kani setup; advancing the Kani pin is tracked separately
through that bundle’s compatibility matrix.
Rust Nightly Date Pin Policy
The pin is one of the supply-chain-trust controls listed in this proposal
alongside the Limine commit, OVMF firmware SHA-256, capnp tarball SHA-256,
CUE binary, mdBook/mdbook-mermaid release assets, Typst binary, uv binary,
and pinned cargo-deny/cargo-audit/cargo-kani releases. All of these
must be pinned at the same trust level – date- or hash-anchored, never a
floating channel or moving tag. This subsection states the policy for the
Rust nightly entry; the next subsection states the mechanical advance
procedure.
Where the pin lives. Exactly one source: rust-toolchain.toml as a
date-anchored nightly channel of the form nightly-YYYY-MM-DD. The CI
workflow’s host-baseline and qemu-smoke toolchain: values must
mirror that same dated channel. No other file may declare a nightly date;
no float, no nightly shorthand, no commit-hash override.
Promotion criteria. A bump is accepted only when the candidate nightly satisfies all of the following against the worktree where the pin lands:
makebuilds the full workspace clean (kernel + standalone userspace + ISO) with no new warnings undercargo build --features qemu.make fmt-checkpasses across the workspace and all standalone crates.make workflow-checkpasses (CLAUDE.md token budget, mandatory-context budgets, slice trailers).make checkpasses (the aggregate build/test gate that includesgenerated-code-checkand the host-test aliases).make run-smokepasses on the developer host when QEMU smoke is feasible there; if QEMU is unavailable locally, the bump branch’s CIqemu-smokerun is the authoritative gate.- Any new rustc warning, lint, or unrelated build failure introduced by the new nightly is treated as a real gate failure. Do not relax capOS code or silence the lint to land the bump.
Rollback. If a promotion exposes a regression in a downstream crate
that capOS depends on (limine, x86_64, spin, smoltcp, wasmi,
capnp/capnpc, or any cargo-deny/cargo-audit pinned tool), revert the
pin to the prior dated channel on main, file a tracking note in
docs/tasks/ with the failing date, the failing crate, and the upstream issue
if one exists, and resume normal cadence only after the downstream regression
is resolved or worked around.
Cadence. Bump the pin at least once per quarter even without a
specific feature trigger so production-provenance evidence does not lag
upstream. Bump out of cadence when (a) a security advisory affects the
current pinned nightly’s rustc/cargo/std (consult
rust-lang/rust, rustsec/advisory-db, and the cargo-audit output for
the pinned dist), or (b) a compiler feature, fix, or lint that capOS
depends on lands upstream. Unbounded float is not permitted: the dated
channel must always resolve to a concrete YYYY-MM-DD.
Approvals. Maintainer-driven, single reviewed slice per bump. No
automated promotion bot. The pin bump is its own contract change and
must not be bundled with unrelated behavior changes; the reviewed diff
must show only rust-toolchain.toml, the CI workflow, this proposal’s
summary table and resolved-rustc line, and any minimal lint/code
adjustments forced by the new nightly with an inline justification.
Trust-input dimension. The pin closes the floating-channel supply-
chain gate listed in the Build Provenance Retention And Comparison
Policy (“Minimum runner identity for production-hardening branches:
Rust must be a date-pinned nightly or stronger hash-pinned toolchain,
not the floating nightly channel”). Mirroring the resolved dist
tarballs into the per-user tool cache (the same shape as Limine, capnp,
CUE, mdBook, and Typst pins) remains a future hardening step tracked in
the Remaining gaps section.
Advance procedure (bumping the dated nightly)
When to bump:
- A compiler feature, fix, or lint that capOS depends on lands in upstream
nightly after
2026-04-20. Example triggers: a Cargo or rustc fix that unblocks a build path; acore/allocchange that affects-Zbuild-std; arustfmtchange required for the project formatting baseline. - Toolchain drift hygiene: schedule a bump at least once per release window even without a specific feature trigger, so production-provenance evidence does not lag too far behind upstream.
How to bump:
-
Choose a candidate nightly date and verify all required targets and the
rust-srccomponent are simultaneously available for that date:rustup toolchain add nightly-<YYYY-MM-DD> \ --target x86_64-unknown-none \ --target aarch64-unknown-none \ --target wasm32-wasip1 \ --component rust-srcIf any target or component is missing, try adjacent dates (rustup’s nightly dist manifests sometimes drop a target for a single day) until one is found that provides the full set.
-
Update both files in the same commit:
rust-toolchain.tomlchannelvalue..github/workflows/ci.yml– both thehost-baselineandqemu-smoketoolchain:values. Leavekani-proofson its own pin.
-
Run the full local gate set against the candidate before pushing:
make fmt-check,cargo build --features qemu,make check,make workflow-check,make run-smoke. Treat any new warning or unrelated build failure as a real gate failure – do not patch around compiler drift by relaxing capOS code. -
Update the
Rust toolchainrow in this file’s summary table, the resolvedrustcline above, and thelast_reviewedfront-matter timestamp. Cite the new dated channel. -
Land the pin bump as its own reviewed slice, not bundled with unrelated behavior changes. The pin is itself the provenance contract.
Remaining reproducibility gap: rustup retains nightly dist manifests for a finite window. A future hardening slice may mirror the resolved dist tarballs plus their SHA-256 checksums into the per-user tool cache the same way Limine and capnp are pinned today, so a bump-without-mirror does not become a silent loss of historical reproducibility.
CI Runner Package Pins
The qemu-smoke CI job installs qemu-system-x86, xorriso, make, git,
and ovmf via apt on an ubuntu-24.04 runner. Those packages provide the QEMU
emulator that executes every QEMU smoke, the ISO writer that builds the
bootable image consumed by smokes, the build and repository tools used after
checkout, and the UEFI firmware blob selected by make run-uefi and the
cloud-disk path. A floating apt install (no =<version> specifier) would let
upstream Ubuntu silently roll any of them on the next CI run, so this section
names the version pins, the file that owns them, and the procedure for
advancing them.
CI Package Pin Policy
The pin is one of the supply-chain-trust controls listed in this proposal
alongside the Limine commit, OVMF firmware SHA-256, Rust nightly date pin,
capnp tarball SHA-256, CUE binary, mdBook/mdbook-mermaid release assets,
Typst binary, uv binary, and pinned cargo-deny/cargo-audit/cargo-kani
releases. All of these must be pinned at the same trust level – date- or
hash-anchored, never a floating channel or moving tag. This subsection
states the policy for the QEMU, xorriso, make, git, and OVMF package entries;
the next subsection states the mechanical advance procedure.
Where the pin lives. Exactly one source: the Install boot smoke dependencies step of the qemu-smoke job in .github/workflows/ci.yml.
Each package must be invoked as <name>=<exact-version> (no * wildcard
and no major-only floor) so the apt resolver fails closed rather than
silently rolling forward. The currently pinned versions are
qemu-system-x86=1:8.2.2+ds-0ubuntu1.16 (amd64,
noble-updates/main or noble-security/main, Ubuntu 24.04),
xorriso=1:1.5.6-1.1ubuntu3 (amd64, noble/main, Ubuntu 24.04),
make=4.3-4.1build2 (amd64, noble/main, Ubuntu 24.04),
git=1:2.43.0-1ubuntu7.3 (amd64, noble-updates/main or
noble-security/main, Ubuntu 24.04), and ovmf=2024.02-2ubuntu0.8 (amd64,
noble-updates/main, Ubuntu 24.04). The summary table, the QEMU and firmware
and ISO and host filesystem tools rows, the Host Tools section, and the
Build Provenance Retention And Comparison Policy mirror these strings; the
policy text is the single source of truth and the other locations track it.
Promotion criteria. A bump is accepted only when all of the following hold against the bump branch:
- The Ubuntu base image rolls (
noble/noble-updates/noble-securitypublishes a newer version of the package) or a security advisory affects the currently pinned version. Cosmetic version bumps without an upstream trigger are not accepted; the pin moves forward when there is a reason to move it. apt-cache madison <package>on a current Ubuntu 24.04 host lists the candidate version, and the candidate is available fromnoble-updates/main (ornoble/main when nonoble-updatesentry exists, as is the case forxorrisotoday). Versions sourced from third-party PPAs or*-proposedpockets are not accepted.- The bump branch’s
qemu-smokeexecution reaches and passes the new-pin build evidence steps:makebuild,make run-smoke,make build-provenance, and candidate provenance artifact upload. The pull-requestmake build-provenance-compare BUILD_PROVENANCE_COMPARE_POLICY=ci-environmentstep is expected to fail only on reviewed package-identity fields that match the new pinned strings rather than the previous ones; every other comparison difference remains blocking. - No new QEMU, xorriso, make, git, or OVMF behavior is silently relied on: if the bump unlocks a smoke that previously failed, that smoke must be enabled and reviewed in the same bump branch rather than treated as incidental.
- The
Trusted Build Inputssummary table,QEMU and firmwarerow,ISO and host filesystem toolsrow,Host Toolssection, and Build Provenance Retention And Comparison Policy text are updated to cite the new versions and the new resolved repository (noble/noble-updates) in the same commit.
Rollback. If a promotion exposes a regression in the QEMU smoke path,
the ISO writer, build orchestration, repository operations, or UEFI boot,
revert the .github/workflows/ci.yml change to the prior pinned version on
main, file a tracking task under
docs/tasks/ with the failing version, the failing smoke, and the upstream
Ubuntu/QEMU/xorriso/OVMF issue if one exists, and resume normal cadence only
after the regression is resolved or worked around. Reverting also requires
reverting the summary-table and policy text mirrors so the recorded versions
stay consistent with the workflow file.
Cadence. Bump the pins at least once per quarter even without a
specific security trigger, so production-provenance evidence does not lag
upstream Ubuntu point releases. Bump out of cadence when (a) a security
advisory affects the current pinned version of any of the packages
(consult the Ubuntu Security Notices, the QEMU security mailing list, the
Git security advisories, GNU make release notes, and the ovmf/edk2
advisories), or (b) a fix that capOS depends on lands
in a newer Ubuntu point release. Unbounded float is not permitted: each
package must always resolve to a concrete
<epoch>:<upstream>-<debian> version string.
Approvals. Maintainer-driven, single reviewed slice per bump. No
automated promotion bot. The pin bump is its own contract change and
must not be bundled with unrelated behavior changes; the reviewed diff
must show only .github/workflows/ci.yml, this proposal’s summary table and
resolved-version mirrors, the relevant production-provenance task record when
sub-items move, and any minimal smoke adjustments forced by the new package
versions with an inline justification.
Trust-input dimension. The pin closes the runner/OS/tool identity
gate listed in the Build Provenance Retention And Comparison Policy
(“Minimum runner identity for production-hardening branches: QEMU,
xorriso, make, and git must come from a pinned runner image digest or a
documented package set with package name, version, architecture,
repository, and distribution release”) for the apt-installed package set it
owns. A pinned runner image digest (replacing the ubuntu-24.04 mutable label
with an immutable image SHA) or repo-managed tool digests for those packages
remain future hardening tracked in docs/design-risks-register.md (R13).
Advance procedure (bumping the apt-pinned versions)
When to bump:
- An Ubuntu Security Notice affects the currently pinned version of
qemu-system-x86,xorriso,make,git, orovmf. - A QEMU, xorriso, make, git, or OVMF point release lands in
noble-updates/main that capOS needs (typically a virtio, MSI-X, ISO writer, build-tool, repository-tool, or UEFI fix). - Quarterly hygiene cadence with no specific feature trigger, so the pin does not lag too far behind upstream.
How to bump:
-
On a current Ubuntu 24.04 host (or a
ubuntu:24.04container that has refreshedapt-get update), list available versions of each package:apt-cache madison qemu-system-x86 apt-cache madison xorriso apt-cache madison make apt-cache madison git apt-cache madison ovmfPick the highest stable version from
noble-updates/main. If a package has nonoble-updatesentry (as is the case forxorrisotoday), pick fromnoble/main. Do not select from*-proposed,*-backports, or third-party PPAs. -
Update the single source in the
Install boot smoke dependenciesstep of theqemu-smokejob in.github/workflows/ci.ymlso each package line reads<name>=<exact-version>. -
Update the mirrors in this file in the same commit: the summary-table rows for
QEMU and firmwareandISO and host filesystem tools, theHost Toolssection, the Build Provenance Retention And Comparison Policy text, and theRemaining gaps for Security Verification Track S.10.2/S.10.3block underManifest, Embedded Binaries, and Downloaded Artifacts. Refresh thelast_reviewedfront-matter timestamp. -
If the OVMF package version moves, the OVMF firmware blob SHA-256 may change. Recompute
OVMF_CODE_SHA256inMakefilefrom the resolved firmware path (/usr/share/ovmf/OVMF.fdon Ubuntu 24.04) and verifymake ovmf-verifypasses against the new digest. Land theOVMF_CODE_SHA256change in the same commit as the package bump. -
Push the bump branch and let
qemu-smokeexercise the new pins throughmake,make run-smoke,make build-provenance, and candidate provenance artifact upload. The acceptance gate for the bump itself is those steps passing plus a reviewed PRmake build-provenance-compare BUILD_PROVENANCE_COMPARE_POLICY=ci-environmentfailure whose diff is limited to the intended package-identity strings replacing the previous ones. Land the bump through the reviewed local-main integration or maintainer push path; after it reachesmain, the next successful main-branchqemu-smokepush artifact is the new base record for unrelated PR comparisons. -
Land the pin bump as its own reviewed slice, not bundled with unrelated behavior changes. The pin is itself the provenance contract.
Remaining reproducibility gap: the ubuntu-24.04 runner label is still
managed by GitHub Actions, not by an immutable image digest, so the host
package set underneath the apt-installed qemu-system-x86, xorriso,
make, git, and ovmf pins can still roll between runs. A future hardening slice may
move the qemu-smoke job to a self-built runner image referenced by
digest, mirror the apt package files into the per-user tool cache the
same way Limine and capnp are pinned today, or both, so a bump-without-
mirror does not become a silent loss of historical reproducibility.
Cargo Dependencies
The root workspace members are capos-abi, capos-config, capos-lib,
capos-tls, kernel, and the host-only tools/capnp-build build-support
crate. Cargo.toml keeps default members to capos-config, capos-lib,
capos-tls, and kernel so ordinary root bare-metal builds do not build the
host helper as a target package but do build the capos-tls certificates/TLS
verifier-dependency probe. The vendored rustls-webpki / webpki-roots path
dependencies declare their own [workspace] and are listed in the root
Cargo.toml exclude set (the same isolation as the vendored fatfs crate),
so they are not workspace members. The vendored embedded-tls client-state
machine snapshot follows the same workspace isolation and is consumed only by
the standalone demos/ workspace. init/, demos/, tools/mkmanifest/,
tools/ringtap-viewer/, capos-rt/, shell/, libcapos/,
libcapos-posix/, capos-wasm/, and fuzz/ are standalone workspaces
with their own lockfiles.
Important direct dependencies and current root-lock resolutions:
| Dependency | Manifest references | Root lock resolution |
|---|---|---|
capos-abi | capos-config/Cargo.toml, capos-lib/Cargo.toml | local path package in Cargo.lock |
argon2 | capos-lib/Cargo.toml; optional capos-config/Cargo.toml credential-validation feature used by kernel/init/mkmanifest bootstrap validation | 0.5.3 in Cargo.lock |
capnp | capos-config/Cargo.toml, capos-lib/Cargo.toml, kernel/Cargo.toml | 0.25.4 in Cargo.lock |
capos-capnp-build | capos-config/Cargo.toml | local path package in Cargo.lock |
capnpc | tools/capnp-build/Cargo.toml | 0.25.3 in Cargo.lock |
limine crate | kernel/Cargo.toml:8 ("0.6" range) | 0.6.3 in Cargo.lock |
spin | kernel/Cargo.toml:9 ("0.9" range) | 0.9.8 in Cargo.lock |
x86_64 | kernel/Cargo.toml:10 ("0.15" range) | 0.15.4 in Cargo.lock |
linked_list_allocator | kernel/Cargo.toml:11 ("0.10" range) | 0.10.6 in Cargo.lock |
smoltcp | kernel/Cargo.toml:16 ("0.13.0" caret range) | 0.13.0 in Cargo.lock |
loom | capos-config/Cargo.toml:27 | 0.7.2 in Cargo.lock |
proptest | capos-lib/Cargo.toml | 1.11.0 in Cargo.lock |
rustls-webpki (vendored path) | capos-tls/Cargo.toml (=0.103.13, default-features = false, alloc) | local path package (vendor/rustls-webpki/rustls-webpki-0.103.13) in Cargo.lock |
webpki-roots (vendored path) | capos-tls/Cargo.toml (=1.0.7, default-features = false) | local path package (vendor/webpki-roots/webpki-roots-1.0.7) in Cargo.lock |
rustls-pki-types | transitive of the vendored rustls-webpki/webpki-roots (alloc) | 1.14.1 in Cargo.lock |
untrusted | transitive of the vendored rustls-webpki | 0.9.0 in Cargo.lock |
zeroize | transitive of rustls-pki-types (alloc) | 1.8.2 in Cargo.lock |
The four kernel-critical crates limine, spin, x86_64, and smoltcp are
declared with semver-range requirements ("0.6", "0.9", "0.15", and the
caret "0.13.0"), not the exact =X.Y.Z requirements applied to capnp
(=0.25.4) in kernel/Cargo.toml and sha2 (=0.10.9 in
capos-lib/Cargo.toml). This requirement-level asymmetry is currently
unintentional drift in manifest style rather than a deliberate policy: the
exact crate version that ships is still pinned by the checked-in Cargo.lock
checksums above and is review-visible through lockfile diffs, so a range
requirement does not widen what actually compiles without a lockfile change.
Tightening these four manifest requirements to =X.Y.Z to match capnp/sha2
is a separate build-risk change (a manifest edit plus lockfile regeneration and
re-verification), tracked as a doc-accuracy gap here rather than changed in
this inventory pass.
Standalone lockfile drift observed during this inventory:
The TLS client handshake smoke adds a userspace-runtime no_std dependency in the
standalone demos/ workspace: embedded-tls = "=0.19.0" as a path dependency
under vendor/embedded-tls/embedded-tls-0.19.0/, with default-features = false and only the rustpki feature enabled. demos/Cargo.lock pins the
resulting RustCrypto TLS 1.3 closure. The capOS custom target forces the
software AES and POLYVAL backends in .cargo/config.toml so those crypto
dependencies do not select x86 accelerated backend code that is outside the
custom-target build contract.
| Lockfile | Notable direct/runtime resolution |
|---|---|
init/Cargo.lock | capnp 0.25.4, capnpc 0.25.3, linked_list_allocator 0.10.6 |
demos/Cargo.lock | capnp 0.25.4, capnpc 0.25.3, linked_list_allocator 0.10.6 |
demos/wasi-hello-rust/Cargo.lock | Single-package leaf lockfile for the wasm32-wasip1 Rust hello payload; no third-party direct dependencies. |
demos/wasi-cli-args/Cargo.lock | Single-package leaf lockfile for the Phase W.3 argv-grant wasm32-wasip1 Rust payload; no third-party direct dependencies. |
demos/wasi-env/Cargo.lock | Single-package leaf lockfile for the WASI environment-grant wasm32-wasip1 Rust payload; no third-party direct dependencies. |
demos/wasi-fs/Cargo.lock | Single-package leaf lockfile for the WASI filesystem wasm32-wasip1 Rust payload; no third-party direct dependencies. |
demos/wasi-random/Cargo.lock | Single-package leaf lockfile for the Phase W.4 random_get wasm32-wasip1 Rust payload; no third-party direct dependencies. |
demos/wasi-preview1-refusals/Cargo.lock | Single-package leaf lockfile for the WASI Preview 1 refusal-coverage wasm32-wasip1 Rust payload; no third-party direct dependencies. |
demos/wasi-stdio-fd/Cargo.lock | Single-package leaf lockfile for the WASI stdio-fd wasm32-wasip1 Rust payload; no third-party direct dependencies. |
tools/mkmanifest/Cargo.lock | capnp 0.25.4, capnpc 0.25.3, serde_json 1.0.149 |
tools/adventure-content-gen/Cargo.lock | Host generator for adventure content; locked dependencies include serde_json and the cue-export-to-JSON pipeline; no capnp runtime dependency. |
tools/paperclips-content-gen/Cargo.lock | Host generator for Paperclips content; locked dependencies include serde_json and capnp 0.25.4 for schema-aware JSON-to-binary conversion through mkmanifest cue-to-capnp. |
tools/remote-session-client/Cargo.lock | Standalone Linux host-side remote-session client; pins capnp 0.25.4 and serde 1.0.228; no transitive wasmi, Argon2, or smoltcp dependency. Covered by make dependency-policy-check. |
tools/ringtap-viewer/Cargo.lock | capnp 0.25.4, capnpc 0.25.3; no Argon2 because it uses baseline capos-config |
capos-rt/Cargo.lock | capnp 0.25.4, capnpc 0.25.3, linked_list_allocator 0.10.6 |
capos-service/Cargo.lock | capnp 0.25.4, capnpc 0.25.3, linked_list_allocator 0.10.6 (the same allocator resolution as capos-rt/demos/libcapos; no cross-workspace drift). |
libcapos/Cargo.lock | capnp 0.25.4, capnpc 0.25.3, linked_list_allocator 0.10.6 plus the local capos-rt path dependency. |
libcapos-posix/Cargo.lock | capnp 0.25.4, capnpc 0.25.3, linked_list_allocator 0.10.6, plus local capos-rt and libcapos path dependencies. |
shell/Cargo.lock | blake2 0.10.6, capnp 0.25.4, capnpc 0.25.3, linked_list_allocator 0.10.6; no Argon2 because it uses baseline capos-config |
capos-wasm/Cargo.lock | capnp 0.25.4, capnpc 0.25.3, linked_list_allocator 0.10.6, wasmi 1.0.9 (vendored static-pinned at vendor/wasmi-no_std/wasmi-1.0.9/); no Argon2. |
fuzz/Cargo.lock | capnp 0.25.4, capnpc 0.25.3, libfuzzer-sys 0.4.12 |
tools/remote-session-client/src-tauri/Cargo.lock (not yet under policy gates) | Tauri scaffold lockfile carrying ~435 transitive packages pinned through tauri = "=2.11.1"; only reachable through make remote-session-tauri policy / check / dev modes. Not covered by make dependency-policy-check today; promotion is gated on the Tauri authority decision. |
vendor/wasmi-no_std/wasmi-1.0.9/Cargo.lock (vendored snapshot lockfile) | Upstream wasmi workspace lockfile preserved with the static-pinned snapshot; capos-wasm consumes wasmi only through its own =1.0.9 path dependency, which lands in capos-wasm/Cargo.lock and is covered there. The vendored lockfile is not separately gated; see vendor/wasmi-no_std/VENDORED_FROM.md for the refresh procedure and policy re-check. |
Cargo lockfiles pin exact crate versions and crates.io checksums, so ordinary crate upgrades are review-visible through lockfile diffs. They do not, by themselves, define whether a dependency is acceptable for kernel/no_std use, whether multiple lockfiles must converge, or whether advisories/licenses block the build.
Security Verification Track S.10.3 policy gate:
-
deny.tomldefines the shared license, advisory, ban, and source baseline. -
The allowed license set is intentionally limited to permissive licenses used by current locked dependencies.
BSD-3-Clauseis accepted for the Argon2 credential-validation dependency closure (subtlethroughpassword-hash,digest, andblake2); it is OSI-approved, FSF-free, and carries only the standard non-endorsement clause beyond the already-allowedBSD-2-Clause.0BSDis accepted for the smoltcp networking dependency closure (smoltcpandmanaged); it is OSI-approved and carries no attribution or non-endorsement condition beyond the existing permissive-license baseline. -
make dependency-policy-checkrunscargo deny checkon the root workspace,init,demos,tools/mkmanifest,tools/ringtap-viewer,capos-rt,shell, andfuzz. -
The same target runs
cargo audit --deny warningson every checked-in lockfile, with one explicit audit ignore:RUSTSEC-2026-0173(proc-macro-error2unmaintained warning). The ignored path is pulled into lockfiles throughsmoltcp’s optionaldefmtlogging feature; capOS does not enabledefmtforsmoltcp, butcargo auditscans lockfiles rather than the target feature set. Remove the ignore when upstreamsmoltcp/defmtno longer resolves that crate. -
The same target copies
package.jsonandpackage-lock.jsoninto a private temporary directory and runs a dry-run install there:PUPPETEER_SKIP_DOWNLOAD=1 npm ci --ignore-scripts --dry-runThat preserves the
npm cipackage/lock synchronization check without modifying the worktree install. It also runsnpm audit --package-lock-only --audit-level=high. Lifecycle scripts stay disabled for the docs dependency install path; the browser used by Mermaid PDF rendering is an explicit host executable selected byMERMAID_BROWSER_BIN. -
capos-configkeeps Argon2 behind thecredential-validationfeature. Bootstrap/config validation remains available in the baseline feature set, while validators that need to parse PHC credential strings enable the feature. Runtime clients and inspection tools that only need ring/schema/CapSet data use the baseline feature set. -
Local packages are marked
publish = falseso cargo-deny treats them as private, and local path dependencies includeversion = "0.1.0"so registry wildcard requirements can remain denied. -
CI installs pinned
cargo-deny 0.19.4andcargo-audit 0.22.1and runs the target.
Remaining dependency-policy gap: decide whether standalone lockfiles may
intentionally drift from the root lockfile, especially for capnp and
allocator crates used by userspace.
Cap’n Proto Compiler, Runtime, and Generated Bindings
The trusted Cap’n Proto inputs are:
schema/capos.capnp, the source schema.- Repo-local pinned
capnp, invoked through thecapnpcRust build dependency viaCAPOS_CAPNP. capnpruntime crate withdefault-features = falseandalloc.capnpccodegen crate.- Generated
capos_capnp.rswritten to CargoOUT_DIR. - Local no_std patching applied after generation by
tools/capnp-build.
capos-config/build.rs delegates schema generation to tools/capnp-build.
That shared helper runs capnpc::CompilerCommand over
schema/capos.capnp, reads the generated capos_capnp.rs, asserts that the
expected #![allow(unused_variables)] anchor is present, and injects:
#![allow(unused)]
#![allow(unused_imports)]
fn main() {
use ::alloc::boxed::Box;
use ::alloc::string::ToString;
}
The generated code used by builds is included from OUT_DIR in
capos-config/src/lib.rs:10-12. The expected patched output is checked in as
tools/generated/capos_capnp.rs, so schema, compiler, capnpc crate, and
patch-output changes must update that baseline and become review-visible as a
source diff.
Security Verification Track S.10.2 generated-code drift check:
make generated-code-checkfirst builds the checked-in init ELF required by kernel build-script validation, exports its absolute path asCAPOS_INIT_ELF, and runstools/check-generated-capnp.sh,tools/check-generated-adventure-content.sh, andtools/check-generated-paperclips-content.sh.- The script invokes the actual Cargo build-script path for
capos-configin an isolated target directory, so it checks the generated artifact that crate would include fromOUT_DIR. - During that build,
tools/capnp-buildalso copies the patched binding to a deterministic package-scoped path under the isolated target directory. The checker consumes those explicit paths rather than searching Cargo’s hashed build-script output directories. - The script verifies that the patched file still contains the capnpc anchor
plus the local no_std patch imports, compares the output against
tools/generated/capos_capnp.rs, and fails if a kernel-generated output path appears in the isolated target directory. - Any intentional schema/codegen/patch change must update the checked-in baseline in the same review, making generated output drift review-visible.
make checkrunsfmt-checkplusgenerated-code-checkfor a single local or CI entry point.- Current pinned compiler source is
capnproto-c++-1.2.0.tar.gzfromhttps://capnproto.org/with SHA-256ed00e44ecbbda5186bc78a41ba64a8dc4a861b5f8d4e822959b0144ae6fd42ef. The checked-intools/generated/capos_capnp.rsbaseline must be regenerated with that compiler when schema or codegen behavior intentionally changes. The current pinned baseline SHA-256 is5ab84731324fe9cc984d7aba7dd97963a773800cc52c4c1693fcb6bb448329a6.
Adventure content generation uses:
demos/adventure-content/content/prototype.cueas the checked-in source.tools/adventure-content-gen, a standalone Cargo host tool withtools/adventure-content-gen/Cargo.lock.demos/adventure-content/src/generated.rsas the checked-in generated no_std Rust baseline consumed bydemos/adventure-content/src/lib.rs.tools/check-generated-adventure-content.sh, which derives the same$(CAPOS_TOOLS_ROOT)/cue/0.16.0/bin/cuepath as the Makefile, rejects a mismatchedCAPOS_CUE, checkscue version v0.16.0, exports explicit JSON, runs the generator withcargo run --locked, formats the output withrustfmt --edition 2024, and fails if the result differs fromdemos/adventure-content/src/generated.rs.
Any intentional content-source or generator change must update the checked-in
generated Rust baseline in the same review. The generator manifest and
lockfile are included in make dependency-policy-check.
The no_std patch source is single-owned by tools/capnp-build;
capos-config/build.rs emits its crate-specific rerun directives and calls the
helper.
Alloy Analyzer (DMA Assurance Model)
The DMA assurance Alloy model (models/dma/dma_authority.als) is checked by a
pinned Alloy Analyzer, the same trust level as the Limine, capnp, CUE, Typst,
and uv pins.
- Pinned artifact:
alloy-6.2.0-linux-amd64.tar.gzfrom the officialAlloyTools/org.alloytools.alloyGitHub releasev6.2.0(https://github.com/AlloyTools/org.alloytools.alloy/releases/download/v6.2.0/alloy-6.2.0-linux-amd64.tar.gz), SHA-2565a5494a4bac6e243e471590bb44a91e25a35794a5af1ae1f332be30b9c54a9e7. This is the self-contained linux/amd64 app image: it bundles a Temurin JRE underlib/runtime/and the native SAT solver libraries, so the gate needs no host JVM and pins the analyzer and its runtime by one hash. Theorg.alloytools.alloy.dist.jar(bare jar, host-JVM dependent) is deliberately not used. - Where the pin lives:
MakefileALLOY_VERSION/ALLOY_PLATFORM/ALLOY_TARBALL_URL/ALLOY_TARBALL_SHA256.make alloy-ensuredownloads the tarball (curl with retry), verifies the SHA-256, extracts the app image into$(CAPOS_TOOLS_ROOT)/alloy/6.2.0/(shared per-user cache, default$HOME/.capos-tools), and confirms the launcher reports version6.2.0. The jar is not vendored into the repository. - Drift review:
make model-dma-alloyre-verifies the tarball SHA-256 and the reported launcher version on every run before invoking the model. A bump is aMakefileALLOY_VERSION+ALLOY_TARBALL_SHA256diff plus a refreshed checked-result record inmodels/dma/README.md. - Why output is parsed, not exit-code-gated: the Alloy CLI
execsubcommand always exits 0; acheckthat finds a counterexample, arunthat finds no instance, and a syntax/resolution error all return success with the failure visible only in the printed verdict table.tools/run-dma-alloy-model.shparses that table and fails closed on anycheckthat is notUNSAT, anyrunthat is notSAT, or any analyzer error marker. - Platform/CI scope: the pinned app image is linux/amd64 (the dev/CI host
architecture). GitHub CI runs
make model-dma-alloyin thedma-assurance-modelsjob onubuntu-24.04. Other architectures would need the matching Alloy app image (or the bare jar plus a host JVM). Ownership of the Alloy pin is shared with the scheduler lease model track (scheduler-cpu-isolation-lease-authority-model).
TLC Model Checker (DMA Assurance Lifecycle Model)
The DMA assurance TLA+ lifecycle model (models/dma/dma_authority.tla) is checked
by a pinned TLC, the same trust level as the Limine, capnp, CUE, Typst, uv, and
Alloy pins. Unlike the self-contained Alloy app image, tla2tools.jar is a bare
Java jar, so a JVM is pinned alongside it.
- Pinned artifacts:
tla2tools.jarfrom the officialtlaplus/tlaplusGitHub releasev1.7.4(TLC 2.19),https://github.com/tlaplus/tlaplus/releases/download/v1.7.4/tla2tools.jar, SHA-256936a262061c914694dfd669a543be24573c45d5aa0ff20a8b96b23d01e050e88; and a Temurin JRE17.0.19+10linux/x64 tarball (OpenJDK17U-jre_x64_linux_hotspot_17.0.19_10.tar.gzfrom theadoptium/temurin17-binariesrelease), SHA-256adb5a2364baa51de1ef91bb9911f5a61d24b045fe1d6647cb8050272a3a8ee75. Pinning the JRE as well as the jar fixes both the checker and its runtime by hash. - Where the pin lives:
MakefileTLA_TOOLS_VERSION/TLA_TOOLS_JAR_URL/TLA_TOOLS_JAR_SHA256/TLA_JRE_URL/TLA_JRE_SHA256.make tla-ensuredownloads both (curl with retry), verifies their SHA-256, extracts the JRE into$(CAPOS_TOOLS_ROOT)/tla/jre/and places the jar at$(CAPOS_TOOLS_ROOT)/tla/1.7.4/tla2tools.jar(shared per-user cache, default$HOME/.capos-tools), and confirms the launcher reports17.0.19. Neither is vendored into the repository. - Drift review:
make model-dma-tlare-verifies the jar SHA-256 and the JRE launcher version on every run before invoking the model. A bump is aMakefilepin diff plus a refreshed checked-result record inmodels/dma/README.md. - Why output is parsed and exit-code-gated: TLC returns a non-zero exit
code (12) on an invariant violation, a deadlock, or a parse/semantic error, but
tools/run-dma-tla-model.shadditionally asserts theModel checking completed. No error has been found.marker and rejects any violation/error marker, so a future TLC behaviour change cannot turn a violation into a green gate. The model is checked with deadlock detection enabled; the spec provides an explicit terminating self-loop for the all-pages-parked state, so any other stuck state is a genuine modelling gap. - Platform/CI scope: the pinned JRE tarball is linux/x64 (the dev/CI host
architecture). GitHub CI runs
make model-dma-tlain thedma-assurance-modelsjob onubuntu-24.04; other architectures would need the matching Temurin JRE. Ownership of the TLC pin is shared by the scheduler/IRQ TLA+ model tracks (scheduler-nohz-activation-model,irq-msix-waiter-determinism-model).
Cargo Build Scripts
Build scripts currently do these trusted operations:
| Script | Behavior |
|---|---|
kernel/build.rs | Watches kernel/linker-x86_64.ld and itself. |
capos-config/build.rs | Calls tools/capnp-build to watch schema/capos.capnp, generate bindings, and apply the shared no_std patch. Checked by make generated-code-check. |
tools/capnp-build/src/lib.rs | Host build-support helper for pinned capnp path validation, schema generation, and no_std generated-binding patching. Unit tests cover patch injection and missing-anchor rejection. |
tools/adventure-content-gen/src/main.rs | Host generator for the prototype adventure CUE source. Checked by make generated-code-check through tools/check-generated-adventure-content.sh, which uses pinned CUE and locked Cargo dependencies. |
init/build.rs | Emits a linker script argument for init/linker.ld. |
demos/*/build.rs | Emits a linker script argument for demos/linker.ld. |
capos-rt/build.rs | Emits a linker script argument for capos-rt/linker.ld when building current target_os = "none" userspace or custom-target target_os = "capos" probes. |
capos-wasm/build.rs | Emits a linker script argument for capos-wasm/linker.ld (Phase W.2 onward; uses cargo:rustc-link-arg-bins so the script applies only to the wasm-host bin and not the lib). |
The linker build scripts derive CARGO_MANIFEST_DIR from Cargo and only emit
link arguments plus rerun directives. The capnp build scripts read and rewrite
generated code under OUT_DIR. None of these scripts fetch network resources.
Security Verification Track S.10.2 coverage: make generated-code-check
exercises the canonical capos-config capnp build script through Cargo,
validates the patched generated file, fails if kernel-generated output
reappears, and fails if the canonical output no longer matches the checked-in
generated baseline.
Manifest, Embedded Binaries, and Downloaded Artifacts
system.cue declares named binaries and services. Makefile builds
manifest.bin by running tools/mkmanifest on the host. mkmanifest runs:
- Resolve the pinned CUE compiler from
$(CAPOS_TOOLS_ROOT), reject missing or mismatchedCAPOS_CUE, checkcue version v0.16.0, then runcue export system.cue --out jsonor package-mode equivalent. - JSON-to-
CueValueconversion and manifest validation (tools/mkmanifest/src/lib.rs). - Binary embedding from relative paths (
tools/mkmanifest/src/lib.rs). - Binary-reference validation and Cap’n Proto serialization
(
tools/mkmanifest/src/main.rs).
The adjacent mkmanifest cue-to-capnp subcommand uses the same pinned CUE
export path but does not parse the result as SystemManifest. Instead, it
resolves and validates CAPOS_CAPNP, checks Cap'n Proto version 1.2.0, and
passes the exported JSON to Cap’n Proto:
capnp convert json:binary <schema.capnp> <RootType>
It is the supported schema-aware path for CUE-authored data messages rooted at arbitrary specified Cap’n Proto structs; live capabilities and interface objects are outside that data-file contract.
Path handling rejects absolute paths, parent traversal, non-normal components,
and canonicalized paths that escape the manifest directory
(tools/mkmanifest/src/lib.rs). The generated manifest.bin is copied
into the ISO as /boot/manifest.bin and loaded by Limine via
limine.conf:5.
Downloaded or generated artifacts in the current build:
| Artifact | Producer | Pinning/drift status |
|---|---|---|
$(LIMINE_DIR) checkout ($(CAPOS_TOOLS_ROOT)/limine/<commit>) | git clone/git fetch in the limine-ensure recipe | Commit-pinned and artifact-verified. |
| Cargo registry crates | cargo build, cargo run, tests, fuzz | Lockfile-pinned checksums plus CI-enforced deny/audit checks through make dependency-policy-check. |
| Node registry packages | npm ci --ignore-scripts for docs Mermaid rendering | package-lock.json pins package tarball integrity. Lifecycle scripts are disabled, Puppeteer’s browser download path is skipped, and make dependency-policy-check enforces the npm ci package/lock synchronization invariant plus high-severity npm audit state. |
| Chromium/Chrome for Mermaid PDF rendering | Host executable selected by MERMAID_BROWSER_BIN or auto-detected from chromium-browser, chromium, google-chrome-stable, or google-chrome | Host-provided browser, not repo-pinned. The docs PDF target fails closed if no executable is available and passes the selected path to Puppeteer as PUPPETEER_EXECUTABLE_PATH, rather than allowing Puppeteer’s npm install script to download an implicit browser artifact. |
Rust toolchain, targets, and rust-src | rustup from rust-toolchain.toml when absent | Date-pinned nightly-2026-04-20 channel; rust-src is declared for custom-target -Zbuild-std userspace builds. The advance procedure for bumping the pin lives in the Rust Toolchain section above. |
target/ kernel and host artifacts | Cargo | Generated, not checked in. |
init/target/, demos/target/, capos-rt/target/, capos-wasm/target/ ELFs | Cargo standalone builds | Generated, embedded into manifest.bin where referenced; make build-provenance records hashes for embedded file-backed and inline payloads. |
target/x86_64-unknown-capos/, init/target/x86_64-unknown-capos/, demos/target/x86_64-unknown-capos/, shell/target/x86_64-unknown-capos/, capos-rt/target/x86_64-unknown-capos/, libcapos/target/x86_64-unknown-capos/, libcapos-posix/target/x86_64-unknown-capos/, and capos-wasm/target/x86_64-unknown-capos/ userspace artifacts | Cargo aliases using targets/x86_64-unknown-capos.json | Generated artifacts for booted userspace manifests, the capos-rt smoke binary, the wasm-host Phase W.2 binary, and the libcapos / libcapos-posix C-substrate staticlibs. |
manifest.bin | tools/mkmanifest | Generated from system.cue plus ELF payloads; not checked in. Hash is recorded by make build-provenance. |
iso_root/ and capos.iso | Makefile, xorriso, Limine installer | Generated and gitignored; Limine inputs verified. Final ISO hash is recorded by make build-provenance. |
target/build-provenance.txt | tools/build-provenance.sh via make build-provenance | Generated and gitignored; records runner OS/kernel/architecture identity, GitHub Actions image identity when present, Rust toolchain details, selected executable paths, package identities when discoverable, OVMF selected path/package/absence state, tool versions, git commit, manifest/ISO/kernel/OVMF hashes, and embedded payload origin plus hashes. CI publishes the artifact as build-provenance-<sha> on every qemu-smoke run (30-day retention). On pull_request events the qemu-smoke job locates the most recent successful main-branch build-provenance-<sha> artifact, downloads it via actions/download-artifact, and runs make build-provenance-compare BUILD_PROVENANCE_COMPARE_POLICY=ci-environment against the candidate record as a blocking PR gate. |
Remaining gaps for Security Verification Track S.10.2/S.10.3:
- CI now publishes
target/build-provenance.txtas a named artifact on everyqemu-smokerun (30-day retention) and, onpull_requestevents, downloads the most recent successful main-branchbuild-provenance-<sha>artifact and runsmake build-provenance-compareagainst the candidate record withBUILD_PROVENANCE_COMPARE_POLICY=ci-environment. The compare step is PR-blocking for runner/tool/Rust/OVMF environment drift and fails when the base artifact cannot be found. qemu-smokeapt-pinsqemu-system-x86,xorriso,make,git, andovmf, andmake build-provenancerecords normalized package identity where the runner exposes it. Repo-pinned digests (download-and-verify rather than apt-installed packages) forqemu-system-x86,xorriso,make, andgit, or an immutable runner image digest containing that package set, remain future production-reproducibility hardening tracked indocs/design-risks-register.md(R13).- Decide whether CI should record the pinned
cue exportJSON or finalmanifest.binbytes if manifest reproducibility becomes release-critical.
Vendored Upstream Snapshots
The repository carries static, pinned snapshots of selected upstream sources
under vendor/. Each snapshot has its own VENDORED_FROM.md recording the
upstream URL, tag/version, commit SHA, commit date, vendoring date, license,
vendoring posture, and refresh procedure. Snapshots are kept byte-identical to
their pinned upstream artifact (git commit, or the crates.io published crate as
noted below); the only non-upstream changes permitted without a patches/
unified diff are the documented integration-only empty-[workspace] marker and
build-inert files restored from the same upstream commit when the publish
include omitted them (for rustls-webpki, src/test_utils.rs and
rustfmt.toml, both recorded in its VENDORED_FROM.md). Any future functional
patch must be recorded as a unified diff under the snapshot’s patches/
directory plus a Patches entry per the procedure in the snapshot’s
VENDORED_FROM.md.
| Snapshot | Upstream | Tag/Version | Commit SHA | License | Consumer |
|---|---|---|---|---|---|
vendor/wasmi-no_std/wasmi-1.0.9/ | https://github.com/wasmi-labs/wasmi | v1.0.9 | 61ba65e6563d8b2f5b699b018349d3330b28b9f3 | Apache-2.0 OR MIT (dual) | capos-wasm/ (WASI host adapter wasm-host bin and Preview 1 import surface) |
vendor/dns-c-wahern/src/ | https://github.com/wahern/dns | rel-20160808 | 4ec718a77633c5a02fb77883387d1e7604750251 | MIT | POSIX adapter Phase P1.2 Phase B DNS smoke; not yet on the v0 build path (the smoke compiles only demos/posix-dns-resolver/main.c with a commented-out dns.h include) |
vendor/rustls-webpki/rustls-webpki-0.103.13/ | https://github.com/rustls/webpki | 0.103.13 (crates.io crate) | 2879b2ce7a476181ac3050f73fe0835f04728e86 | ISC | capos-tls/ Phase-1 verifier (WebPKI X.509 path building + signature verification, no_std + alloc, no crypto provider in the default build) |
vendor/webpki-roots/webpki-roots-1.0.7/ | https://github.com/rustls/webpki-roots | 1.0.7 (crates.io crate) | be948464fd5907af6227213a066743a161221688 | CDLA-Permissive-2.0 | capos-tls/ Phase-1 trust-anchor bootstrap (compiled-in Mozilla NSS root bundle, no_std) |
vendor/embedded-tls/embedded-tls-0.19.0/ | https://github.com/drogue-iot/embedded-tls | 0.19.0 (crates.io crate) | 865e1fd983c583228e3bbeb9f4996f1abc454ca3 | Apache-2.0 | demos/cloud-tls-client-handshake-smoke/ TLS 1.3 client state machine (no_std + alloc, default std/tokio features disabled, rustpki enabled) |
The rustls-webpki and webpki-roots snapshots use a published-crate posture
distinct from the git-clone snapshots above: each is the crates.io published
.crate artifact, SHA-256-verified against the crates.io index
(61c429a8…f756e and 52f5ee44…2eb9d respectively), with the upstream commit
recorded from the artifact’s embedded .cargo_vcs_info.json. For
rustls-webpki, two build-inert files the publish include omitted
(src/test_utils.rs, a #[cfg(test)] module, and rustfmt.toml) are restored
from the same upstream commit so cargo fmt resolves the module tree and
formats the snapshot under upstream’s own style config; see its
VENDORED_FROM.md. capos-tls/
(a root-workspace member and the Phase-1 host verifier crate) depends on both as
exact-pinned path dependencies (version = "=0.103.13" / version = "=1.0.7")
and forces them to link for x86_64-unknown-none under cargo build and
cargo build --features qemu. rustls-webpki is selected with
default-features = false, features = ["alloc"], so neither std nor a
ring / aws-lc-rs crypto provider is compiled; the active compiled closure is
rustls-pki-types (alloc, pulling zeroize) and untrusted (ISC), all
pinned in the root Cargo.lock and covered by make dependency-policy-check.
The ring optional dependency appears in Cargo.lock as an unselected optional
entry; aws-lc-rs is a feature-gated optional / dev-only dependency and does
not resolve into the root lockfile at all. Neither is ever feature-activated, so
no crypto provider is compiled and cargo deny does not evaluate ring
(cargo tree -p capos-tls -e features activates only rustls-pki-types,
zeroize, untrusted, webpki-roots, and rustls-webpki[alloc]). The
webpki-ring feature is host-test-only and supplies the signature algorithms
for cargo test-tls; the default bare-metal build remains provider-free.
The embedded-tls snapshot uses the same published-crate posture: the vendored
tree is the crates.io 0.19.0 package with .cargo_vcs_info.json recording
upstream commit 865e1fd983c583228e3bbeb9f4996f1abc454ca3. The local
handshake smoke depends on it with an exact path pin, disables default std and
tokio, and enables only rustpki so the TLS 1.3 client path can run under
target_os = "capos" over a TcpSocket cap. The empty [workspace] marker is
the only local integration change.
capos-wasm/Cargo.toml pins the wasmi path dependency to version = "=1.0.9" so cargo-deny’s wildcards gate continues to pass; the snapshot is
exercised by make capos-wasm-build, every make run-wasi-* smoke, and
make dependency-policy-check (cargo-deny + cargo-audit on
capos-wasm/Cargo.lock). Refreshing wasmi to a newer tag requires the rsync
pattern, manifest pin bump, lockfile regeneration, and policy re-check
recorded in vendor/wasmi-no_std/VENDORED_FROM.md.
The dns.c snapshot is intentionally a strict subset (only src/dns.c,
src/dns.h, LICENSE, and README.md); ancillary upstream files
(cache, mem, spf, zone, regress) are excluded because the v0
build path does not need them. Future POSIX-adapter phases that widen
libcapos-posix enough to compile dns.c whole will start consuming the
snapshot in the build instead of carrying it as a documentation-only
reference.
vendor/dash/ is not present at this revision. If a future POSIX-adapter
phase imports dash, add a new row above plus a vendor/dash/VENDORED_FROM.md
recording the same provenance fields.
Host Tools
Current local host versions observed during this inventory:
| Tool | Observed version | Build role |
|---|---|---|
capnp | 1.2.0 | Repo-selected schema compiler built by make capnp-ensure from a SHA-256-pinned official source tarball into $(CAPOS_TOOLS_ROOT). |
cue | v0.16.0 | Repo-selected manifest compiler installed by make cue-ensure into $(CAPOS_TOOLS_ROOT) from the SHA-256-verified official release binary. |
qemu-system-x86_64 | 10.2.2 | Boot verification via make run and make run-uefi. |
xorriso | 1.5.8 | ISO generation. |
make | 4.4.1 | Build orchestration. |
git | 2.53.0 | Limine checkout/fetch and review workflow. |
These are local environment observations, not repository pins. On the
qemu-smoke CI runner, qemu-system-x86, xorriso, make, and git are
apt-pinned to
qemu-system-x86=1:8.2.2+ds-0ubuntu1.16 (amd64,
noble-updates/main or noble-security/main, Ubuntu 24.04),
xorriso=1:1.5.6-1.1ubuntu3 (amd64, noble/main, Ubuntu 24.04),
make=4.3-4.1build2 (amd64, noble/main, Ubuntu 24.04), and
git=1:2.43.0-1ubuntu7.3 (amd64, noble-updates/main or
noble-security/main, Ubuntu 24.04) by the “Install boot smoke dependencies”
step; the per-run identity for each is captured via dpkg-query and normalized
apt source pockets by tools/build-provenance.sh. The bump procedure mirrors
OVMF: run apt-cache madison <tool> on a current Ubuntu 24.04 host, pick the
highest stable version from noble-updates/main (or noble/main when no
noble-updates entry exists, as is the case for xorriso and make today),
and update the pinned version string in .github/workflows/ci.yml plus this
row.
make run-uefi selects an OVMF firmware blob from OVMF_CODE_CANDIDATES in
Makefile:96-97; the Makefile pins the expected blob via OVMF_CODE_SHA256
and the ovmf-verify target enforces the match before ISO and cloud-disk
construction. On the qemu-smoke CI runner the ovmf=2024.02-2ubuntu0.8
apt-install resolves the first candidate (/usr/share/ovmf/OVMF.fd),
ovmf-verify succeeds against the pinned digest, and make build-provenance
records the resulting firmware-blob SHA-256 per run. Build hosts without any
OVMF candidate installed see an ovmf-verify NOTICE skip rather than a
failure, so research workflows that never invoke make run-uefi continue to
build the ISO unchanged.
Remaining gap for Security Verification Track S.10.3: decide whether full
production reproducibility uses an immutable runner image digest, repo-managed
download-and-verify tool digests for the apt-pinned build/boot tools, or both.
build-essential, curl, sha256sum, the shell, and the checkout-time git
used by actions/checkout remain runner-provided; the PR-blocking provenance
gate records and compares the post-checkout build environment, but it does not
turn the mutable ubuntu-24.04 runner label into an immutable production image.
GitHub Actions Runner and Workflow Pinning
The CI harness in .github/workflows/ci.yml is itself a supply-chain input:
its identities determine which third-party code runs against every push and
pull request, and the chosen runner image determines the host package set
underneath every host-baseline, Kani, and optional QEMU job. Mutable @v<N>
or @master references on third-party Actions would allow upstream owners to
swap out the executed code at any time without a repository diff, and
ubuntu-latest would silently roll the runner OS when GitHub re-points it.
The current policy is to pin every third-party Action to a 40-character
commit SHA and to pin the runner OS to a specific release rather than the
floating label. Each pinned uses: line carries a trailing # v<X.Y.Z>
comment so reviewers and bump PRs can read the intended release without
following the SHA through the GitHub UI.
| Identity | Pinned reference | Notes |
|---|---|---|
runs-on: runner image | ubuntu-24.04 | Replaces ubuntu-latest; applied to host-baseline, kani-proofs, dma-assurance-models, and qemu-smoke. GitHub-hosted ImageOS and ImageVersion are recorded in target/build-provenance.txt when present and are compared by the PR-blocking CI environment policy. Bump only when the next LTS is needed and the full make check plus QEMU smokes are reverified against the new image. |
actions/checkout | 34e114876b0b11c390a56381ad16ebd13914f8d5 # v4.3.1 | Resolved from the actions/checkout v4 major-version tag. |
swatinem/rust-cache | c19371144df3bb44fab255c43d04cbc2ab54d1c4 # v2.9.1 | Canonical Swatinem/rust-cache v2.9.1 release commit. The v2 major-tracking tag carries the same 2.9.1 message but points at a distinct republication commit; always dereference the exact release tag rather than the major tag. |
dtolnay/rust-toolchain | 3c5f7ea28cd621ae0bf5283f0e981fb97b8a7af9 # master @ 2026-03-27 | The upstream action does not publish numbered releases; its documented usage is @master. The pin is a snapshot of master at the dated commit. |
actions/upload-artifact | ea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2 | Resolved from the actions/upload-artifact v4.6.2 lightweight tag (same SHA as the moving v4 major tag at resolution time). Used in qemu-smoke to publish target/build-provenance.txt. |
actions/download-artifact | d3f86a106a0bac45b974a628896c90dbdf5c8093 # v4.3.0 | Resolved from the actions/download-artifact v4.3.0 release tag. Paired with actions/[email protected] (both v4 series) and used in qemu-smoke on pull_request events to fetch the most recent successful main-branch build-provenance-<sha> artifact for the blocking make build-provenance-compare BUILD_PROVENANCE_COMPARE_POLICY=ci-environment step. |
Bump procedure for any of the entries above:
- Resolve the candidate release to its commit SHA via the upstream
release tag, e.g.
gh api repos/<owner>/<repo>/git/ref/tags/v<X.Y.Z>(dereference any annotated tag throughgh api repos/<owner>/<repo>/git/tags/<sha>), orgh api repos/<owner>/<repo>/commits/<branch>for branch-tracked actions likedtolnay/rust-toolchain. Always dereference the exact release tag (vX.Y.Z) rather than the moving major-version tag (vX): major-version tags can be re-cut at a republication commit whose tag message still names the same release (as observed forswatinem/rust-cache@v2), so followingvXcan pin a different commit than the canonicalvX.Y.Zrelease. - Update both the SHA and the trailing
# v<X.Y.Z>comment in.github/workflows/ci.ymlso the reviewer sees the intended release. - Run
make fmt-checkandmake workflow-checklocally for the bump branch. Workflow hygiene plus YAML well-formedness must pass before review. The acceptance gate for the bump itself is a green CI run on the bump branch –make checkplus the existing QEMU smokes – which exercises the new Action versions end-to-end. - Treat any
master-branch SHA pin (currentlydtolnay/rust-toolchain) as a manual-bump dependency: the upstream action does not publish release tags, so bumping its SHA is the only way to absorb upstream fixes. Schedule those bumps explicitly rather than relying on a floating reference.
This pinning closes the mutable-tag supply-chain gap for the CI harness
itself. It does not by itself satisfy the “pinned runner image digest” line
of the Build Provenance Retention And Comparison Policy: ubuntu-24.04 is
still a label managed by GitHub Actions, not an immutable image digest. The
current documented equivalent for PR gating is to retain the GitHub-hosted
ImageOS/ImageVersion fields in target/build-provenance.txt and compare
them against the latest successful main-branch record. A future
production-hardening slice may move to a self-built runner image referenced by
digest, mirror the build-tool packages, or both.
Inventory Method
This inventory is based on source inspection, Cargo metadata, lockfile checks, and local host-tool version queries. Local host-tool versions are observations, not repository pins; the tables above distinguish enforced pins from observed environment state.
Useful commands for refreshing the inventory:
git status --short --branchrg -n "S\\.10|trusted|supply|Limine|limine|capnp|capnpc|QEMU|qemu|download|curl|git clone|wget|build\\.rs|rust-toolchain|Cargo\\.lock" ...rg --filescargo metadata --locked --format-version 1 --no-depsrg -n '^name = |^version = |^checksum = ' Cargo.lock init/Cargo.lock demos/Cargo.lock tools/mkmanifest/Cargo.lock tools/ringtap-viewer/Cargo.lock capos-rt/Cargo.lock shell/Cargo.lock libcapos/Cargo.lock libcapos-posix/Cargo.lock capos-wasm/Cargo.lock fuzz/Cargo.lockcommand -v rustc cargo capnp cue qemu-system-x86_64 xorriso sha256sum git makerustc -Vv,cargo -V,capnp --version,cue version,qemu-system-x86_64 --version,xorriso -version,make --version,git --version
Panic-Surface Inventory
Scope: panic!, assert!, debug_assert!, .unwrap(), .expect(),
todo!, and unreachable! surfaces relevant to boot manifest loading, ELF
loading, SQE handling, params/result buffers, IPC, and future spawn inputs.
Classification terms:
trusted-internal: depends on kernel/shared-code invariants, static ABI layout, or host build/test code; not directly controlled by a service.boot-fatal: reached during boot/package setup before mutually untrusted services run. Bad platform/package state can halt the system.untrusted-input reachable: reachable from userspace-controlled SQEs, Cap’n Proto params/result buffers, IPC state, manifest/package data, or future spawn-controlled service/binary data.
Summary
No current panic!/assert!/unwrap()/expect() site found in the
kernel ring dispatch path directly consumes raw SQE fields or user
params/result-buffer pointers. Those paths mostly return CQE errors through
kernel/src/cap/ring.rs.
The remaining relevant surfaces are boot-fatal setup assumptions, scheduler internal invariants that would become more exposed once untrusted spawn/lifecycle inputs can create or destroy processes dynamically, and IPC rollback queue capacity assumptions.
Locations use path::function anchors rather than line numbers; line numbers
drift on every refactor. Grep the path plus the quoted surface text to
re-locate a site.
Manifest And Future Spawn Inputs
| Location | Surface | Reachability | Classification | Notes |
|---|---|---|---|---|
kernel/src/main.rs run_init | MODULES.response().expect("no modules from bootloader") | Boot package/module table | boot-fatal | Missing Limine modules abort before manifest validation. |
kernel/src/main.rs run_init | elf_cache.get(service.binary.as_str()).ok_or_else(...) | Manifest service binary reference | untrusted-input reachable, controlled error | Not a panic surface. Included because it is the future spawn shape to preserve: unknown or unparsed binaries return an error. |
kernel/src/spawn.rs spawn_service | Process::new(...).map_err(...) | Manifest-spawned process creation | untrusted-input reachable, controlled error | Current boot path converts allocation/mapping failures into boot errors. Future ProcessSpawner should keep this shape instead of adding unwraps. |
ELF Inputs
| Location | Surface | Reachability | Classification | Notes |
|---|---|---|---|---|
kernel/src/spawn.rs load_elf | debug_assert!(stack_top % 16 == 0, ...) | ELF load path | trusted-internal | Constant stack layout invariant, not ELF-controlled. |
kernel/src/spawn.rs align_up | debug_assert!(align.is_power_of_two()) | TLS mapping from parsed ELF | trusted-internal | elf::parse rejects non-power-of-two TLS alignment; load_tls also caps the size before calling align_up. |
capos-lib/src/elf.rs parser | no runtime panic surfaces outside tests/Kani | Boot manifest ELF bytes; future spawn ELF bytes | untrusted-input reachable, controlled error | Parser uses checked offsets/ranges and returns Err(&'static str). Test-only assertions/unwraps are excluded from runtime classification. |
kernel/src/spawn.rs load_elf | slice init_data[src_offset..] | Parsed ELF PT_LOAD file range | untrusted-input reachable, guarded | Not matched by the panic-token grep, but it is an index panic candidate if parser invariants are bypassed. elf::parse checks segment file ranges before load_elf. |
kernel/src/spawn.rs load_tls | slice &init_data[init_start..init_end] | Parsed ELF TLS file range | untrusted-input reachable, guarded | Not matched by the panic-token grep, but it is an index panic candidate if parser invariants are bypassed. elf::parse checks TLS file bounds before load_tls. |
SQE And Params/Result Buffers
| Location | Surface | Reachability | Classification | Notes |
|---|---|---|---|---|
kernel/src/cap/ring.rs process_ring / dispatch_call / dispatch_recv / dispatch_return | no matched panic-like surfaces | Userspace SQEs, params, result buffers | untrusted-input reachable, controlled error | SQ corruption, unsupported fields/opcodes, oversized buffers, invalid user buffers, and CQ pressure return transport errors or defer consumption. |
capos-config/src/ring.rs const _: () = assert!(...) ABI size checks | const assert! layout checks | Shared ring ABI | trusted-internal | Compile-time ABI guard; not runtime input reachable. |
capos-config/src/capset.rs const _: () = assert!(...) ABI size checks | const assert! layout checks | Shared CapSet ABI | trusted-internal | Compile-time ABI/page-fit guard; not runtime input reachable. |
capos-lib/src/frame_bitmap.rs (alloc_frame and alloc_contiguous) | .try_into().unwrap() on 8-byte bitmap windows | Frame allocation, including work triggered by manifest/process creation and capability methods | trusted-internal | Guarded by frame + 64 <= total or i + 64 <= to, assuming the caller-provided bitmap covers total_frames. Kernel constructs that bitmap at boot. |
IPC
| Location | Surface | Reachability | Classification | Notes |
|---|---|---|---|---|
kernel/src/cap/endpoint.rs Endpoint::endpoint_call | pending receive pop on CALL delivery | Cross-process CALL delivered to pending RECV | untrusted-input reachable, controlled error | The former guarded pending_recvs.pop_front().unwrap() now returns a failed capnp error if the queue is inconsistent. Endpoint pending-RECV exhaustion has QEMU coverage in endpoint-roundtrip. |
kernel/src/cap/endpoint.rs endpoint_restore_recv_front | rollback push_front growth | IPC rollback path | untrusted-input reachable, controlled error | CALL delivery reserves the popped pending-RECV slot until rollback restores the RECV or receiver completion releases the reservation, so concurrent receives cannot consume rollback capacity. Recovery helpers resolve the original endpoint object through revoked cap epochs and wrapper recovery methods bypass liveness checks, without reopening ordinary CALL/RECV/RETURN authority. If restore still fails after reaching the endpoint, the ring path posts or defers an explicit receiver cancellation instead of silently dropping the popped RECV. endpoint-roundtrip includes QEMU coverage for same-process CQ-pressure rollback with both available and saturated pending-RECV capacity, then consuming the restored undersized RECV through the controlled receiver-error path; capos-lib host coverage checks revoked-cap recovery lookup. |
Scheduler And Process Lifecycle
| Location | Surface | Reachability | Classification | Notes |
|---|---|---|---|---|
kernel/src/sched.rs register_idle_process_locked | Process::new_idle().expect("failed to create idle process") | Boot scheduler init (sched_init, slot 0) and lazy per-CPU registration (current_cpu_idle_thread_locked) | boot-fatal at slot 0; per-CPU-fatal on first AP idle | Synthetic idle Process creation OOM panics. There is no fallback idle path after the user-mode idle process removal, so this panic is the deliberate unrecoverable-OOM behavior. |
kernel/src/sched.rs sched_init | CPL0 idle kernel stack .expect, idle-context registry try_reserve_exact().expect, per-CPU CpuContext Box::try_new panic! | Boot scheduler init | boot-fatal | CPL0 idle-context infrastructure OOM panics before services run. Same rationale as the synthetic idle records: no fallback idle path exists, so the failure is deliberately unrecoverable. |
kernel/src/sched.rs block_current_on_cap_enter | current.expect, idle assert!, process-table expect | cap_enter(min_complete > 0) path | untrusted-input reachable, internal invariant | Userspace can request blocking, but these unwraps assert scheduler state, not user values. Future process lifecycle/spawn changes increase this exposure. |
kernel/src/sched.rs capos_block_current_syscall | current.expect, idle assert!, table expect, panic! if not blocked | Blocking syscall continuation | untrusted-input reachable, internal invariant | Triggered after cap_enter chooses to block. User controls the request, but panic requires kernel state inconsistency. |
kernel/src/sched.rs run_queue references missing process expect (context-switch + start paths) | run-queue/process-table consistency | Scheduling after queue selection | trusted-internal now; future spawn/lifecycle sensitive | A stale run-queue PID panics. Dynamic spawn/exit must preserve run-queue/process-table invariants. |
kernel/src/sched.rs exit_current | current.expect, idle assert!, processes.remove(...).unwrap(), next-process unwrap() | Ambient exit syscall and future process exit | untrusted-input reachable, internal invariant | Any service can exit itself. Panic requires scheduler corruption or idle misuse, but future spawn/process APIs should harden this boundary. |
kernel/src/sched.rs current_ring_and_caps | current.expect, process-table expect | cap_enter flush path | untrusted-input reachable, internal invariant | User can call cap_enter; panic requires no current process or missing table entry. |
kernel/src/sched.rs start | initial run-queue expect, process-table unwrap, CR3 expect | Boot service start | boot-fatal | Manifest with zero services is rejected earlier, and process creation errors out; panics indicate scheduler/CR3 invariant breakage. |
kernel/src/arch/x86_64/context.rs timer context restore | CR3 expect("invalid CR3 from scheduler") | Timer interrupt scheduling | trusted-internal; future lifecycle sensitive | Scheduler should only return page-aligned CR3s from AddressSpace. |
Boot Platform And Memory Setup
| Location | Surface | Reachability | Classification | Notes |
|---|---|---|---|---|
kernel/src/main.rs kmain | assert!(BASE_REVISION.is_supported()) | Limine boot protocol | boot-fatal | Platform/bootloader contract check. |
kernel/src/main.rs kmain | memory-map and HHDM expect | Limine boot protocol | boot-fatal | Missing bootloader responses halt before untrusted services. |
kernel/src/main.rs kmain | cap::init().expect("failed to initialize kernel capabilities") | Kernel cap table bootstrap | boot-fatal | Fails on kernel-internal cap-table exhaustion. |
kernel/src/mem/frame.rs init | frame-bitmap region expect("no region large enough for frame bitmap") | Boot memory map | boot-fatal | Bad or too-small memory map halts. |
kernel/src/mem/frame.rs free_frame | try_free_frame(...).expect("free_frame failed") | Kernel-owned frame teardown | trusted-internal | Capability handlers use try_free_frame; this panic surface is for kernel-owned frames and rollback/Drop paths. |
kernel/src/mem/frame.rs HHDM cache helper | assert!(offset != 0, "frame allocator not initialized") | HHDM cache use before frame init | trusted-internal | Initialization-order invariant. |
kernel/src/mem/heap.rs init | alloc_contiguous(HEAP_FRAMES).expect("out of memory for heap") | Boot heap init | boot-fatal | Fails if the frame allocator cannot provide the fixed kernel heap. |
kernel/src/mem/paging.rs alloc_page_table_frame / kernel_pml4_frame / assert!(addr != 0, "paging not initialized") | page-alignment .unwrap() / paging initialized assert! | Kernel frame/page-table internals | trusted-internal | frame::alloc_frame returns page-aligned addresses. |
kernel/src/mem/paging.rs init_kernel_page_tables | kernel PML4 expect("failed to allocate kernel PML4"), page-lookup and map expects | Kernel page-table setup | boot-fatal | Assumes kernel image is mapped in bootloader tables and enough frames exist. |
kernel/src/arch/x86_64/syscall.rs init | STAR selector expect("invalid STAR segment configuration") | Syscall init | boot-fatal | GDT selector layout invariant. |
kernel/src/sched.rs context-switch / exit_current / start | CR3 expect("invalid CR3") | Context switch/exit/start | trusted-internal; future lifecycle sensitive | Scheduler should only carry page-aligned address-space roots. |
Audit Method
Candidate sites come from panic-token searches over runtime source plus manual review of nearby indexing and allocation paths on untrusted-input boundaries. The table excludes test-only assertions unless they enforce runtime ABI or layout contracts. Re-run the searches after code changes and classify new sites by reachability, not by token alone.
Search commands:
rg -n "\b(panic!|assert!|assert_eq!|assert_ne!|debug_assert!|debug_assert_eq!|debug_assert_ne!|unwrap\(|expect\(|todo!|unreachable!)" kernel capos-lib capos-config init demos tools schema system.cue Makefile docs -g '*.rs' -g '*.cue' -g '*.md' -g 'Makefile'
rg -n "\b(panic!|assert!|assert_eq!|assert_ne!|debug_assert!|debug_assert_eq!|debug_assert_ne!|unwrap\(|expect\(|todo!|unreachable!)" kernel/src capos-lib/src capos-config/src init/src demos/capos-demo-support/src demos/*/src tools/mkmanifest/src -g '*.rs'
DMA Isolation Design
Security Verification Track S.11 gates PCI, virtio, and later userspace device-driver work on an explicit DMA authority model. The immediate goal is narrow: let the kernel bring up a QEMU virtio-net smoke without creating a user-visible raw physical-memory escape hatch.
Short-Term Decision
Use kernel-owned bounce buffers for the first in-kernel QEMU virtio-net smoke.
The first virtio-net smoke stays on this conservative path:
- kernel-owned DMA pages
- kernel-owned virtqueue descriptor tables
- kernel-owned packet buffers
- kernel-programmed physical addresses
- copied packet bytes delivered to the network stack
- no DMA buffer capability exposed to userspace
- no physical address exposed to userspace
- no virtqueue pointer exposed to userspace
- no BAR mapping exposed to userspace
The kernel allocates DMA-capable pages from its own frame allocator, owns the virtqueue descriptor tables and packet buffers, programs the device with the corresponding physical addresses, and copies packet payloads between those buffers and the networking stack.
This is deliberately conservative:
- It works before ACPI/DMAR or AMD-Vi parsing, IOMMU page-table management, MSI/MSI-X routing, and userspace driver lifecycle supervision exist.
- It keeps all physical-address programming inside the kernel, where the same code that allocates the frames also bounds the descriptors that reference them.
- It does not make the current
FrameAllocatororMemoryObjectcapability part of the DMA path.FrameAllocatorno longer exposes raw physical addresses, but DMA still needs device-owned buffer objects with IOVA and reset/revoke semantics rather than repurposed general memory caps. - It gives the smoke a disposable implementation path. When NIC or block
drivers move to userspace, bounce-buffer authority becomes a typed
DMAPoolobject instead of an ad hoc physical-address grant.
An IOMMU-backed DMA-domain model remains the target for direct device access from mutually untrusted userspace drivers, but it is not a prerequisite for the first QEMU smoke. Without an IOMMU, a malicious bus-mastering device can still DMA to arbitrary RAM at the hardware level; the short-term smoke assumes QEMU-provided virtio hardware and protects against confused or untrusted userspace, not hostile hardware.
IOMMU Staging
IOMMU support is a deferred-with-known-dependency prerequisite for production hardware claims and for moving direct DMA-capable NIC or block drivers into userspace. capOS now discovers bounded ACPI IOMMU table summaries for Intel DMAR and AMD-Vi/IVRS and records static DMAR DRHD include-all or single-hop PCI endpoint device-scope coverage for retained DMA-capable PCI diagnostics functions. Bridge and multi-hop scopes are retained for diagnostics but do not prove endpoint attachment until PCI topology traversal exists, and include-all fallback fails closed when retained DRHD units or scopes are capped.
The selected QEMU Intel remapping path now programs VT-d root/context and
second-level tables for manager-owned DMAPool pages, reports bounded fault
state, exports only domain-scoped IOVAs, and proves two claimed DMA-capable
functions receive distinct per-device domains and second-level roots. It also
asserts the production-path S.11.2 hostile-smoke matrix over the active
DMAPool / DMABuffer ledger. The decomposed integration umbrella for this
path closed 2026-05-23 23:35 UTC
(ddf-iommu-remapping-production-closeout).
This is still QEMU-only evidence for the
selected path, not a general production
hardware-isolation claim: trusted sharing groups, AMD-Vi programming, and
production NIC/block userspace driver authority remain future work, and VM
shapes without usable remapping hardware remain on the explicit bounce-buffer
fallback.
The discovery parser is intentionally shallow and follows the static-table
formats documented by the Intel VT-d architecture specification, the AMD IOMMU
specification, and QEMU’s q35-only -device intel-iommu emulation:
Future real remapping work is grounded by the primary-source IOMMU remapping research note, which records Intel VT-d, AMD-Vi, and QEMU sections relevant to table programming, invalidation, fault/status diagnostics, and QEMU-only smoke tests. That note is source grounding only; it does not make the current diagnostics path a real remapping implementation.
- Intel VT-d architecture specification: https://www.intel.com/content/www/us/en/content-details/671081/intel-virtualization-technology-for-directed-i-o-architecture-specification.html
- AMD IOMMU specification: https://docs.amd.com/v/u/en-US/48882_IOMMU
- QEMU manpage: https://www.qemu.org/docs/master/system/qemu-manpage.html?highlight=numa
The staged implementation order is:
- Discover firmware IOMMU topology from ACPI static tables and fail closed if the tables are malformed, unsupported, or inconsistent with the PCI root complex being used. This first bounded table-discovery step is implemented for DMAR/IVRS summaries only; domain attachment is still planned.
- Record each DMA-capable PCI function’s attachment to an IOMMU unit, or explicitly keep the function on the prototype bounce-buffer-required policy when no trusted IOMMU domain can be created. This reporting step is implemented for retained PCI diagnostics functions when DMAR DRHD include-all or single-hop PCI endpoint device-scope metadata proves PCI segment/BDF coverage. Bridge and multi-hop scopes are not treated as attachment proof until PCI topology traversal exists, and include-all fallback fails closed when retained DMAR coverage metadata is capped; trusted domain creation is still planned.
- Define and prove the claimed-device domain policy: one device-manager-owned DMA domain per claimed device or trusted sharing group, with all exported device addresses represented as IOVAs scoped to that domain rather than host physical addresses. The selected QEMU Intel path now implements the per-device form for two claimed DMA-capable functions; trusted sharing groups remain disabled and out of scope.
- Attach
DMAPoolallocation, descriptor validation, MMIO ownership, interrupt ownership, and revocation state to the same device-manager ledger before any doorbell write can make a descriptor visible to hardware. - On revoke, reset, or driver death, stop new submissions, remove or invalidate IOMMU mappings before page reuse, and flush the relevant IOTLB state where the hardware model requires it.
Until those gates exist, direct DMA and userspace driver handoff remain blocked. Devices that cannot be placed in a trusted IOMMU domain must stay on kernel-owned bounce buffers or remain unsupported for production claims. This also affects the hostile-smoke gate: S.11.2 smokes must prove that stale DMA handles, stale completions, reset races, and teardown ordering fail closed for IOMMU-backed IOVA mappings, while the process-exit / exit-under-DMA rows remain covered by the selected backend evidence before a cloud or hardware driver can be treated as isolated from the rest of memory.
Fallback Policy For No Usable IOMMU Exposure
Some providers or VM shapes may not expose remapping hardware that capOS can trust. That includes absent, malformed, unsupported, capped, or incomplete DMAR/IVRS metadata; scopes that require PCI topology traversal capOS has not implemented yet; and platforms where remapping hardware is unavailable or cannot be programmed safely. Those shapes use a fail-closed fallback policy:
- Direct device DMA remains blocked.
direct_dma_trusted_domainsstays zero andremapping_tablesstaysnot-programmed. - Prototype devices that remain enabled use kernel-owned bounce buffers only.
The kernel or device manager owns the pages, descriptor validation,
physical-address programming, and packet or block-data copies between
device-visible memory and non-device memory. General
FrameAllocatorandMemoryObjectcapabilities are not DMA authorities. - capOS does not expose direct hardware authority for userspace
DMAPool,DMABuffer,DeviceMmio, orInterruptin the fallback shape. Result-only.infoskeletons and bounded manifest grants may report conservative status. The currentDMAPoolmanifest grant may allocate and free eight fixed manager-attached, kernel-owned, single-page bounce-bufferDMABufferresult caps, with backing pages scrubbed before frame release and no host physical address or IOVA exposed. That narrow fixed-slot allocation/free authority does not map DMA, program device-visible addresses, publish arbitrary CQ entries, program IOMMU/remapping tables, access arbitrary BAR registers or doorbells, or own hardware interrupt acknowledgement, mask, or unmask. The selected provider-TX proof is the current bounded exception: after the same manager-ownedDMABufferauthority and bounce-scrub gates, queue1may publish the full selected TX queue-depth descriptor/avail window into the existing kernel-owned virtio-net TX ring before the first completion, ring one selected notify doorbell per accepted provider descriptor through the live no-writenotify_mmiopolicy, and hand those bounded completions back through descriptor/generation-matchedDMABuffer.completeDescriptorplus livetx_interrupt.waitcompletion events. The same selected path can also usetx_interrupt.mask/unmaskto toggle only the selected TX MSI-X table vector-control bit and matching route state after live issue-id and route validation, and can retire one deferred LAPIC EOI for each delivered selected TX used-ring completion event, withInterrupt.acknowledgereturning ABI-visible provider CQ/ack ledger fields plus hardware dispatch ack count, delta, token, and mutation flag for that bounded pairing. Full-queue QEMU bursts that coalesce selected TX MSI-X delivery use a boundedINT $vectorproof hook only while the virtio TX completion path has an active full-window coalescing budget, so the selected IDT handler and deferred-EOI path remain observable without claiming full production IRQ ownership. Successful selected queue1DMABuffer.completeDescriptor,tx_interrupt.wait, andtx_interrupt.acknowledgeresults also carry bounded CQ event identity: sequence, queue, descriptor id, slot, slot generation, software descriptor generation, completion length, provider issue id, source id/generation, and route generation. Pre-event, duplicate ack, masked-route ack, wrong-order completion, teardown-drain, stale issue after release/regrant, reset, and stale-after-release paths keep that identity empty and do not mutate the bounded identity queue. Provider TX release also retires delivered but unacknowledged bounded CQ events for the live issue before clearing that issue: the stale post-release ack path is revoked, and the release proof records seven pending provider completion acks and their deferred EOIs as release-retired. The same selected path also has a bounded teardown-only drain for seven incomplete provider-published TX descriptors while one completed descriptor remains live: release may explicitly drain only the incomplete matching used-ring entries, retire those allocation-backed device-DMA TX queue ledgers, and free only after manager in-flight state is drained, without publishing provider CQ/IRQ events or issuingDMABuffer.completeDescriptorresults. The paired provider RX bootstrap grant can now validate the live RX issue and selected virtio-net RX route before toggling only the selected RX MSI-X table vector-control bit and route state, and it can complete one selected-route RXInterrupt.waitafter a delivered RX MSI-X/LAPIC dispatch. The pairedInterrupt.acknowledgeaccounts exactly one RX dispatch token and retires one deferred LAPIC EOI for that delivered zero-CQ RX event; pre-event, masked-route, duplicate, and stale-after-release paths fail closed without mutating delivery or acknowledgement state. RX descriptor accounting and RX CQ ownership remain bounded to the synthetic proof path, and full hardware IRQ ownership remains blocked. These exceptions do not transfer full virtio-net ownership, direct DMA, IOMMU authority, arbitrary doorbells, production NIC/storage authority, or cloud readiness. - capOS does not claim hostile-hardware isolation for those shapes. A malicious or compromised bus-mastering device without a trusted remapping domain can still write arbitrary RAM at the hardware level. The fallback is acceptable only for prototype devices and trusted emulator or provider shapes where that hardware threat is outside the claim; otherwise the device remains unsupported.
- Before any userspace driver path can rely on DMA or IRQ authority, S.11.2 hostile smokes must pass for the selected backend. That includes stale DMA handles, stale completions, descriptor abuse, revoke/reset races, stale IRQs, teardown-under-DMA for IOMMU-backed IOVA mappings, and exit-under-DMA for the fallback bounce-buffer path when the fallback is used.
This fallback policy is separate from current diagnostics-only IOMMU metadata
coverage and from future real remapping-domain integration. Diagnostics can
report static firmware-table coverage for a PCI function, but unless capOS
creates a device-manager-owned remapping domain and programs mappings, the
active direct-DMA policy remains blocked. Future real integration must attach
DMAPool, DeviceMmio, Interrupt, ledger teardown, mapping removal or
invalidation, and required IOTLB flushes to the same ownership transaction
before a direct-DMA trusted-domain count can become nonzero.
DMA Assurance Model Checked Evidence And Cloud Backend Inputs
The DMA assurance model records the claim boundary and checked bounded evidence
for DMA authority; the cloud backend contract it feeds is authoritative and
lives in the “Cloud DMA Backend” section below:
DMA Assurance Model
and models/dma/. It is a design/evidence scaffold, not a new production
hardware gate by itself. The checked gates are make model-dma-tla,
make model-dma-alloy, make kani-dma-authority, and
make model-dma-deferred-completion-loom; make dma-assurance-model-check
aggregates them locally, while GitHub CI runs the Alloy/TLA+/Loom gates in
dma-assurance-models and the Kani gate in kani-proofs. The
operationalization track that reconciled the skeletons against landed DMA code
is tracked in
Security and Verification
(“DMA Assurance Model Operationalization”).
Cloud NIC/storage work must use the model as the checklist for backend selection. Backend selection is a runtime, fail-closed decision the kernel makes on each boot, with an optional operator override declared in the system manifest; it is not a per-VM-shape safety assertion that a person signs off. The authoritative selection rule and the manifest override contract are defined in the “Cloud DMA Backend” section below.
Cloud backend evidence must separate provider-side DMA isolation from guest-controlled remapping authority. SR-IOV, virtual NIC, GPU, accelerator, or local NVMe support can identify a DMA-capable surface, but it is not enough to claim direct-DMA isolation. A direct-remapping backend needs guest-visible IOMMU or equivalent translation authority that capOS can discover and program. The cloud evidence matrix must record provider API or documentation sources, retrieval date, region or zone, instance type, image and kernel, live guest PCI/device probes, IOMMU table/group observations, and maintenance or device revocation behavior as the support-policy record for advertised targets. The runtime probe, not this matrix, makes the binding per-boot selection.
The matrix does not replace runtime selection. capOS must choose the safest
backend on each boot from what it can actually observe and validate. Direct
remapping is enabled only when guest-programmable remapping authority is
present and passes the selected self-tests. A provider-remapped or bounce path
is selected only when direct DMA remains blocked and device-visible memory can
stay manager-owned. Ambiguous, contradictory, or unvalidated observations select
Unsupported.
The backend candidates are:
- Direct remapping domain. The provider shape must expose guest-programmable remapping hardware; capOS must discover and program a device-manager-owned domain for the target device; descriptor publication must be ordered after mapping; and teardown must remove mappings, observe required invalidation completion, and scrub before page reuse. The selected path must carry stale-handle, stale-completion, descriptor-abuse, revoke/reset-race, teardown-under-DMA, no-host-physical-exposure, and cross-domain alias evidence.
- Labeled bounce-buffer fallback. Direct DMA stays blocked, device-visible
memory remains manager-owned bounce pages, host physical addresses and
generic
MemoryObjectauthority stay hidden from the driver, and stale handle/completion/teardown evidence covers the selected fallback. This path must keephostile_hardware_isolation=not-claimedunless separate per-domain remapping evidence justifies a stronger provider-specific claim. - Unsupported. Devices whose DMA behavior cannot satisfy either candidate stay unbound or disabled. A serial boot result or PCI enumeration line is not enough to claim cloud NIC/storage readiness.
Downstream cloud driver preflights must declare the candidate backend and map their evidence to the assurance model’s invariants: no host-physical exposure, mapping before publication, no page reuse before teardown, stale-handle and stale-completion fail-closed behavior, domain-scoped aliasing only, bounded fail-closed holds, and explicit backend evidence. The evidence matrix is a support-policy record of advertised targets; the runtime probe, not the matrix, selects the backend on each boot.
Cloud DMA Backend
This section is the authoritative contract for how capOS selects a DMA backend for cloud NIC/storage devices. Selection is a runtime, fail-closed decision the kernel makes on each boot from what it can actually probe and validate, with an optional declarative override in the system manifest. There is no human sign-off in the selection path: the runtime probe decides by default, and the manifest override is config that an operator sets for a deployment, not a doc-signing ritual gated on any specific person. Downstream cloud NIC/storage driver slices consume this contract directly as their DMA-backend authority.
The preceding “DMA Assurance Model Checked Evidence And Cloud Backend Inputs” section defines the three backend candidates; this section adds the per-candidate trade-off analysis, the runtime selection rule, the manifest override field, and the downstream-contract scaffolding that a cloud NIC/storage driver declares. The research substrate is the provider evidence inventory Cloud DMA Provider Evidence Inventory, and the invariants and tool mapping are in DMA Assurance Model.
Provider-Written Addresses And No-IOMMU Brokered Bounce
Two DMA-address ownership models can be valid, but they do not apply to the same backend.
- Provider-written, kernel-validated addresses (the NVMe Model B validator) are valid only when the provider’s device-visible address is not a host physical address: a verified direct-remapping/vIOMMU domain-scoped IOVA, or a future synthetic software address namespace that the manager translates before hardware sees it.
- Brokered address publication is the no-IOMMU bounce-buffer model. The provider may own protocol state and buffer capabilities, but the kernel or device manager writes device-visible queue-base, PRP/SGL, or virtqueue address fields because those values are host physical or bus addresses on current no-IOMMU hardware.
Correction recorded 2026-05-27: the earlier reconciliation that treated a
no-IOMMU bounce window as a provider-visible, non-host-physical device address
space is not valid for the current implementation. On the run-pci-nvme
no-IOMMU shape, DeviceDmaAllocation carries host physical pages and the
reviewed IOVA export discipline keeps userspace IOVA/host-physical export
disabled. Therefore a provider-written NVMe queue base or PRP on that gate would
export a host physical address, violating the no-host-physical-exposure
invariant. A bounce buffer protects data ownership and copy discipline; it does
not create an untrusted-driver-safe IOVA namespace by itself.
The kernel on-notify DMA validator
(kernel/src/cap/nvme_doorbell_validator.rs, validate_doorbell_scan) remains
useful evidence for the provider-written model. On a queue-arm/CC.EN write and
on an SQ tail doorbell it scans the device-visible addresses the provider
published (queue bases; PRP1/PRP2 and one level of PRP-list indirection) and
fails closed before the doorbell takes effect on any address that is not wholly
within a window granted to the doorbell claim’s owner at the live generation:
out-of-window, host-physical, cross-owner-alias, region-overrun, unaligned,
deeper-than-one-level PRP chain, or stale generation (ScanReject). Owner
identity and live generation come from the grant ledger, never from
provider-supplied metadata. A completion whose submission scan was never
validated, or was validated under a now-retired generation, does not wake a
waiter (completion_wakes_waiter), matching the stale-completion gate on the
virtio-net path. That mechanism is the right fit for the QEMU direct-remapping
lane and any future cloud shape that exposes guest-programmable remapping.
For the current GCP/no-IOMMU target, the storage path must use brokered bounce:
userspace supplies typed commands, queue ownership intent, and live
DMABuffer/buffer-cap handles; the manager materializes the actual
device-visible queue-base and PRP/SGL fields and orders publication, teardown,
copy, and scrub. That still leaves protocol-specific NVMe logic in userspace,
but it does not let userspace author raw device addresses.
The brokered admin-queue enable landed 2026-05-27
(nvme-no-iommu-brokered-controller-enable,
device_manager::nvme_brokered_admin_queue_enable). The provider allocates the
admin submission/completion queue pages through its DMAPool cap and requests
enable through the CC selected-write claim (CC.EN set); the manager resolves
those pages from the live ledger (proof_buffers slots 0/1), validates the
authored bases through validate_doorbell_scan (ScanKind::QueueArm), and
authors AQA/ASQ/ACQ plus the CC.EN write itself. The provider never
receives the host physical / device-visible queue-base address; CSTS.RDY=1 is
observed only through brokered reads. This is the brokered model applied to the
admin queue-arm; the steady-state SQ-tail doorbell over provider-written PRPs
still needs the direct-remapping/synthetic-address lane above. Proof
make run-pci-nvme; provenance docs/devices/nvme.md §6.
Provider-Side Isolation Versus Guest-Programmable Remapping
The decisive distinction for backend selection is between a DMA-capable surface and guest-programmable remapping authority:
- SR-IOV (AWS ENA, Azure Accelerated Networking VF), a virtual NIC (gVNIC, virtio-net), a GPU/accelerator, or local NVMe identifies a device that does or could bus-master. This is a DMA-capable surface, not a safety property.
- A direct-remapping classification requires a usable Intel VT-d, AMD-Vi, or Arm SMMU unit that the guest can discover, program, and validate, with translation/fault/invalidation behavior matching IOMMU Remapping Grounding. A DMA-capable surface alone never implies this.
- Provider-side isolation facts (host-enforced VPC isolation, Nitro/host data-path bypass, hypervisor-side IOMMU) are support-policy assumptions a guest cannot prove from inside, not evidence that capOS can safely program direct DMA.
Runtime probing is authoritative for selecting the safe backend on a particular
boot: capOS chooses from the device inventory, the remapping authority it can
actually program, driver self-tests, and fail-closed probe results, and unknown
or contradictory observations select the labeled bounce-buffer path or
Unsupported. The cloud VM evidence matrix is the separate support-policy
record for advertised targets and provider assumptions a guest cannot fully
prove by itself; it does not override the boot-time runtime selection.
Candidate Trade-Off Analysis
| Dimension | Direct remapping domain | Labeled bounce-buffer fallback | Unsupported |
|---|---|---|---|
| IOMMU coverage requirement | Requires a guest-programmable VT-d/AMD-Vi/SMMU unit capOS can program and validate per device. | None: used precisely when no usable guest IOMMU is exposed. | N/A: device stays unbound. |
| Cloud VM shape coverage (per inventory) | No probed GCE shape exposes a guest-programmable IOMMU; AWS/Azure shapes not yet probed. So no probed shape currently qualifies. | Indicated for shapes with a DMA-capable surface but no guest IOMMU (the probed GCE rows); fail-closed default for unproven shapes with a manager-ownable surface. | Ambiguous, contradictory, or unvalidated observations. |
| Per-operation cost | Translation only; no data copy. IOTLB/context-cache invalidation on teardown. | Copy between device-visible bounce pages and non-device memory on every transfer; Confidential VMs force this in hardware regardless. | None. |
| Hostile-smoke coverage today | Bounded QEMU Intel path only (make run-iommu-remapping, ddf-iommu-remapping-production-closeout); no cloud guest-IOMMU evidence. | S.11.2.7/8/9 rows enforced by the make run-net gate (tools/qemu-net-smoke.sh), with bounce-buffer virtio-net provider evidence in ddf-provider-virtio-net-driver-closeout; bounce-buffer DMAPool lifecycle by make run-dmapool-grant. The GCP-shape local binding precursor (cloud-gcp-virtio-net-local-qemu-binding) asserts, in both make run-net and make run-ddf-provider-consumer, that the enumerated/bound function matches the documented GCP 1st/2nd-gen virtio-net device surface (standard virtio-net, vendor 0x1af4) and that the resolved backend is this labeled bounce-buffer path; it does not claim live GCP enumeration. | N/A. |
| Hostile-hardware isolation claim | Claimable only with per-domain remapping evidence and the IOMMU hostile smokes; not yet established for any cloud shape. | not-claimed: a malicious bus-mastering device without a trusted remapping domain can still write arbitrary RAM. | N/A. |
The GCE live-probe rows in the evidence inventory record that every probed GCE
shape (1st-gen n1, 2nd-gen e2, 3rd-gen Intel c3, and AMD-SEV Confidential
n2d) boots with intel_iommu=off, DMAR: IOMMU disabled, SWIOTLB software
bounce buffering, empty /sys/kernel/iommu_groups, and no DMAR/IVRS/IORT
table; the Confidential shape forces bounce buffering as a memory-encryption
invariant. These rows are support-policy expectations, not a hardcoded
selection table: they describe what capOS’s runtime probe should expect to find
on those shapes today, and they confirm that the fail-closed default lands on
the labeled bounce-buffer path there. The boot-time probe, not this matrix,
makes the binding selection on each boot, so a shape whose IOMMU exposure
changes is handled by the probe re-evaluating rather than by editing this text.
AWS and Azure shapes carry no live-probe evidence yet; the probe treats them
the same as any other unproven platform and defaults to the bounce-buffer
path.
Runtime Selection Rule (Fail-Closed Default)
On each boot, capOS probes the platform for guest-programmable remapping authority — IOMMU presence, programmability, and coverage for the DMA-capable functions it intends to bind — and selects the backend fail-closed:
- Probe the platform. Discover DMA-capable functions, then test whether a usable Intel VT-d, AMD-Vi, or Arm SMMU unit is present, discoverable, and programmable, and whether its translation/fault/invalidation behavior passes the self-tests in IOMMU Remapping Grounding.
- Select fail-closed. Select the direct remapping domain backend for a device only when the probe positively verifies a usable+safe IOMMU for that device. If the probe cannot verify it — IOMMU absent, not programmable, coverage unproven, self-test failed, or observations ambiguous — select the labeled bounce-buffer fallback for any DMA-capable surface the manager can keep manager-owned, or Unsupported when even that cannot hold. This probe-gated rule is the default: on an unproven platform the probe cannot verify, so direct DMA is not used and the bounce-buffer path is chosen.
- There is no human in this loop. The machine decides per boot; the only external authority is the optional manifest override below.
This is the boot-time authority. The cloud VM evidence matrix above is the support-policy expectation of what the probe should find, not the decision itself.
Manifest Override Field (Operator Authority Lever)
An operator can override the runtime default for a deployment through one
declarative, auditable enum field in the system manifest’s kernel parameters:
the dmaBackendPolicy field of the SystemConfig struct in
schema/capos.capnp. It is config, not a doc-signing ritual, and is not gated
on any specific person. The field is absent by default, and the absent default
applies the probe-gated runtime selection rule above. The enum values and their
interaction with the probe result are:
| Value | Probe verifies usable+safe IOMMU | Probe cannot verify | Notes |
|---|---|---|---|
| (field absent) | direct remapping domain | bounce-buffer fallback | Default: the probe-gated runtime selection rule. Direct-DMA when the probe verifies a usable+safe IOMMU, bounce-buffer fallback otherwise (fail-closed). Identical to enable-if-verified. |
enable-if-verified | direct remapping domain | bounce-buffer fallback | The explicit, auditable form of the default. Probe-gated direct-DMA with fail-closed bounce-buffer fallback. Redundant with the absent default but kept for explicit configuration. |
enable-unsafe | direct remapping domain | direct remapping domain | Force direct-DMA even when the probe cannot verify it. The operator takes responsibility for the platform’s DMA isolation; the value name carries the warning. Use only on a platform whose isolation is known-good out of band. |
bounce-buffer | bounce-buffer fallback | bounce-buffer fallback | Pin the labeled bounce-buffer path and disable direct-DMA entirely, even where the probe would verify a usable IOMMU. The most conservative value. |
Selection rules that hold for every value:
- The absent default and
enable-if-verifiedselect direct DMA only when the probe verifies a usable+safe IOMMU, and otherwise fall back to bounce-buffer. enable-unsafeis the sole value that can pick direct DMA without probe verification. The value name is the acknowledgement; there is no separate per-shape “I-accept-unverified” ceremony.bounce-buffernever selects direct DMA, even where the probe would verify a usable IOMMU.- When the selected backend is
Unsupportedfor a device (no manager-ownable DMA-capable surface at all), the device stays unbound regardless of the override value. The override governs direct-vs-bounce, not whether an unbindable device is forced online.
This selection mechanism is implemented. The dmaBackendPolicy capnp enum
encodes an absent field as ordinal 0 (unspecified), which decodes identically
to enable-if-verified; an unrecognized ordinal decodes fail-closed to the
bounce-buffer path (never direct DMA) rather than failing the manifest parse or
honoring the probe-gated default. The kernel resolves the backend on each boot
from the IOMMU probe verdict and this override and emits a boot proof line of
the form dma: backend selection dma_backend=<direct-remapping|bounce-buffer> dma_backend_override=<absent|enable-if-verified|enable-unsafe|bounce-buffer> probe_verified_usable_iommu=<bool>. The bounded QEMU shapes prove the
probe-gated default end-to-end: make run-iommu-remapping (verifiable Intel
VT-d shape) records dma_backend=direct-remapping dma_backend_override=absent,
and make run-dmapool-grant (no usable IOMMU) records
dma_backend=bounce-buffer dma_backend_override=absent. The override values and
the unknown-ordinal fail-closed decode are covered by cargo test-config over
the shared selection rule, capnp round-trip, and CUE decode.
The production cloud (non-qemu) build emits the same backend selection on
the cloudboot harness’s serial-port-1 path as
cloudboot-evidence: dma-backend bounce_buffer (in the harness token
namespace), and on the bounce-buffer verdict drives the production-path
bounce-buffer DMAPool + DMABuffer grant proof
(kernel/src/cap/dmapool_bounce_buffer_grant_proof.rs, called from
kernel::run_init): stage a parked manager-attached DMAPool over one
DMA-capable PCI function from the inventory through
device_manager::stage_bounce_buffer_dmapool_record
(kernel/src/device_manager/stub.rs), allocate one bounded bounce-buffer
DMABuffer through
device_manager::issue_manager_attached_dmabuffer_handle_with_request
(which calls
device_dma::allocate_manager_attached_dmapool_bounce_buffer_page),
assert the cap-info labels (userspace_dmapool=manager-issued-bounce-buffer,
allocation=single-bounce-buffer-page, real_dma=not-attempted,
direct_dma=blocked, host_physical_user_visible=0,
iova_export=disabled-future-only), quiesce-before-release
(release_dmapool_record_for_cap_release returns pending-buffer-release
while the buffer is live), scrub-before-reuse (the released bounce-buffer
frame is zeroed in place between scrub and frame-free), and
stale-handle-after-detach, then emit
cloudboot-evidence: dma-pool-grant <token> with shape
<seg>.<bus>.<dev>.<fn>-pool.<slot>.gen.<gen>-phys.<hex> (every character
inside the harness grammar [A-Za-z0-9._-]+). The proof stages a pool,
allocates one bounce-buffer page, asserts the invariants, and emits a marker:
it does not program real DMA, attach a queue, program interrupts, claim a
device for sustained ownership beyond the grant, or emit
provider-nic-bound / storage-bound. A future direct-remapping verdict
skips the proof rather than aliasing direct-DMA onto the bounce-buffer
assertion shape.
Implementation note, 2026-05-29 11:50 UTC: the production-cloud bounce-buffer stub
now implements the cap-side DMABuffer.map(R+W) / DMABuffer.unmap
admission and state-machine entry points
(validate_dmabuffer_map_admission, record_dmabuffer_user_mapping,
begin_dmabuffer_user_mapping_unmap, restore_dmabuffer_user_mapping,
clear_dmabuffer_user_mapping in kernel/src/device_manager/stub.rs) so a
userspace holder of a manager-issued DMABuffer cap can map the single
bounce-buffer page read/write, write payload bytes through that mapping,
unmap it, and observe DMAPool.info.mapped_vmas reflect the live
mapping (make run-cloud-dmapool-grant). The same Absent -> Mapped -> Unmapping -> Absent state machine the QEMU side enforces governs the
parked slot: a duplicate map fails closed with the DmaPoolLive shape,
a clear before the cap-side aspace unmap returns
DmaPoolTeardownEvidenceInvalid, an identity-mismatched begin/clear
fails closed, and a post-freeBuffer map fails closed with the
DmaBufferStaleHandle shape. The manager remains the single owner of the
bounce-buffer page’s device-visible host-physical address and IOVA: the
mapping does not expose either to userspace, real DMA stays
not-attempted, direct DMA stays blocked, and IOVA export stays
disabled-future-only. This is a local-QEMU proof of the userspace mapping
path only; it does not unlock a live cloud NIC bind, IOMMU programming, or
production direct-DMA authority.
Implementation note, 2026-06-03: Phase C slice 2
(cloud_virtio_net_userspace_ownable_vring_proof,
make run-cloud-prod-nic-driver-userspace-ownable-vring) wires this landed
bounce-buffer authority to a userspace virtio-net driver’s own vring without
adding a new isolation backend. The driver allocates its descriptor / available /
used ring pages through a granted DMAPool, and the kernel programs the
virtio queue-address registers (queue_desc / queue_driver / queue_device)
with the manager-owned bounce host-physical address. The brokered-address-
publication model (above) holds: the driver never authors a device address. It
writes an opaque per-buffer device-usable handle (exported through
DMABuffer.info.deviceIova with scope bounce-handle, a deterministic
non-address encoding of the buffer’s manager identity), and the kernel resolves
that handle against the live grant ledger
(device_manager::stub::resolve_virtio_vring_device_address) to the real
host-physical address before any MMIO write. The no-host-physical-exposure
invariant is preserved end-to-end: host_physical_user_visible=0,
iova_export=disabled-future-only, and reads of the queue-address base
registers are refused so the resolved address is never read back into userspace.
Out-of-grant, host-physical-looking, and stale-generation handle writes fail
closed, mirroring the NVMe doorbell-scan reject classes. queue_enable /
DRIVER_OK stay fail-closed (slice 3); this is a local-QEMU proof of the
userspace-ownable-vring path only, not a live cloud NIC bind.
Downstream-Contract Scaffolding
A cloud NIC/storage driver declares its chosen backend through the device-manager policy fields that already label the local paths in this design. The contract below scaffolds the values each candidate requires; it is the shape a driver preflight declares, not an authorization to enable any value.
| Policy field | Direct remapping domain | Labeled bounce-buffer fallback | Unsupported |
|---|---|---|---|
direct_dma | enabled for the programmed per-device domain | blocked | blocked |
trusted_domain | manager-owned domain id for the device | none | none |
bounce_buffer | not required (mapped IOVA path) | required | N/A (device unbound) |
remapping_tables | programmed | not-programmed | not-programmed |
hostile_hardware_isolation | claimed only with per-domain remapping evidence + IOMMU hostile smokes | not-claimed | N/A |
exported_device_addresses | iova-only, domain-labeled | none (no host physical or IOVA exposed) | none |
Each candidate must satisfy the following gates, mapped to the assurance-model invariants in DMA Assurance Model:
- Direct remapping domain — mapping before publication; no page reuse
before teardown, with mapping removal and observed invalidation completion
ordered strictly before scrub/reuse; stale-handle and stale-completion
fail-closed; domain-scoped aliasing only (per-device context entries where a
peer domain cannot resolve another domain’s IOVA); no host-physical exposure
(export only the domain-scoped IOVA, labeled as meaningless outside the
domain); backend evidence explicit. The only landed remapping-domain evidence
is the bounded QEMU Intel path in
ddf-iommu-remapping-production-closeout, exercised bymake run-iommu-remapping. A cloud shape needs its own guest-programmable remapping evidence before this candidate applies to it. - Labeled bounce-buffer fallback —
direct_dma=blocked; all device-visible memory is manager-owned bounce pages; no host physical address and no genericMemoryObject/FrameAllocatorauthority is exposed to the driver; stale-handle, stale-completion, and exit-under-DMA teardown fail closed;hostile_hardware_isolation=not-claimedstays explicit. The landed evidence is the S.11.2.7/8/9 hostile-smoke rows enforced by themake run-netgate (tools/qemu-net-smoke.sh), with the bounce-buffer virtio-net provider evidence inddf-provider-virtio-net-driver-closeout, plus the bounce-bufferDMAPoolgrant lifecycle inmake run-dmapool-grant. As ofddf-real-dma-s112-hostile-smokes, the S.11.2.7 stale-IRQ-after-reset and S.11.2.8 stale-DMA-completion-after-reset closure summaries are asserted on the dedicatedmake run-dmapool-grantgate as well, so the DMA-grant gate fails closed on a hostile-row regression without depending onmake run-net; exit-under-DMA (in-flight drain, scrub, page free) is enforced bymake run-dmapool-grant-exit. See the “Fallback Policy For No Usable IOMMU Exposure” and “Hostile-Smoke Acceptance Matrix” sections in this document for the full gate text. - Unsupported — the device stays unbound or disabled; no driver-visible DMA, MMIO doorbell, interrupt ownership, or storage/network readiness claim is made. A serial boot line or PCI enumeration line is not readiness evidence.
The models/dma/ TLA+ and Alloy files, the extracted Kani core, and the focused
Loom model are checked bounded evidence for these invariants. They supplement
the QEMU/host evidence above; they do not satisfy a candidate hardware or cloud
backend gate by the mere presence of model files. Each checked result records
its tool version, command, bounds, and output in
../models/dma/README.md, and the CI placement is
tracked in Security and Verification.
First QEMU Intel Remapping Smoke Acceptance Gate
Decision recorded 2026-05-14 09:07 UTC. The first slice that programs real
Intel VT-d remapping state under QEMU has an explicit, bounded acceptance gate.
This unblocks the
ddf-iommu-qemu-intel-remapping-smoke
task, whose ## Acceptance section carries the full gate; the summary here is
the design-level decision.
The first slice programs the minimum Intel VT-d path for exactly one selected
DMA-capable function under a pinned QEMU q35 -device intel-iommu,aw-bits=39
shape: one root entry, one context entry bound to a device-manager-owned domain
ID and a 39-bit second-level page-table root, and a single second-level mapping
from one device-visible IOVA page to one kernel-owned DMAPool page, plus the
root-table-address register write and the global-command/global-status
translation-enable handshake. Acceptance requires observable proof that the
mapped IOVA was translated and that an out-of-domain IOVA faults closed in the
fault-status/fault-recording registers.
Invalidation is part of the gate, not a follow-up. On revoke, device reset,
driver death, and DMAPool page release, the slice must remove the
second-level mapping and invalidate the relevant context-cache and IOTLB state
through the selected invalidation interface, observe invalidation completion,
and order page scrub/reuse strictly after that completion. This sits at the
existing “remove IOMMU mappings” and “scrub and free DMA pages” steps of the
DeviceOwnerState revocation order and must not be reachable before
QueuesQuiesced or a completed Resetting transition.
IOVA export discipline: host physical addresses stay hidden from userspace in
every result cap, diagnostic line, and audit record. The selected QEMU Intel
IOMMU-backed DMABuffer.info path may export only the domain-scoped IOVA for
the live mapped buffer generation, explicitly labeled as meaningless outside
that domain; fallback bounce-buffer paths keep IOVA export disabled.
Per-device domain granularity: the selected QEMU Intel path programs two distinct per-device context entries and second-level roots for two claimed DMA-capable functions under the same DRHD. Both domains may use the same IOVA, but the peer domain’s second-level walk is proven not to resolve that IOVA to the primary page; stale and wrong-owner domain assignment fail closed, and trusted multi-device sharing groups remain disabled.
The kernel-owned bounce-buffer fallback stays the path for VM shapes without
usable remapping hardware and must remain explicitly labeled
(fallback_policy=kernel-owned-bounce-buffer-only,
remapping_tables=not-programmed,
hostile_hardware_isolation=not-claimed); it must never be silently
reinterpreted as direct-DMA or hostile-hardware isolation. The IOMMU-backed
path adds stale-DMA-handle, stale-completion, descriptor-abuse,
revoke/reset-race, teardown-under-DMA, cross-domain stale-handle, and
fail-closed teardown branch hostile smokes while the existing bounce-buffer
exit-under-DMA and stale-DMA evidence (the device-dma stale-handle and
stale-completion proofs and the S.11.2.7/S.11.2.8 closure summaries) is
preserved unchanged.
Explicitly out of scope for this first slice: AMD-Vi table programming;
trusted multi-device sharing groups; scalable-mode context entries; interrupt
remapping and device-IOTLB options; 48-bit IOVA space / 4-level tables;
production NIC or storage driver ownership; userspace DMAPool direct-DMA
authority; and moving the live virtio-net path off bounce buffers. The
acceptance evidence is QEMU-only emulator evidence, not a hardware-isolation
claim. The smoke adds or selects a focused make run-iommu-remapping gate
asserted by tools/qemu-iommu-remapping-smoke.sh.
Implementation note, 2026-05-02 04:58 UTC: the ACPI discovery path recognizes DMAR and
IVRS in the root table walk, reports absent/valid/malformed/unsupported state,
records bounded table length/header facts, DMAR host address width and flags,
IVRS IVinfo/flags, and bounded remapping-structure type counts. Malformed DMAR
or IVRS structure lengths stop parsing, and unsupported shapes such as parser
scan-cap overflow leave direct_dma=blocked with bounce_buffer=required.
Implementation note, 2026-05-02 05:31 UTC: the attachment-policy slice also
retains DMAR DRHD include-all and bounded PCI endpoint device-scope metadata,
including segment, single-hop BDF, and remapping-hardware register base, and
reports each retained DMA-capable PCI function as IOMMU-attached/covered when
that static table metadata covers its segment/BDF. Bridge and multi-hop scopes
are diagnostic-only until PCI topology traversal can resolve them, and
include-all fallback fails closed when retained DRHD units or scopes are
capped. Functions without trusted static coverage are reported as uncovered;
covered functions are reported as attached/covered, but both paths keep
dma_policy=prototype-bounce-buffer-only, bounce_buffer_required, and
blocked_direct_dma_devices because remapping domains are unsupported. The
direct-DMA trusted-domain count remains zero, and userspace DMAPool,
DeviceMmio, and Interrupt authority remain unavailable.
Implementation note, 2026-05-02 07:27 UTC: the domain-policy staging slice adds
a pci: dma-domain policy proof line and a diagnostics mirror. The proof
reports the future domain owner as the device manager, the domain granularity
as per-device or trusted-sharing-group, exported device addresses as IOVA-only,
host_physical_user_visible=0, direct_dma_trusted_domains=0,
claimed_device_domains_ready=0, remapping_tables=not-programmed,
remapping_domains=not-started, userspace DMAPool/DeviceMmio/Interrupt
as not-started, and prototype devices as kernel-owned bounce-buffer-only.
Malformed, unsupported, absent, or retained-capped metadata leaves direct DMA
blocked; proof_result=ok is only evidence for that conservative
blocked-direct-DMA policy.
Implementation note, 2026-05-09 18:47 UTC: the blocked-direct-DMA admission
decision now lives in the pure capos-lib::device_authority validator next to
the DMA/MMIO/IRQ handle validators. Host tests cover the current all-prototype
bounce-buffer shape, fail-closed results if any direct trusted domain is
claimed before the policy is ready, fail-closed results if the prototype
bounce-buffer count does not cover every DMA-capable function, and the absent,
malformed, unsupported, and retained-capped metadata labels. The kernel PCI
proof line and diagnostics mirrors consume that pure decision while preserving
the existing direct_dma=blocked, remapping_tables=not-programmed,
domain_activation=not-started, and policy=blocked-direct-dma labels. This
is IOMMU/remapping groundwork only; it does not program remapping tables,
create trusted domains, expose host physical addresses, or enable production
userspace DMA authority.
Implementation note, 2026-05-02 15:29 UTC: the COM1 devices diagnostics
command now prints the same bounded DMA-domain policy facts without naming an
owner identity. The line explains that all current DMA-capable prototype
functions remain on direct_dma=blocked, bounce_buffer=required,
direct_dma_trusted_domains=0, claimed_device_domains_ready=0,
remapping_tables=not-programmed, exported_device_addresses=iova-only,
host_physical_user_visible=0, and
prototype_devices=kernel-owned-bounce-buffer-only. This is a diagnostics
mirror for the current conservative policy, not evidence that IOMMU remapping
domains or userspace DMA authority exist.
Implementation note, 2026-05-02 15:45 UTC: attached device-manager DMAPool
records now store the current explicit bounce-buffer policy and the QEMU
device-manager proofs read it back through the active device record plus the
matching DmaPoolHandle. The logged policy scope is
device-manager-attached-dmapool-bounce-buffer-policy, with
direct_dma=blocked, bounce_buffer=required, trusted_domain=none,
remapping_tables=not-programmed, remapping_domain=not-started,
userspace_dmapool=not-started, host_physical_user_visible=0, and
policy_bound_to_manager=true. This binds the conservative policy to current
manager state; it still does not program remapping domains, expose userspace
DMAPool, or perform real DMA mapping teardown.
Implementation note, 2026-05-11 00:00 UTC: attached device-manager DMAPool
policy records now also carry an explicit manager-owned remapping-domain
ledger staging record. The lifecycle and imported-live proofs report
remapping_domain_ledger_scope=device-manager-attached-dmapool-remapping-domain-ledger,
static_iommu_coverage=acpi-pci-diagnostic-only,
remapping_domain_owner=device-manager,
remapping_domain_granularity=per-device-or-trusted-sharing-group,
remapping_domain_ledger=manager-owned-staging-record,
remapping_domain_ready=false, and
iova_export=disabled-future-only, while preserving
direct_dma=blocked, remapping_tables=not-programmed, and
host_physical_user_visible=0. This is a software ledger/readiness record
only: capOS still does not program Intel VT-d/AMD-Vi/QEMU remapping tables,
create a trusted direct-DMA domain, expose host physical addresses or IOVAs,
or claim production hostile-hardware DMA isolation.
Implementation note, 2026-05-11 17:07 UTC: the same manager-owned
remapping-domain staging record is now an explicit activation gate tied to the
active DMAPool record and matching handle. The device manager validates that
gate before current DMAPool policy/accounting and buffer issue paths proceed.
The gate reports domain_ownership=manager-owned-active-dmapool, but keeps
direct_dma=blocked because remapping_table_programming=not-programmed,
iova_export=disabled-future-only,
remapping_invalidation_policy=not-installed,
remapping_iotlb_flush_policy=not-installed, and
remapping_stale_mapping_cleanup=not-installed; the selected fallback remains
remapping_fallback_policy=kernel-owned-bounce-buffer-only. The activation
result is blocked-remapping-prerequisites-missing with
remapping_activation_gate=fail-closed,
remapping_activation_blocker=remapping-tables-not-programmed, and
remapping_activation_side_effect=side-effect-blocked. This is a software
policy gate and proof surface only: capOS still does not program Intel
VT-d/AMD-Vi/QEMU remapping tables, create a trusted direct-DMA domain, export
IOVAs or host physical addresses, remove real IOMMU mappings, flush IOTLB
state, or prove IOMMU-backed hostile stale-DMA behavior.
Implementation note, 2026-05-12 18:49 UTC: the manager-owned staging record
now includes a concrete per-device remapping-domain identity for the active
DMAPool handle: claimed-device domain identity, staged single-device sharing
group, BDF-derived device id, pool slot, pool generation, and owner generation.
The activation preflight treats that identity binding as a prerequisite before
direct DMA could be considered, and the QEMU lifecycle/imported-live proofs emit
a dmapool remapping domain identity proof line. The direct-DMA blocker remains
unchanged: remapping tables are still not-programmed, IOVA export is still
disabled-future-only, invalidation, IOTLB flush, and stale mapping cleanup are
still not-installed, direct DMA remains blocked, and the fallback remains
kernel-owned-bounce-buffer-only.
Implementation note, 2026-05-12 21:19 UTC: the same manager-owned
remapping-domain ledger now carries a separate mapping-lifecycle preflight
record. The record is bound to the active DMAPool handle and claimed-device
domain identity, and the existing device-manager policy gate validates it
before accepting the current bounce-buffer attach/accounting/buffer-issue
paths. Its direct-DMA result remains fail-closed with explicit blockers:
IOVA space, mapping install, removal before page reuse, invalidation policy,
IOTLB flush policy, and stale mapping cleanup are all not-installed. This
is still an in-repo software preflight only; it does not program remapping
tables, expose IOVAs or host physical addresses, enable direct DMA, remove
real IOMMU mappings, flush IOTLB state, or prove IOMMU-backed hostile
stale-DMA behavior.
Implementation note, 2026-05-12 22:26 UTC: capOS now has an Intel/QEMU
remapping table scaffold that can represent a DRHD identity field, PCI
segment/BDF/source ID, domain ID, QEMU Intel address-width choice, disabled
root/context entries, and a second-level page-table-root placeholder. PCI
diagnostics can bind that scaffold to discovered DRHD/segment metadata when
ACPI/PCI discovery provides it. The disabled backend registry’s only accepted
active state is disabled. The proof labels distinguish representability
from programming: root-table pointer, context-entry programming, invalidation
registers, fault registers, protected-memory registers, and invalidation
queue remain not-written; remapping tables remain not-programmed;
hardware programming remains not-attempted; direct DMA remains blocked;
IOVA export remains disabled-future-only; and host physical addresses remain
hidden from userspace. This is still not Intel VT-d, AMD-Vi, or QEMU IOMMU
programming.
Implementation note, 2026-05-23 18:06 UTC: the first production DMAPool ledger
integration for the QEMU Intel remapping path now maps the selected
virtio-rng request-buffer IOVA to an active manager-owned DMABuffer page.
Mapping install is admitted through the matching active DmaPoolHandle and
DmaBufferHandle generations, stale pool and buffer generations fail closed,
and wrong-owner mapping attempts are side-effect blocked. On teardown, the
target second-level leaf is removed, the context-cache and IOTLB completion
polls finish, and only then is the DMABuffer released through the production
device-manager ledger. The proof keeps the IOVA internal, keeps
host_physical_user_visible=0, keeps userspace IOVA export disabled for this
slice, and leaves the no-remapping fallback policy as
kernel-owned-bounce-buffer-only.
Implementation note, 2026-05-23 19:18 UTC: QEMU Intel remapping fault
reporting now decodes VT-d FSTS plus FRCD[0] into a bounded kernel record
for the faulting IOVA, reason, requester source ID, and DMA read/write type.
The unmapped-IOVA, stale-handle, and stale-completion proofs record the fault,
clear it with write-1-to-clear semantics, verify the clear-after-record state,
and report source/IOVA match status without exposing host physical addresses.
The COM1 devices diagnostics path now prints an IOMMU fault summary that
reserves fault_summary=clean for a successful clear fault-status read and
labels unavailable fault-status reads as unavailable/fail-closed. Owner
identity, DRHD register bases, and host physical addresses stay hidden. The
optional audit route is explicitly not wired for this slice
(volatile_audit=not-routed).
Implementation note, 2026-05-13 15:29 UTC: active manager-owned DMAPool
remapping preflight records now consume the same retained DMAR DRHD/requester
metadata when PCI coverage is complete. The active disabled Intel/QEMU
scaffold records the retained DRHD identity and requester segment/BDF/source
ID for the bound pool handle, but absent, malformed, capped, unsupported, or
uncovered metadata still leaves the scaffold not-bound and disabled. This is
metadata-only binding: capOS still does not program root/context tables,
install or remove mappings, invalidate remapping caches, flush IOTLB state,
export IOVAs, enable direct DMA, or expose host physical addresses.
Implementation note, 2026-05-12 23:07 UTC: PCI diagnostics now include a
separate Intel/QEMU remapping MMIO-status proof for the selected DMA-capable
function. When complete retained DMAR DRHD metadata covers that function and
the register base is page-aligned, capOS maps only the selected
remapping-register page for bounded volatile diagnostic reads of the version,
capability, global-status, root-table-address, and fault-status registers. The
mapped label describes the diagnostic access pattern, not page-table write
protection. When the default diagnostics shape has no DRHD, metadata is capped,
or the retained DRHD base is invalid, the same proof reports
mmio_window=not-mapped, mmio_read=not-attempted, unavailable
capability/status/fault reads, and a fail-closed reason. The labels preserve
remapping_tables=not-programmed, direct_dma=blocked,
fallback_policy=kernel-owned-bounce-buffer-only, and
hostile_hardware_isolation=not-claimed. This is not remapping-domain
activation: capOS still does not write VT-d, AMD-Vi, or QEMU remapping
registers, install root/context tables or invalidation queues, export IOVAs,
or claim hostile-hardware DMA isolation.
Implementation note, 2026-05-13 01:20 UTC: active manager-owned DMAPool
records now also carry a generic disabled IOVA ledger under the
remapping-domain record. The ledger binds a domain-scoped reservation identity,
hidden internal range metadata, and reservation generation to the active owner
and pool generation. The same device-manager accounting path validates that
ledger before current bounce-buffer allocation, descriptor submission,
completion accounting, buffer free/page release, and pool release checks
proceed. Pure tests and QEMU proof labels reject stale reservation generations
and wrong owner generations as disabled-iova-ledger-stale with side effects
blocked. The active state remains disabled: proof output keeps the internal
range hidden and reports iova_base_user_visible=0,
host_physical_user_visible=0, iova_export=disabled-future-only,
direct_dma=blocked, mapping_install=not-installed,
mapping_remove_before_page_reuse=not-installed,
invalidation_policy=not-installed, iotlb_flush_policy=not-installed, and
stale_mapping_cleanup=not-installed. This still does not program remapping
tables, export IOVAs, expose host physical addresses, install or remove real
IOMMU mappings, flush IOTLB state, or claim hostile-hardware isolation.
Implementation note, 2026-05-11 18:44 UTC: the selected userspace virtio-net
TX provider smoke now grants a runtime-visible DeviceMmio notify BAR cap
named notify_mmio, but keeps the active DMA posture unchanged. The provider
still uses manager-owned bounce buffers, direct_dma=blocked, and
host_physical_user_visible=false; the notify cap is a no-write MMIO
admission boundary over the selected virtio-net TX notify offset, not a direct
DMA or descriptor-ring ownership transfer. The selected submit path validates
descriptor authority and scrubs the bounce page before consuming the
grant-derived notify policy, and it proves wrong value, wrong offset, stale
handle, and stale generation block before any doorbell. This does not program
IOMMU/remapping tables, export IOVA or host physical addresses, mutate the
real virtio-net descriptor ring, or claim production NIC isolation.
Implementation note, 2026-05-11 19:05 UTC: the notify_mmio grant remains
runtime-visible but is now explicitly no-direct-MMIO as well as no-write.
DeviceMmio.map and DeviceMmio.read32 for that provider notify cap return
typed blocked results before any user VMA or register read, DeviceMmio.info
validates the live mapping generation before accepting the provider-notify
record, and notify_mmio detach clears the submit-path notify policy so a
later selected submit reports stale-handle blocking with no doorbell write.
The submit path also invalidates the cached notify policy on owner-generation
transitions and stale/missing cap-release detach boundaries before accepted
no-write authority can be reported.
Implementation note, 2026-05-11 19:56 UTC: the same provider smoke now grants
a runtime-visible Interrupt cap named tx_interrupt for the selected
virtio-net TX MSI-X route snapshot. This extends the grantable authority
boundary without changing the DMA posture: the provider still uses
manager-owned bounce buffers, direct DMA and IOMMU remapping remain blocked,
and the interrupt cap only proves generation-checked admission for
info/ack/unmask/wait/mask plus waiter cancellation and stale-after-release
blocking. Later bounded follow-ups add selected TX MSI-X vector-control
mask/unmask; this first grant slice did not deliver provider IRQs to userspace,
acknowledge or mask/unmask hardware, ring a doorbell, or mutate real
virtio-net descriptor rings.
Implementation note, 2026-05-11 20:09 UTC: provider TX waiters are now
separate no-delivery waiter-table entries rather than generic delivery
waiters. A pending tx_interrupt.wait remains pending across TX route
delivery-count advancement and only completes through the explicit
mask/release cancellation path. The staged provider TX grant source also
tracks a live-issued cap and refuses another live tx_interrupt alias for the
same route snapshot.
Implementation note, 2026-05-12 09:13 UTC: the selected userspace virtio-net
TX provider path now performs one bounded real descriptor/avail publication on
queue 1 while keeping the DMA posture conservative. The descriptor points at
the manager-owned bounce buffer already governed by the DMABuffer record;
the submit path validates live buffer identity, scrubs before publication,
requires the live no-write notify_mmio policy, asserts that the descriptor
page is ledgered to the virtio-net TX queue, and blocks wrong queue, stale
notify policy, and a real stale DMABuffer.submitDescriptor(queue=1) attempt
at the stale capability/liveness boundary before touching the real ring. The
proof logs descriptor/avail/used ring physical addresses for kernel evidence,
but those addresses are not returned to userspace. Because this slice does not
claim real used-ring/CQ completion, the
published page remains pinned in the manager in-flight record; userspace
completeDescriptor, freeBuffer, post-publication remap, and cap-release
drain do not retire, remap, or release it. A follow-up rings one selected TX
virtqueue notify doorbell after the same descriptor authority, submit-scrub,
live notify_mmio policy, submit-effect write, and publication gates, while
wrong-queue and stale-notify or stale-DMABuffer paths remain not-written.
Readback-mismatch publication failures do not write the immediate doorbell and
are treated as possibly-published ring state that quiesces later TX
notification and keeps the manager buffer pinned rather than claiming rollback.
Pre-publication bounce-page metadata remains doorbell_write=not-written.
Any immediate used-ring or IRQ effect from that doorbell is recorded only as
an out-of-scope hardware side effect. Later 2026-05-12 follow-ups advanced the
selected path to a bounded used-ring completion handoff:
DMABuffer.completeDescriptor validates the live manager-attached buffer and
in-flight descriptor id, observes the real TX used ring for the stored
software descriptor generation, consumes that entry through the existing
descriptor tracker and DMA ledger, and only then clears the manager in-flight
record. As of commit e248d42b (2026-05-23 13:36 UTC), kernel TX helpers
stay quiesced after provider ownership starts while the provider path can
publish the full selected TX queue-depth window of eight descriptors before the
first completion; the smoke records live_inflight_after_submits=1/2/3/4/5/6/7/8
(the ninth allocation rejected dmapool-budget-exceeded), blocks map/free/reuse
while any buffer is in flight, and proves wrong-order descriptor 7 used-ring
handling preserves the observed descriptor 0 completion for its matching
generation.
The
provider-facing
tx_interrupt waiter is a
runtime-visible completion-event consumer for the same selected route;
delivery validates the expected TX source id, source generation, route
generation, owner, driver-unmasked state, and live issue id before completing
each bounded event. A 2026-05-13 follow-up adds the bounded
incomplete-descriptor teardown drain: when one descriptor has completed and
seven remain incomplete, release retires only the incomplete descriptors’
allocation-backed TX DMA ledgers and clears only the selected virtqueue
descriptor/used-ring tracking needed for those releases, while CQ publication
and provider IRQ delivery stay blocked and the pending waiter remains
undelivered until the smoke explicitly cancels it through the existing
mask/cancel path. Commit e248d42b (2026-05-23 13:36 UTC) extends that drain
to the full selected TX queue-depth window and keeps the completed descriptor’s
buffer retained until it is explicitly freed. A later 2026-05-13 remediation
binds each provider TX
in-flight descriptor to the submission-time provider issue/source/route
generation. If that old descriptor completes after tx_interrupt
release/regrant, DMABuffer.completeDescriptor fails closed as
dmabuffer-provider-tx-stale-issue before consuming the used ring, publishing
provider CQ/IRQ state, or advancing provider acknowledgements; later cap
release may still drain the descriptor as teardown-only.
tx_interrupt.wait posting is serialized with provider release, mask, and
delivery, and stale issue ids fail closed at admission and insertion. A later
2026-05-13 follow-up lets tx_interrupt.acknowledge account exactly one
already observed selected TX dispatch token paired with one delivered provider
CQ event; the smoke proves pre-event, duplicate, teardown-drain, masked-route,
reset/regrant, stale-after-release, and stale issue acknowledgements fail closed
before delivery-count, route-state, CQ, ack-ledger, or hardware-dispatch-ack
mutation. This is still bounded selected-route evidence: provider IRQ
ownership, deferred EOI, LAPIC/MSI-X acknowledgement, direct DMA, IOMMU mapping,
full virtio-net ownership, production NIC/storage migration, and cloud
readiness remain open. Commit e248d42b (2026-05-23 13:36 UTC) adds
release-time retirement for delivered but unacknowledged bounded provider TX CQ
events: the release proof now records seven pending provider completion acks
retired from the ledger in one live issue while preserving the separate
stale-bound in-flight descriptor proof, with stale post-release ack revoked and
no hardware ack claimed. A 2026-05-13 follow-up adds bounded selected TX MSI-X
mask/unmask only: live provider tx_interrupt.mask and unmask toggle the
selected TX vector-control bit plus route state, preserve generations and
delivery counts, and block stale issues before side effects.
Implementation note, 2026-05-02 19:43 UTC: the bounded zero-live
device-manager DMAPool lifecycle proof now treats its manager-attached DMA
buffer record as teardown-blocking metadata. The pool detach path still checks
authoritative live accounting first, then rejects zero-live detach while the
proof buffer is attached as dmapool-buffer-attached. Before the active free
path, the proof validates stale same-slot and wrong-identity FreeBuffer
operations through
capos-lib::device_authority::validate_dma_buffer_operation. The wrong
identity cases cover wrong owner generation, wrong pool slot, wrong pool
generation, and wrong buffer slot; each records dmabuffer-stale-handle, the
exact validator reason (stale-owner-generation, wrong-pool,
stale-pool-generation, or wrong-slot), side-effect-blocked, and a
preserved manager-owned buffer record. The stale same-slot case continues to
record stale-slot-generation and buffer_stale_free_preserved=true, then
the proof observes that pool detach still fails as
dmapool-buffer-attached. The proof clears the gate only after validating a
proof-scoped active FreeBuffer operation, scrubbing and freeing the
kernel-owned proof frame, and detaching the manager-owned buffer record. This
remains lifecycle evidence only: no userspace DMAPool or DMA-buffer
authority is exposed, no physical address or IOVA is exposed, and S.11.2
hostile stale-DMA smokes remain open.
Implementation note, 2026-05-03 02:31 UTC: the same zero-live
device-manager DMAPool lifecycle proof now validates manager-record
CompleteDescriptor authority for the attached DmaBufferHandle. Active
completion validation records buffer_active_complete_result=ok; freed-buffer,
reused-slot generation, and stale-after-revoke completion attempts fail closed
as dmabuffer-stale-handle with exact pure-validator reasons and
side-effect-blocked. This is manager-record validation evidence only: it does
not complete a real descriptor, publish a completion queue entry, grant a
userspace DMABuffer, run real DMA, or clean up or reuse production
userspace DMA pages.
Implementation note, 2026-05-14 14:05 UTC (DDF IOMMU remapping Slice A1): the
first slice that programs real Intel VT-d remapping state under QEMU has
landed. The ## First QEMU Intel Remapping Smoke Acceptance Gate above defines
the full bounded gate; that gate is being delivered as a sequenced A1/A2/B/C
split (the slice was correctly scoped as bigger than one reviewable unit). This
note records Slice A1.
Pinned QEMU shape: qemu-system-x86_64 8.2.2, -machine q35,
-device intel-iommu,aw-bits=39 (3-level second-level page tables, 39-bit IOVA
space). The kernel iommu module (kernel/src/iommu.rs, cfg(qemu)-only)
selects one DMA-capable function that is not the live virtio-net
bounce-buffer path (virtio-rng under the default smoke shape), allocates a root
table, one context table, a 3-level second-level page table, and one mapped
DMA page; encodes and writes one root entry, one context entry (binding the
requester source id to a domain id, aw-bits=39, and the second-level root),
and the second-level table-pointer / leaf entries through the HHDM; writes the
Root Table Address Register; and runs the global-command/global-status
SRTP-then-TE handshake, polling the status register for each step. The
capability-register extended-capability IRO field (IOTLB register offset) is
decoded and reported for Slice B’s benefit. MMIO ordering invariants are
enforced with SeqCst fences: between the last in-memory table-entry write and
the RTAR write, between the RTAR write and GCMD.SRTP, and between the latched
root pointer and GCMD.TE.
Slice A1 proves translation with kernel-side structural evidence only, which
the gate’s IOVA-export-discipline clause explicitly permits. Hardware confirms
translation-enabled (GSTS.TES + GSTS.RTPS polled set), the written entry
words are read-back-verified through the HHDM, the pure
capos_lib::device_authority validator accepts the layout, and the unmapped
IOVA’s 3-level walk structurally terminates at a non-present entry. The proof
labels are scrupulously honest: mapped_iova_translated=structural (not
hardware-dma), unmapped_iova_fault=structural-not-present (not observed),
proof_evidence=kernel-side-structural. A real hardware-DMA translation and a
real fault-status fault require driving a device virtqueue through the IOMMU;
that is deferred to follow-on task A2 (a virtio-rng virtqueue driver as the
DMA proof vehicle). Invalidation + IOTLB flush with completion polling (Slice B)
and the IOMMU-backed stale-handle / stale-completion hostile smokes (Slice C)
are also follow-on slices; at A1 their proof lines emit
proof_result=deferred-next-slice. A2, B, and C have since landed — see the
implementation notes below.
The table pages are recorded in a bounded ledger modeled on the device-manager
DMAPool page-accounting discipline (allocate-record, scrub-before-free on the
fail-closed path); mapping removal with IOTLB-flush-ordered scrub/free is Slice
B. IOVA export stays disabled (iova_export=disabled-this-slice), no host
physical address is user-visible (host_physical_user_visible=0), and no
hostile-hardware isolation is claimed (hostile_hardware_isolation=not-claimed).
The kernel-owned bounce-buffer fallback is unchanged for QEMU shapes without
usable intel-iommu hardware and is emitted with the explicit
fallback_policy=kernel-owned-bounce-buffer-only /
remapping_tables=not-programmed labels. The new make run-iommu-remapping
gate is asserted by tools/qemu-iommu-remapping-smoke.sh; make run-net,
make run-dmapool-grant, and make run-diagnostics continue to prove the
fallback path unchanged.
Implementation note, 2026-05-14 15:19 UTC (DDF IOMMU remapping Slice A2): the device-DMA proof vehicle has landed, upgrading the A1 structural proof to a real hardware-DMA proof and closing the literal hardware-DMA text of gate part
- After the VT-d tables are programmed and
GCMD.TEis set, a minimal virtio-rng virtqueue driver — split into a mapped-DMA phase (crate::virtio::prove_iommu_rng_mapped_dma) and an unmapped-DMA phase (crate::virtio::prove_iommu_rng_unmapped_dma) — drives the device QEMU exposes on theintel-iommushape. The second-level table now installs four leaf entries inside one shared L1 page: the request buffer plus the three virtqueue ring pages (descriptor table, available ring, used ring). The driver programs the device’sQUEUE_DESC/QUEUE_DRIVER/QUEUE_DEVICEregisters and the request descriptor’saddrfield with the programmed IOVAs, never the host-physical page addresses. Because VT-d translation is global per requester onceGCMD.TEis set, every DMA the device issues — every ring access and the entropy write — must walk the second-level table. A used-ring completion plus a non-zero buffer reading therefore proves a real hardware DMA reached the kernel page through the programmed IOVA translation:mapped_iova_translated=hardware-dma,proof_evidence=virtio-rng-hardware-dma.
The driver then publishes a second descriptor whose addr is the
deliberately-unmapped IOVA and kicks the device. The device’s DMA to that IOVA
raises a real VT-d translation fault; the kernel reads it back out of the Fault
Status Register (FSTS.PPF), the first Fault Recording Register’s fault bit
(FRCD[0].F at the decoded CAP.FRO offset), and the faulting page address
recorded in FRCD[0] — which must equal the unmapped IOVA the device was
pointed at — and reports unmapped_iova_fault=observed with the
fault_recording_reason code. The fault gate is purely the VT-d register
surface: whether QEMU’s virtio-rng still pushes the faulting descriptor onto
the used ring afterward (it does, with the entropy write dropped) is QEMU
device behavior, reported as the unmapped_descriptor_uncompleted diagnostic
field but deliberately not a gate condition. The fault registers are cleared
(write-1-to-clear) before the device DMA and again after the observed-fault
read so no stale fault is mistaken for the proof and no fault is left for a
later VT-d consumer. The MMIO discipline reuses A1’s NO_CACHE mapping and the
descriptor/available-ring writes are SeqCst-fenced before the notify
doorbell. The two phases are deliberately split so the kernel reads the fault
registers strictly between them — the unmapped-IOVA descriptor is never in
flight while the mapped-DMA result is judged. The virtio-rng device negotiates
VIRTIO_F_ACCESS_PLATFORM, which is what makes QEMU route its DMA through the
platform IOMMU rather than treating the ring registers as host-physical
addresses; the run-iommu-remapping make target therefore creates the
virtio-rng device with iommu_platform=on (a target-scoped override of the
shared QEMU_SECOND_DEVICE, which no other run target needs because none of
them drives virtio-rng DMA). The IOMMU-backed hostile smokes (Slice C) were a
follow-on at A2 (proof_result=deferred-next-slice) and have since landed —
see the Slice C implementation note below; IOVA export stays disabled and no
host-physical address is user-visible.
Implementation note, 2026-05-14 17:19 UTC (DDF IOMMU remapping Slice B): the
invalidation + IOTLB flush + invalidation-ordered scrub/free has landed,
closing gate part 2. After the A2 hardware-DMA proof, kernel/src/iommu.rs
runs a revocation cycle (run_invalidation_revocation_cycle) that models the
device-manager DeviceOwnerState revocation FSM at the
QueuesQuiesced -> Resetting -> DmaMappingsRemoved -> Dead steps. The cycle
removes the four second-level leaf entries the A2 layout installed (request
buffer + the three virtqueue ring pages), SeqCst-fences so the in-memory
removal is visible to the IOMMU before the flush, then issues two
register-based invalidations: a context-cache invalidation through the
Context Command register (CCMD_REG at offset 0x28, CCMD.ICC set with
CCMD.CIRG global granularity, polling CCMD.ICC clear for completion) and a
domain-selective IOTLB invalidation through the IOTLB register at the
A1-decoded CAP.IRO offset + 8 (IOTLB.IVT set with IOTLB.IIRG
domain-selective granularity and the domain id in IOTLB.DID, polling
IOTLB.IVT clear and reading IOTLB.IAIG back non-zero to confirm the
request was serviced). Both completion polls are bounded by the same
VTD_STATUS_POLL_BUDGET the A1 status handshakes use.
The hard ordering invariant — the whole point of the slice — is that the eight
VT-d ledger-owned table/ring/used pages and the separate production
DMAPool-owned request-buffer page are scrubbed and returned to their ledgers
strictly after both completion polls return. A SeqCst fence sits between
the completion reads and the scrub/free so the ordering is explicit in program
order. A poll that exhausts its bounded budget fails closed:
invalidation_completed is false, the pages are deliberately not freed
(a page reused while hardware may still hold a cached translation through it is
a stale-DMA hole), the ledgers keep them accounted rather than
leaked-and-forgotten, and the proof line reports
proof_result=fail-closed. Slice B uses register-based invalidation only:
no GCMD.QIE queued-invalidation bit is set, so the A1 single-bit-GCMD
discipline (correct only by minimalism — no other persistent GCMD bit set)
still holds and the GCMD-reconstruct boundary is not crossed. The
production DMAPool programming-abort path follows the same rule: if VT-d
programming fails before root/translation state can expose the DMAPool page to
hardware, the prepared DMABuffer/DMAPool records are detached; if the mapping
may already be hardware-visible, the partial VT-d ledger is carried through
the same leaf-removal and invalidation teardown before any VT-d table/ring page
or production DMAPool page can be reused. The
make run-iommu-remapping smoke and tools/qemu-iommu-remapping-smoke.sh now
assert the invalidation proof line as proof_result=ok with
mapping_removed=true context_cache_invalidated=true iotlb_flushed=true
iotlb_actual_granularity_nonzero=true invalidation_completed=true
page_reuse_ordered_after_invalidation=true table_pages_live_after=0
invalidation_interface=register-based-ccmd+iotlb, and forbid both a
regression to the Slice-A deferred label and any invalidation_interface=queued
value. The IOMMU-backed stale-handle / stale-completion hostile smokes (Slice
C) were the deferred follow-on; they have since landed (see the note below),
and as part of that work the single-phase run_invalidation_revocation_cycle
was refactored into a two-phase run_target_revocation_phase +
complete_revocation_teardown so the hostile re-drive can sit between the
phases — the Slice B contract (every page freed strictly after its mappings
are invalidated) is unchanged, and the combined invalidation proof line
still asserts proof_result=ok for the complete teardown. IOVA export stays
disabled and no host-physical address is user-visible.
Implementation note, 2026-05-14 19:13 UTC (DDF IOMMU remapping Slice C): the
IOMMU-backed hostile stale-DMA smokes have landed, closing gate part 5 and the
parent IOMMU remapping task. Closing the slice required refactoring the Slice
B revocation into two phases so the hostile re-drive can run against a
partially revoked remapping — the original single-phase cycle freed every
page (including the virtio-rng descriptor table and available ring) before any
hostile re-drive could observe a fault, so the re-driven device read an
all-zero descriptor and issued no DMA at all. The kernel/src/iommu.rs ledger
now classes each page by revocation-phase role: Target (request buffer +
used ring — what the device’s DMA lands on), RingInfra (descriptor table +
available ring — what the device reads to issue a DMA), and Table
(root/context/second-level tables). run_target_revocation_phase removes the
Target second-level leaves, invalidates the context-cache + IOTLB, and frees
the Target pages — while deliberately keeping the RingInfra + Table
pages mapped and live. run_hostile_stale_dma_cycle then re-drives the
same still-live old-generation virtio-rng device through the new
crate::virtio::prove_iommu_rng_stale_dma (each re-drive uses a fresh
available-ring index past the A2 phases so the device sees a genuinely new
descriptor): because the ring-infra pages are still mapped, the device reads a
valid descriptor whose addr is a revoked target IOVA, so the DMA faults
in the IOMMU (FSTS.PPF + FRCD[0].F, recorded faulting address is the stale
IOVA) instead of reaching memory. A stale mapping-install attempt is refused
(attempt_stale_mapping_install — the RevokedRemapping token is a
dead-domain receipt with no live table handle, not install authority), and the
freed Target page reads back as the scrubbed zeros. A second re-drive at the
revoked used-ring IOVA faults too, publishes no device-written used-ring CQ
entry into the freed page, exposes no memory to a would-be new owner, and
makes no freed page eligible for reuse. complete_revocation_teardown then
finishes the Slice B teardown by revoking + freeing the RingInfra + Table
pages; the combined gate-part-2 invalidation proof line
(invalidation_phases=target-then-ringinfra) still asserts proof_result=ok
for the full two-phase teardown, with the same hard ordering invariant in each
phase (pages freed strictly after that phase’s invalidation completion polls
return; a poll that exhausts its budget fails closed and the phase’s pages are
not freed). The load-bearing observation is the revoked translation state
— not device cooperation, not a software ledger drop — blocking the stale DMA,
confirmed by the VT-d fault registers plus a freed-page-stays-scrubbed
read-back through the HHDM. Existing bounce-buffer stale-DMA evidence (the
device-dma S.11.2.7/S.11.2.8 proofs) is preserved unchanged; the
IOMMU-backed hostile smokes are strictly additive. The
make run-iommu-remapping smoke and tools/qemu-iommu-remapping-smoke.sh now
assert both hostile proof lines as proof_result=ok and forbid regression
to the deferred, not-reached, or fail-closed labels. IOVA export stays
disabled (iova_export=disabled-this-slice), no host-physical address is
user-visible (host_physical_user_visible=0), and no hostile-hardware
isolation is claimed (hostile_hardware_isolation=not-claimed).
Implementation note, 2026-05-23 (domain-scoped IOVA export): the selected QEMU
Intel production DMAPool path now exposes the mapped request-buffer IOVA
through DMABuffer.info only while the matching active DmaBufferHandle
generation is live. The schema fields are deviceIova,
deviceIovaScope=domain-scoped-iova,
deviceIovaMeaning=meaningless-outside-domain, and
iovaExport=domain-scoped-only; the production remapping proof asserts that
deviceIova=0x200000 matches the installed second-level mapping. After the
buffer is freed and the pool is released, an export attempt on the same handle
fails closed with side-effect-blocked. The bounce-buffer grant path still
reports deviceIova=0, deviceIovaScope=none,
deviceIovaMeaning=not-exported, and iovaExport=disabled-future-only.
Implementation note, 2026-05-23 21:34 UTC (production-path hostile smokes):
the selected QEMU Intel path now emits and asserts iommu-remapping: production dmapool hostile proof over the active manager-owned DMAPool /
DMABuffer ledger. The proof ties the raw VT-d stale-handle and
stale-completion faults to the production mapped IOVA, synthetic stale
pool/buffer generation mismatch candidates, post-teardown stale-handle export
failure, and per-device cross-domain boundary. It covers stale IOVA after
revoke/reset, descriptor abuse, revoke/reset race ordering, stale completion
after reset, teardown-under-DMA ordering, and cross-domain stale-handle
attempts; no second-level entry is installed for stale authority, no CQ entry
is published, no new-owner memory is exposed, and page reuse stays ordered
after invalidation completion. It does not claim a process-exit trigger for
the IOMMU path; the existing make run-net bounce-buffer evidence remains the
exit-under-DMA source. The same smoke asserts the
complete_iommu_dmapool_mapping_teardown prerequisite-false return and the
hold_iommu_dmapool_mapping_ledger_after_abort path as fail-closed branch
evidence. Existing bounce-buffer S.11.2 evidence from make run-net and
make run-dmapool-grant is preserved unchanged.
Implementation note, 2026-05-26 05:55 UTC (direct-DMA posture transition for
the selected QEMU Intel path): the closeout slices above landed the full
mechanism — real hardware DMA over a manager-owned DMAPool DMABuffer page
mapped through the per-device IOMMU domain, domain-scoped IOVA export, per-device
domains, and the production hostile matrix — but deliberately deferred the
headline direct_dma=enabled claim behind iova_export=disabled-this-slice.
The selected QEMU Intel path now emits iommu-remapping: direct-dma posture real_dma=attempted direct_dma=enabled remapping_tables=programmed trusted_domain=<domain-id> descriptors_reference=domain-scoped-iova mapped_page_source=manager-owned-dmabuffer mapping_installed_before_doorbell=true invalidated_before_page_reuse=true bounce_buffer=not-required exported_device_addresses=iova-only host_physical_user_visible=0 hostile_hardware_isolation=not-claimed proof_result=ok. Every field is computed
from the real proof facts, not asserted as a constant: remapping_tables=programmed
requires the root/context/second-level entries written plus the SRTP/TES
handshakes; real_dma=attempted requires the virtio-rng device’s mapped DMA to
have completed through the programmed IOVA (hardware_dma_translation_proven);
direct_dma=enabled additionally requires the manager-owned DMAPool mapping to
have been installed before the device doorbell; and invalidated_before_page_reuse
folds in the two-phase revocation’s page_reuse_ordered_after_invalidation and
invalidation_completed results. This is bounded QEMU-emulator evidence, so
hostile_hardware_isolation stays not-claimed (real hostile-hardware isolation
needs real hardware, not QEMU). The no-IOMMU run-net / run-dmapool-grant
bounce-buffer fallback is untouched: it keeps direct_dma=blocked with no IOVA
export, and make run-iommu-remapping now forbids this path from regressing to
the bounce-buffer fallback proof or to a blocked/not-attempted posture. The
contract table in “Downstream-Contract Scaffolding” (direct-remapping domain:
direct_dma=enabled, remapping_tables=programmed,
exported_device_addresses=iova-only) is now backed by an emitted, asserted
posture on the selected path.
Authority Model
Device authority is split into three independent capabilities:
DMAPool: authority to allocate, expose, and revoke device-visible memory within a kernel-owned physical range or IOMMU domain.DeviceMmio: authority to map and access one device’s register windows.Interrupt: authority to wait for and acknowledge one interrupt source.
Holding one of these capabilities never implies the others. A driver needs all three for a normal device, but the kernel and init can grant, revoke, and audit them separately.
Production Handle Epoch Invariants
All three object families use opaque handles whose identity is checked against kernel-owned records before every operation. A raw object id is never enough to authorize DMA, MMIO, interrupt waits, acknowledgements, descriptor submission, or teardown. A handle is accepted only when all of these facts match in the same ownership transaction:
- the object id resolves to a live record of the expected type;
- the handle’s device owner generation matches the current device-manager owner record;
- the handle’s pool, mapping, slot, source, or route generation matches the current reusable subrecord;
- the record state permits the requested operation.
The exact ABI shape may change when the capability surface is implemented, but production handles must carry the equivalent identity:
#![allow(unused)]
fn main() {
struct DmaPoolHandle {
device_id: u32,
owner_generation: u64,
pool_id: u32,
pool_generation: u64,
}
struct DmaBufferHandle {
device_id: u32,
owner_generation: u64,
pool_id: u32,
pool_generation: u64,
slot: u32,
slot_generation: u64,
}
struct DeviceMmioHandle {
device_id: u32,
owner_generation: u64,
bar: u8,
mapping_id: u32,
mapping_generation: u64,
}
struct InterruptHandle {
device_id: u32,
owner_generation: u64,
source_id: u32,
source_generation: u64,
route_generation: u64,
}
}
Object identity fields have distinct jobs:
DMAPoolhandles name the claimed device, the device owner generation, and the pool record generation. Buffer handles issued by the pool repeat the device-owner and pool identity and additionally name a buffer slot and slot generation. The pool identity prevents a handle from crossing devices or owners; the slot identity prevents a freed or reused buffer slot from accepting an old descriptor, free, or completion.DeviceMmiohandles name the claimed device, owner generation, BAR or subrange mapping record, and mapping generation. The physical range, cache attributes, and access policy remain in the kernel record and are not user-editable handle fields.Interrupthandles name the claimed device, owner generation, source record, source generation, and route generation. Waiter records may carry their own waiter generation internally, but they must be invalidated whenever the source or route generation changes.
Owner generations and subrecord generations are intentionally separate. The
device owner generation belongs to the device-manager ownership record and
invalidates every DMAPool, DeviceMmio, and Interrupt handle for the old
owner when ownership is revoked, transferred, reset, or reassigned. Pool,
buffer-slot, MMIO-mapping, interrupt-source, and route generations belong to
records that may be reused below the device owner. They prevent stale buffer,
mapping, route, waiter, and completion handles from matching a newly allocated
subrecord even when the device id or pool id is reused.
Every epoch is non-wrapping for authority purposes. Implementations must use an epoch width that cannot wrap during the object’s lifetime, or permanently retire the exhausted device, pool, slot, mapping, source, or route record. Epoch exhaustion is a closed allocation or reassignment failure; it must never wrap back to a value that could match an old handle.
Generation mismatch, wrong object type, wrong device owner, freed slot, detached source, revoked mapping, and wrong device-owner state are hard closed results. The failed operation must not program a descriptor, ring a doorbell, perform an MMIO write, unmask or acknowledge an interrupt, wake a waiter, publish a CQE, decrement completion accounting, free a page for reuse, or mutate the device ledger except for bounded failure accounting or audit metadata.
Transfer, revoke, reset, and reassignment are ordered around those epoch checks:
- Transfer: The old owner leaves
Active, the owner generation advances, and old handles become invalid before a new owner receives handles. A transfer may preserve hardware state only after old interrupt notifications, MMIO write authority, and DMA submissions are either quiesced or represented by old-generation ledger entries that the new owner cannot consume. - Revoke: The device manager invalidates user-visible handles first, then follows the revocation order below: MMIO write authority removed, interrupts masked or detached, queues quiesced or reset, mappings removed, and pages scrubbed before release.
- Reset: Reset or disable advances the owner generation and any affected source, route, pool, mapping, and buffer-slot generations before new handles can be issued. If old DMA writes cannot be proven stopped, buffer slots stay unavailable until reset completion and mapping invalidation prove reuse is safe.
- Reassignment: Interrupt sources, MMIO mappings, and DMA pool records are detached or unmapped, their subrecord generations advance, pending waiters or completions are drained or marked stale, and only then can a new owner receive authority for the reused source, mapping, or slot.
Handle reuse rules:
- stale handles fail closed;
- freed-handle reuse fails closed;
- reallocated slots must not restore authority to old handles;
- old interrupt waiters must not observe or acknowledge a new owner’s interrupt source;
- old DMA handles must not reference a newly allocated buffer in the same slot.
Production proof obligations are split between host tests and QEMU smokes. Host tests must cover the pure validator and state-machine cases: stale owner generation, stale pool or mapping generation, stale buffer slot, stale interrupt source or route, wrong owner, wrong device, wrong object type, freed object, wrong state, epoch exhaustion/retirement, and no side effects on failure. QEMU smokes must prove the hardware-facing ordering: stale DMA handles after free/reuse cannot submit descriptors, stale DMA completions after revoke/reset cannot publish CQEs or mutate reused buffers, stale MMIO handles cannot ring doorbells after revoke, stale interrupt waiters or acknowledgements cannot wake or affect a new owner, and process-exit or driver-crash teardown reaches a zero-live ledger before pages are reused. These production handles and proofs remain open; the current QEMU scratch proofs are prerequisite evidence for this contract, not completion of it.
Implementation note, 2026-05-02 13:18 UTC: capos-lib::device_authority now
implements the bounded pure host-testable validator prerequisite for these
handle epoch invariants. The module models the documented handle and record
identity fields for DMAPool, DMA buffer, DeviceMmio, and Interrupt,
separates device-owner generations from pool, slot, mapping, source, and route
generations, returns explicit fail-closed error labels, blocks the relevant
side-effect class on validation failure, and refuses epoch wrap or retired
epoch reuse. This does not expose production userspace handles, wire kernel
device paths, attach budget/OOM policy to real handle creation, or complete
the QEMU stale-handle or S.11.2 hostile-smoke gates.
Implementation note, 2026-05-03: the pure host-test operation matrix now
enumerates every current validator operation variant:
DMAPool::{AllocateBuffer,IssueBufferHandle},
DMABuffer::{SubmitDescriptor,CompleteDescriptor,FreeBuffer},
DeviceMmio::{Map,Read,Write,RingDoorbell,Unmap}, and
Interrupt::{Wait,Acknowledge,Mask,Unmask}. Each row asserts active
acceptance plus stale owner/subrecord, freed, revoked, and retired failures
with the exact blocked side-effect class for that operation. This remains
ABI-independent host-test evidence only; it does not create production
userspace handles or replace the QEMU stale-handle and S.11.2 hostile-smoke
gates.
Implementation note, 2026-05-02 13:43 UTC: the current kernel device-manager
DMAPool lifecycle and imported-live accounting proofs now adapt their BDF,
owner generation, pool slot, and pool generation into
capos-lib::device_authority records. The QEMU proof records active validator
success, stale-after-revoke failure as dmapool-stale-handle, the validator
reason stale-owner-generation, and side-effect-blocked. This is still a
bounded kernel-proof adapter, not production userspace handle exposure,
DeviceMmio/Interrupt handle wiring, production page cleanup, or S.11.2
hostile smoke completion.
Implementation note, 2026-05-02 17:04 UTC: the current kernel device-manager
also has a bounded manager-owned DeviceMmio record proof adapter. The record
carries BAR, mapping id, mapping generation, and owner generation fields, and
the QEMU virtio-rng device-manager path validates a RingDoorbell operation
through capos-lib::device_authority. After revoke begins, the old handle
fails through the pure validator as stale-owner-generation, records
devicemmio-stale-handle plus side-effect-blocked, and no doorbell write is
attempted. The lifecycle proof blocks RevokingHandles -> MmioRevoked while
the record is attached, then allows the transition after bounded detach. This
does not expose production userspace DeviceMmio authority, program real BAR
mappings, create mapping objects, or complete hostile stale-MMIO smokes.
Implementation note, 2026-05-02 17:29 UTC: the bounded DeviceMmio adapter now
derives the proof mapping from the first decoded PCI memory BAR on the tested
PciDevice through the shared BAR-region validator. The attached manager-owned
record carries that BDF/BAR/base/length metadata and validates that it is the
same BDF, a memory BAR, nonzero length, and the same BAR named by the handle
before constructing the pure capos-lib::device_authority record. The QEMU
smoke asserts region_source=pci-decoded-memory-bar,
region_bound_to_manager=true, bar_present=true, bar_memory=true,
bar_base, and bar_length. This is still prerequisite evidence only: it does
not create userspace DeviceMmio handles, program real MMIO mappings, enforce
cache attributes or write policy, or write a real doorbell.
Implementation note, 2026-05-02 17:54 UTC: the same bounded DeviceMmio
adapter now records fail-closed malformed-region evidence before the positive
attach. Wrong-BDF metadata, wrong BAR/handle mismatch, and zero-length region
metadata all report devicemmio-region-invalid, with
region_invalid_mapping=not-created and
region_negative_side_effect=side-effect-blocked; the proof still records
real_mmio_mapping=not-programmed and real_doorbell=not-written. This is
bounded manager-proof evidence only. It does not create userspace
DeviceMmio handles, map real BAR pages, enforce cache attributes or write
policy, complete hostile stale-MMIO smokes, or perform a real doorbell write.
Implementation note, 2026-05-02 20:14 UTC: the bounded DeviceMmio adapter now
stores future mapping policy metadata on the attached manager-owned record and
reads it back through the active record plus matching DeviceMmioHandle. The
proof line records policy_scope=manager-attached-devicemmio-cache-write-policy,
cache_policy=device-uncacheable,
page_table_protection=capability-scoped-device-nx,
write_policy=claimed-registers-and-doorbells-only, executable=blocked,
userspace_devicemmio=not-started, host_physical_user_visible=0,
policy_bound_to_manager=true, and policy_result=ok. A tampered cache/write
policy record fails closed with policy_tamper_result=fail-closed,
policy_tamper_mapping=not-created, and
policy_tamper_side_effect=side-effect-blocked. This is still metadata proof
only: no PAT/MTRR or page-table programming is performed, no userspace
DeviceMmio handle is created, no real BAR mapping object exists, and no
doorbell is written.
Implementation note, 2026-05-02 20:45 UTC: while the same bounded
manager-owned DeviceMmio record is still active, the proof now validates
hostile RingDoorbell handles through a proof-scoped adapter that uses the
already-attached record rather than manager lookup short-circuits. Wrong owner
generation, wrong mapping generation, wrong mapping id, wrong BAR, and wrong
BDF/device fail closed with exact pure-validator reasons
stale-owner-generation, stale-mapping-generation, wrong-mapping,
wrong-bar, and wrong-device. Each records side-effect-blocked; the proof
also records that the attached manager record is preserved, no fake mapping is
created, and no doorbell is written. This remains bounded proof evidence only:
production userspace handles, real MMIO mappings, real cache/write-policy
enforcement, and hostile stale-MMIO smokes remain open.
Implementation note, 2026-05-03 00:36 UTC: the schema and kernel now include a
result-only DeviceMmio.info skeleton that can wrap a manager-issued
DeviceMmioHandle. The object validates the live device-manager record
through validate_devicemmio_record() before returning status labels such as
userspaceDeviceMmio=manager-issued-skeleton,
managerRecord=validated-active, realMmioMapping=not-programmed,
realDoorbell=not-written, hostPhysicalUserVisible=false,
directMmio=blocked, registerRead=blocked, registerWrite=blocked, and
bootstrapGrant=blocked. The QEMU device-manager lifecycle proof constructs
that cap object while the attached record is active, records
devicemmio_cap_info_result=ok, exercises the serialized
CapObject::call(0, &[]) path and decodes the returned DeviceMmio.info
Cap’n Proto result as devicemmio_cap_serialized_call_result=ok, then
verifies the same cap fails closed after revoke begins as
devicemmio_cap_stale_after_revoke_result=devicemmio-stale-handle; the same
stale object also fails the serialized method-0 path as
devicemmio_cap_serialized_stale_after_revoke_result=invoke-failed. A later
manifest-grant smoke explicitly releases the granted DeviceMmio cap through
CAP_OP_RELEASE and proves a subsequent typed DeviceMmio.info call fails
closed from userspace. A focused grant-cycle smoke now repeats that grant,
release, and stale-info proof twice in sequence and asserts the second
manager-grant-source acquire receives a fresh mapping generation after the
first release; the same smoke also decodes both acquire/release cycles through
the typed volatile HardwareAuditLog.snapshot surface. The focused
hardware-audit interrupt-waiter smoke also decodes recent boot-time
DmaBuffer, DmaPool, and Interrupt driver-crash / reset-disable
lifecycle records from the current volatile 16-record snapshot window. The same
smoke now uses the startSequence cursor to decode older retained
DeviceMmio lifecycle rows that the default latest 16-record tail has skipped.
A 2026-05-10 15:33 UTC manifest-grant follow-up turns
DeviceMmio.map from admission-only into a read-only userspace VMA over the
boot-preseeded BAR page already used by brokered read32. The typed smoke
validates the active DeviceMmioOperation::Map authority check, rejects
writable, executable, unknown, zero-size, unaligned, out-of-BAR, and overflow
requests with typed no-side-effect results, reads the same QEMU BAR value
through the returned userspace address and DeviceMmio.read32, rejects a
duplicate active map, explicitly calls DeviceMmio.unmap, proves a second
unmap is a typed no-op, remaps, and proves stale unmap fails closed after cap
release. Release/drop/driver-crash/reset-disable cleanup revokes any borrowed
user VMA before detaching the manager record. This is read-only BAR VMA
evidence only: it does not add writable MMIO, volatile register writes,
doorbells, host physical/IOVA exposure, post-userspace kernel MMIO mappings,
IOMMU programming, durable/signed audit persistence, concurrent sharing
semantics, or a production provider-driver consumer. A 2026-05-10 20:06 UTC
follow-up promotes DeviceMmio.write32 to one bounded brokered volatile dword
write through that same boot-preseeded kernel MMIO cache after active
manager-attached handle, owner/state, policy/region, pure Write authority,
dword alignment, decoded-BAR range validation, and a single provider-scoped
claim derived from PCI MSI-X metadata, including BDF, BAR, BAR base, offset,
and value. The focused proof writes the claimed virtio-rng MSI-X entry-0
vector-control mask dword,
reads it back through both brokered read32 and the read-only userspace VMA,
then proves an unclaimed
message-address dword write leaves the original value unchanged. Invalid range
and unclaimed calls remain typed no-write results, while stale or released
handles fail closed before any write and do not return a write32 result
payload. This does not add writable userspace BAR mappings, arbitrary register
writes, doorbells, host physical/IOVA exposure, post-userspace arbitrary
remaps, IOMMU programming, or a production provider-driver consumer.
A 2026-05-26 07:32 UTC follow-up
(ddf-userspace-writable-devicemmio-interrupt) proves the cross-authority
non-implication required by the gate (“holding one authority must not imply
either of the others”). Each grant smoke’s granted CapSet is its sole
authority source: the DeviceMmio smoke holds only console + device_mmio
and the Interrupt smoke only console + interrupt. The smokes assert that
the other DDF grants (interrupt/device_mmio and dmapool) are absent from
the CapSet, and that the held cap cannot be reinterpreted as another
interface because the kernel-delivered interface id is fixed at grant time
(negative-authority ... result=ok lines, with the kernel “spawned … 2 caps”
structural counterpart). This is a non-implication proof over the
already-landed authorities; it adds no new kernel MMIO/IRQ/DMA surface. Real
userspace wait/acknowledge over the live route with deferred LAPIC EOI is
proven separately by the provider tx_interrupt/rx_interrupt consumer
(make run-ddf-provider-consumer).
DMAPool Invariants
DMAPool is the only future userspace-facing authority that may cause a
device-visible DMA address to exist.
- Authority: A holder may allocate buffers only from the pool object it was granted. It may not request arbitrary physical frames, import caller virtual memory by address, or derive another pool.
- Handle identity: A pool operation checks the claimed device id, owner generation, pool id, and pool generation before changing pool state. Buffer operations additionally check the buffer slot and slot generation before descriptor validation, completion accounting, free, scrub, or reuse.
- Physical range: Every exported device address must resolve to pages owned by the pool. The kernel records the allowed host-physical page set and validates every descriptor mapping against that set before a device can use it. If an IOMMU domain backs the pool, the exported address is an IOVA, not raw host physical memory.
- Ownership: Each DMA buffer has one pool owner, one device-domain owner, and explicit CPU mappings. Sharing a buffer with another process requires a later typed memory-object transfer; copying packet data is the default until that object exists.
- No raw grants: Userspace never receives an unrestricted host-physical
address. A driver may receive an opaque DMA handle or an IOVA meaningful
only to its
DMAPool/device domain. It cannot turn that value into access to unrelated RAM. - Residency: DMA pages are committed before exposure to the device, resident for the entire device-visible lifetime, unswappable, and scrubbed before reuse by another owner.
- Bounds: Buffer length, alignment, segment count, and queue depth are bounded by the pool. Descriptor chains that point outside an allocated buffer, wrap arithmetic, exceed device limits, or reference freed buffers fail closed before doorbell writes.
- Revocation: Revoking the pool first quiesces the device path using it, prevents new descriptors, waits for or cancels in-flight descriptors, then removes IOMMU mappings or invalidates bounce-buffer handles before freeing pages.
- Reset: If in-flight DMA cannot be proven stopped, revocation escalates to device reset through the owning device object before pages are reused.
- Residual state: Pages returned from a pool are zeroed or otherwise scrubbed before reuse by a different owner. Receive buffers are treated as device-written untrusted input until validated by the driver or stack.
Device-visible memory authority is not ordinary MemoryObject authority.
FrameAllocator and MemoryObject must not become raw physical-address escape
hatches. A future shared-buffer transfer may share CPU-visible packet bytes
after validation, but it does not by itself grant IOVA creation, descriptor
programming, or device write authority.
For the in-kernel QEMU smoke, the kernel is the only DMAPool holder. The
same invariants apply internally even though no userspace capability object is
exposed yet.
Implementation note, 2026-04-24: the initial virtio-net transport uses kernel-owned frame-allocator pages for RX/TX split-virtqueue descriptor, available, and used rings plus the one-shot TX descriptor proof buffer, ARP TX buffer, ICMP TX buffer, smoltcp adapter/TCP TX buffers, and posted RX packet buffers. The smoltcp adapter copies completed RX frame bytes out of those device-written pages before handing them to the stack. Those pages are programmed into the device only by kernel code after modern PCI transport discovery and feature negotiation; no userspace process receives a DMA buffer, physical address, or BAR mapping.
Implementation note, 2026-05-02: the current QEMU virtio-net DMA path routes
those kernel-owned pages through a bounded device_dma pool ledger. The net
smoke proves live pool bytes, page counts, page-rounded MMIO mapping bytes,
config/RX/TX interrupt holds, RX/TX ring depths, and RX/TX
submission/completion and in-flight descriptor accounting. This is the first
kernel-owned DMAPool accounting proof; it does not expose userspace DMA,
MMIO, or interrupt handles and does not complete the production S.11.2
hostile-smoke gate.
Implementation note, 2026-05-02 06:59 UTC: the kernel-owned device_dma
ledger now has an explicit bounded budget/OOM policy for the current
virtio-net proof path: 32 DMA pages, 131072 DMA bytes, queue depth 8,
submission depth 8, four page-rounded MMIO mapping holds, 16384 MMIO mapped
bytes, and three interrupt holds. make run-net emits a scratch-ledger
device-dma: budget oom proof ... proof_result=ok line proving page and byte
allocation over budget, overlarge queue depth, duplicate and over-budget MMIO
holds, MMIO byte over budget, duplicate and over-budget interrupt holds, and
descriptor submission beyond queue depth all fail closed without mutating the
live virtio-net ledger; the proof also revalidates the live ledger. This is a
bounded prerequisite for the production DMAPool contract. It still does not
expose userspace DMAPool, DeviceMmio, or Interrupt handles, wire real
lifecycle hooks, program IOMMU remapping domains, or close the S.11.2 hostile
smoke matrix.
Implementation note, 2026-05-02 16:31 UTC: attached device-manager DMAPool
records now also carry that budget profile. The lifecycle and imported
live-accounting proofs read page, byte, queue-depth, submission-depth, MMIO
mapping, MMIO byte, and interrupt-hold budgets through the active
device-manager record plus the matching DmaPoolHandle. Queue and submission
depth remain per-queue limits; the manager proof records queue_count=2 plus
derived aggregate in-flight/submission budgets and checks imported virtio-net
accounting against those aggregate totals. This keeps the budget policy tied
to manager state but still does not create userspace DMAPool handles or
enforce budgets at userspace handle creation, transfer, or revoke.
Implementation note, 2026-05-02 21:13 UTC: the zero-live device-manager
DMAPool lifecycle proof now validates a proof-scoped tampered
AttachedDmaPoolBudgetPolicyRecord through the manager budget-policy helper
while the attached pool record is active. The tampered record uses the wrong
policy scope/source/label plus stricter page, byte, queue, in-flight, MMIO,
interrupt, and submission budgets, and it fails closed before it can be
treated as a usable policy. The QEMU proof records
budget_policy_tamper_result=fail-closed,
budget_policy_tamper_allocation=not-created,
budget_policy_tamper_ledger=not-mutated,
budget_policy_tamper_teardown=not-advanced, and
budget_policy_tamper_side_effect=side-effect-blocked. This is still
bounded metadata proof only: no userspace DMAPool handle is exposed, no
production userspace DMA page is allocated, freed, or reused, and no real DMA
teardown is claimed.
Implementation note, 2026-05-02 22:16 UTC: the manager-owned DMAPool
budget-accounting proof now fails closed on accounting over the attached
budget instead of only logging passive booleans. The positive zero-live and
imported-live budget_*_within_policy=true labels call a helper that
revalidates the active attached record, matching DmaPoolHandle, owner, active
state, and attached budget policy before accepting the record accounting.
While the zero-live record remains active, synthetic attached-accounting
candidates exceed buffer count, page count, byte count, and the current
aliased in-flight/submission total; each candidate fails closed before it can
be treated as usable manager state. The QEMU proof records exact overrun
reasons, no fake allocation, no ledger mutation, no teardown advancement, and
side-effect blocking. A proof-scoped over-budget attach candidate now fails
before pool generation allocation and records preserved generation state. This
remains bounded manager-record evidence only; later grant slices add the first
single-page bounce-buffer allocation/free authority, but multi-buffer
allocation, DMA mapping, descriptor execution, IOMMU programming, production
driver consumption, and S.11.2 hostile smokes remain open.
Implementation note, 2026-05-02 23:59 UTC, updated 2026-05-03: the schema and
kernel now include a result-only DMAPool.info skeleton that can wrap a
manager-issued DmaPoolHandle. The object validates the live device-manager
record through validate_dmapool_record() before returning status labels such as
userspaceDmaPool=manager-issued-bounce-buffer, realDma=not-attempted,
hostPhysicalUserVisible=false, and directDma=blocked. The QEMU zero-live
device-manager lifecycle proof constructs that cap object while the attached
record is active, records dmapool_cap_info_result=ok, exercises the
serialized CapObject::call(0, &[]) path and decodes the returned
DMAPool.info Cap’n Proto result as
dmapool_cap_serialized_call_result=ok, then verifies the same cap fails
closed after revoke begins as
dmapool_cap_stale_after_revoke_result=dmapool-stale-handle; the same stale
object also fails the serialized method-0 path as
dmapool_cap_serialized_stale_after_revoke_result=invoke-failed. The same
proof now exercises DMAPool.allocateBuffer through call_with_table() on a
real DmaPoolCap in a CapTable: it decodes bufferIndex=0, verifies
CAP_CQE_TRANSFER_RESULT_CAPS, cap_count=1, the transfer-result record’s
DMABUFFER_INTERFACE_ID, the non-transferable same-session result-cap hold,
and DMABuffer.info through the returned result cap. Duplicate allocation and
stale-after-revoke allocation both fail closed without adding result caps. The
duplicate-active valid-size path now reports a structured schema result with
result=dmapool-already-attached, reason=active-buffer-attached,
sideEffect=side-effect-blocked, and bufferPresent=false; the duplicate path
also preserves the manager generation counter. Invalid-size requests use the
same no-result-cap response shape with
result=dmapool-allocation-request-invalid, the exact request reason,
sideEffect=side-effect-blocked, and bufferPresent=false.
As of the 2026-05-08 DMAPool grant-source follow-up, the same bounded path is
also available through KernelCapSource::DmaPool: the grant source attaches a
fresh zero-live manager-owned pool record, stages matching zero-live release
evidence, and lets the child mint one DMABuffer result cap. A 2026-05-09
release-order follow-up has the smoke explicitly release the parent DMAPool
before the result DMABuffer; the parent detach remains pending until the
DMABuffer frees the page and completes the staged zero-live pool detach. A
later 2026-05-09 follow-up adds a typed DMABuffer.freeBuffer method for that
bounded result cap: the method reuses the same FreeBuffer authority validation
and page scrub/ledger/frame-free cleanup path as cap release, emits a
free-buffer audit event, invalidates later DMABuffer.info, and makes the
later cap release a no-op detach. A second bounded follow-up keeps the parent
DMAPool live after that first explicit free, reallocates the same slot with a
fresh slotGeneration, and then repeats the parent-first release proof on the
second buffer. The focused read-side HardwareAuditLog.snapshot smoke also
decodes both slot generations, both typed free-buffer records, the parent
DmaPool release, and both no-op release-after-free records through the
volatile audit cap. The run-net
DMABuffer
driver-crash and reset-disable proofs also cover the same pending-parent
completion path so successful buffer cleanup cannot orphan the staged parent
release. A later admission follow-up adds typed
DMABuffer.submitDescriptor to the same manifest-granted bounce-buffer path:
the method validates the active manager-attached buffer epoch through the
existing DmaBufferOperation::SubmitDescriptor authority validator, echoes
the queue/descriptor/length and generation identity, and proves the same call
fails closed after freeBuffer revokes the old cap. A later symmetric
follow-up adds typed DMABuffer.completeDescriptor to the same bounded path.
The 2026-05-10 request-shaping follow-up routes both typed descriptor calls
through a shared pure bounded descriptor validator: valid bounce-buffer
requests return ok request labels plus queue_count=4,
descriptor_count=8, and buffer_bytes=4096, while out-of-range
queues/descriptors, zero submit lengths, submit lengths beyond the bounce
buffer, and completion lengths beyond the bounce buffer fail closed as
dmabuffer-descriptor-request-invalid with side-effect-blocked before any
descriptor side effect. A later manager-accounting follow-up records only the
bounded manager counter: submit returns manager-inflight-recorded and
raises DMAPool.info live_inflight to 1, completion returns
manager-inflight-completed and restores it to 0, and a valid completion
with no outstanding submission returns dmabuffer-no-inflight-submission.
Too-small descriptor result buffers are rejected before accounting mutation,
and cap-table release drains bounded in-flight accounting before detaching the
bounce buffer. The 2026-05-10 06:37 UTC follow-up makes this
allocateBuffer/freeBuffer page lifecycle the first production-labeled
single-page bounce-buffer allocation/free authority. The typed surfaces report
userspaceDmaPool=manager-issued-bounce-buffer,
allocation=single-bounce-buffer-page,
recordPool=userspace-bounce-buffer-live,
zero-live-dmapool-bounce-buffer, and freeBuffer=bounce-buffer-page; the
underlying device_dma ledger uses a manager-attached bounce-buffer helper and
scrubs before frame free. A 2026-05-10 11:44 UTC follow-up extends that bounded
manifest-granted path to two fixed manager-owned slots; a 2026-05-10
12:49 UTC follow-up extends the same path to three fixed manager-owned slots:
slot 0, slot 1, and slot 2 can be live together, DMAPool.info reports three
live buffers/pages while all are attached, a fourth allocation fails closed as
dmapool-already-attached, and slot 0 can be freed and reused with a fresh
generation while slots 1 and 2 remain live. There is still no allocation
beyond those three fixed slots, real device-visible DMA mapping, host physical
address or IOVA exposure, BAR mapping, production descriptor-ring mutation, CQ
publication, IOMMU programming, or production driver consumer.
Stale allocation attempts preserve the live backing, and page allocation
failure occurs before buffer-generation allocation so it does not burn a
generation.
The 2026-05-10 13:45 UTC follow-up (3bbeb3d4) adds explicit typed
DMABuffer.unmap for the mapped bounce-buffer userspace VMA. The method
validates the live DMABuffer handle before reporting success or no-op,
removes only the borrowed VMA owned by that mapping for the calling process,
and publishes the mapping as absent only after the borrowed-range ownership
check, page-table unmap, TLB wait, and waiter cleanup succeed. While teardown
is in progress, concurrent map/free/release paths fail closed against an
in-progress mapping state. A second unmap returns dmabuffer-mapping-absent /
no-user-mapping with no side effect. This is userspace VMA cleanup only: it
does not free or scrub the bounce page, detach the buffer record, change
DMAPool.info live buffer/page/in-flight counts, program or remove real DMA or
IOMMU mappings, expose host physical/IOVA addresses, mutate descriptor rings,
publish CQ entries, or add a production driver consumer.
The 2026-05-10 14:12 UTC follow-up moves bounded descriptor accounting from a
single pool-global descriptor identity to per-slot state on each live
manager-owned DMABuffer record. DMAPool.info live_inflight remains the
aggregate sum across live slots. A valid submit on slot 0 and a valid submit
on slot 1 can coexist; duplicate submit on either same slot still fails
closed; mismatched completion preserves that slot’s descriptor without
touching other slots; matching completion of slot 0 decrements the aggregate
while slot 1 remains in flight; explicit freeBuffer of an in-flight slot
fails closed; and cap-release/process-exit cleanup drains only the releasing
slot’s descriptor before detach. This is still bounded manager accounting and
does not mutate descriptor rings, publish CQ entries, expose host physical or
IOVA addresses, attempt direct DMA, program IOMMU state, or add a production
driver consumer.
The 2026-05-10 18:11 UTC follow-up makes the single manager-owned
bounce-buffer page exclusive between userspace borrowed-VMA ownership and
manager in-flight descriptor ownership for each DMABuffer cap. A valid
submit while the same cap still has a live mapping fails closed as
dmabuffer-mapping-live / user-mapping-live before manager in-flight
accounting changes; explicit unmap restores submit. A valid map while the
slot has an in-flight descriptor fails closed as dmabuffer-inflight-submission
/ in-flight-submission, returns addr=0, and does not publish a borrowed
VMA; matching completion restores map for that slot. The lock order remains
cap mapping state before manager validation, with no address-space lock held
across manager state mutation.
The 2026-05-10 19:29 UTC follow-up adds bounded completion data on that same
manager-owned bounce-buffer page. The successful matching
DMABuffer.completeDescriptor path keeps the existing
manager-inflight-completed result labels, validates the active owner,
pool/slot generation, queue/descriptor identity, and submitted length, then
writes a deterministic byte pattern into only the accepted completionLength
bytes before clearing the in-flight record. Because submit is blocked while a
cap-owned borrowed VMA is live and map is blocked while the slot is in flight,
the write happens while no live user VMA exists for that slot; a later
successful map lets userspace read the pattern. Invalid requests, stale caps,
no-inflight completions, descriptor mismatches, length-exceeded completions,
mapped-live cases, and after-free calls do not write. This is
userspace-visible bounce-buffer completion data, not device DMA completion:
there is still no descriptor-ring mutation, CQ publication, direct DMA,
host physical/IOVA exposure, IOMMU programming, or production driver consumer.
Implementation note, 2026-05-10 04:40 UTC: duplicate-active valid-size and
invalid-size DMAPool.allocateBuffer requests now use schema result data for
domain rejection instead of an application-exception label. The no-result-cap
response reports either result=dmapool-already-attached /
reason=active-buffer-attached or
result=dmapool-allocation-request-invalid with the exact request reason, plus
sideEffect=side-effect-blocked and bufferPresent=false, before any
allocation side effect.
The 2026-05-10 live-accounting follow-up also carries that bounded frame into
the attached DMAPool record exposed by typed DMAPool.info: the manager
record starts as zero-live-dmapool-bounce-buffer, changes to
userspace-bounce-buffer-live with one, two, or three live 4096-byte pages while
manager-attached DMABuffer result caps exist, and returns to zero-live after
typed DMABuffer.freeBuffer or cap release scrubs/releases every live bounce
page. The same lifecycle proof validates that active manager-record accounting
against the attached budget policy before treating allocation as usable state.
This still does not expose a device-visible DMA address, IOVA, host physical
address, production descriptor side effect, DMA mapping, or production driver
consumer.
Implementation note, 2026-05-11 03:00 UTC: the manifest-granted
manager-owned fixed bounce-buffer DMAPool path now has its own
device-manager budget policy instead of importing the live virtio-net
device_dma policy. The policy covers three live buffers/pages, 12288 bytes,
four queues, eight descriptors per queue, one in-flight descriptor per live
slot, zero MMIO mappings/bytes, and zero interrupt holds.
DMAPool.allocateBuffer validates the next-live manager accounting against
that policy before slot selection, frame allocation, generation allocation,
result-cap minting, or manager ledger mutation. With all three fixed slots
live, a fourth valid-size allocation returns no result cap and reports
result=dmapool-budget-exceeded, reason=over-buffer-budget,
sideEffect=side-effect-blocked, and bufferPresent=false. Imported live
virtio-net proof records continue to use the device_dma:virtio-net budget
policy. This remains the bounded manager-owned bounce-buffer path: direct DMA
is blocked, host physical addresses and IOVAs stay hidden, descriptor rings
and completion queues are not mutated, and IOMMU/remapping plus production
driver consumption remain out of scope.
Implementation note, 2026-05-11 06:10 UTC: the same fixed-slot
manager-owned DMAPool family now revalidates current budget accounting
before publishing an allocated DMABuffer result cap, before acquire-audit
publication, before parent DMAPool release intent or detach, before
grant-rollback/drop/teardown detach, before pending-parent release completion,
before DMABuffer page release, and before descriptor-completion state
advancement. The focused grant smoke labels the full-pool budget rejection as
no leaked result cap, no generation burn, no ledger mutation, and no stale
authority publication. DeviceMmio and Interrupt budget propagation is not
changed by this slice.
Implementation note, 2026-05-03 02:05 UTC: the schema and kernel now include a
result-only DMABuffer.info skeleton that can wrap the manager-attached
DmaBufferHandle record already issued inside the zero-live DMAPool
lifecycle proof. The object validates the live manager-owned buffer record
through validate_dmabuffer_record() and the existing pure DMA-buffer
validator before returning status labels such as
userspaceDmaBuffer=manager-issued-bounce-buffer,
managerRecord=validated-active, bufferRecord=manager-attached-buffer,
realDma=not-attempted, hostPhysicalUserVisible=false,
directDma=blocked, descriptorSubmit=manager-inflight-accounting,
descriptorComplete=manager-inflight-accounting, freeBuffer=bounce-buffer-page, and
bootstrapGrant=blocked. The QEMU zero-live lifecycle proof constructs that
cap object while the buffer record is active, records
dmabuffer_cap_info_result=ok, exercises the serialized
CapObject::call(0, &[]) path and decodes the returned DmaBuffer.info
Cap’n Proto result as dmabuffer_cap_serialized_call_result=ok, then verifies
the same cap fails closed after revoke begins as
dmabuffer_cap_stale_after_revoke_result=dmabuffer-stale-handle; the same
stale object also fails the serialized method-0 path as
dmabuffer_cap_serialized_stale_after_revoke_result=invoke-failed. Later
bounded grant slices expose this result cap through the DMAPool
manifest-grant path, add typed DMABuffer.freeBuffer, and add bounded
userspace bounce-buffer DMABuffer.map / DMABuffer.unmap plus
manager-accounted request-shaped DMABuffer.submitDescriptor /
DMABuffer.completeDescriptor; the path still exposes no real DMA,
descriptor-ring mutation, CQ publication, host physical address or IOVA export,
production page cleanup/reuse, or production userspace DMABuffer completion.
Implementation note, 2026-05-11 11:22 UTC: the provider-consumer smoke now
uses that same bounded bounce-buffer path to prove a
descriptor-ring-equivalent provider side effect. After
DMABuffer.submitDescriptor validates live owner/pool/slot authority,
descriptor queue/id/length bounds, no live user mapping, and no duplicate
in-flight descriptor, the manager scrubs the page and writes a provider-visible
shadow descriptor entry with magic, queue, descriptor id, submitted length, and
flags before writing the existing submit marker and committing the in-flight
record. DMABuffer.completeDescriptor still writes completion bytes only
inside the validated completion length, and the smoke proves the shadow entry
and marker remain visible outside that completion window. Provider-effect
submits shorter than the 24-byte shadow-descriptor-plus-marker footprint are
rejected as a typed no-side-effect boundary, even though the shared descriptor
request shape is otherwise valid. This is a bounded bounce-buffer side effect
only; no hardware descriptor ring, CQ publication, MMIO doorbell, direct DMA,
IOVA, host physical address, or remapping-domain claim is added.
Implementation note, 2026-05-11 12:01 UTC: the same submit path now replaces
the shadow-descriptor payload with a selected provider-owned queue entry plus
marker. The entry records queue magic, queue id, tail, descriptor id,
submitted length, and flags after descriptor authority validation and the
submit scrub; make run-ddf-provider-consumer maps the buffer after
completion and proves the queue entry and marker remain visible outside the
completion window. Provider-effect submits shorter than the current 72-byte
provider queue-entry-plus-marker footprint are rejected before in-flight
accounting or provider-visible mutation. This remains bounded bounce-buffer
evidence only:
no hardware descriptor ring, CQ publication, MMIO doorbell, direct DMA, IOVA,
host physical address, or remapping-domain claim is added.
Implementation note, 2026-05-12 20:30 UTC: the accepted
DMABuffer.submitDescriptor path now constructs the candidate in-flight
descriptor in manager-local state and validates the resulting
DMAPool budget/accounting before scrubbing or writing provider-visible
bounce-buffer bytes. The provider-consumer smoke snapshots the selected
provider queue-entry and marker bytes, drives a short provider-effect submit
rejection, and proves both those bytes and live_inflight=0 are preserved.
The selected virtio-net TX publication gate remains separately bounded after
provider-entry write: quiesced publication still fails with no extra pin and
no doorbell, not with rollback of the already-written shadow entry.
Implementation note, 2026-05-11 14:39 UTC, branch commit f04a14f4: the
selected provider-owned queue entry now carries a staged claimed virtio-net
notify-offset admission record instead of only a “requires claim” gate. The
selected queue 1 path records accepted notify-offset admission plus blocked
wrong-queue and wrong-offset admissions after descriptor authority validation
and submit scrub; a separate queue 0 submit proves non-selected queues remain
neutral and blocked from selected-backend doorbell metadata. This is not a
real doorbell path: no virtio-net notify BAR handle is granted, no notify
register is written, no real virtio-net descriptor ring is mutated, and
production userspace NIC readiness is not claimed.
Current QEMU evidence: the same make run-net path now
exports a bounded live-pool snapshot from the kernel-owned virtio-net
device_dma ledger and feeds it into a device-manager DMAPool record proof.
The live record carries buffer/page count, live bytes, current in-flight
submissions, committed/resident/unswappable flags, and scrub-before-release
policy. The device-manager proof rejects both DmaMappingsRemoved and teardown
detach while that authoritative ledger snapshot remains live. The proof now
calls the device_dma teardown-evidence API and records the expected
authoritative-ledger-live block with matching imported live accounting, then
reports completion as deferred because no authoritative zero-live/scrubbed
evidence is available for the live virtio-net ledger. It does not zero the
imported record to simulate teardown, does not claim DmaMappingsRemoved,
terminal Dead, or release for the live virtio-net pages, and does not scrub
or free live virtio-net DMA pages. This is still a prerequisite
record-accounting proof:
the current pages remain kernel-owned, only bounded info-skeleton hardware cap
grants are exposed to userspace, production page release hooks for live
virtio-net DMA are not wired, IOMMU remapping is not
programmed, and S.11.2 hostile smokes remain open.
The same smoke emits a separate device_dma scratch proof for the positive
zero-live teardown-evidence path: teardown_evidence() fails closed before
quiesce and scrub markers, rejects one-marker states, and reports
authoritative-ledger-zero-live only after both markers are set. The scratch
proof revalidates the live virtio-net ledger but does not mutate, zero, scrub,
free, or claim teardown completion for real virtio-net DMA pages.
Implementation note, 2026-05-03 03:21 UTC: the zero-live device-manager
DMAPool lifecycle proof now consumes that scratch zero-live evidence before
final pool detach, and binds that evidence to the attached record source. The
manager-owned zero-live record is labeled device-manager /
zero-live-dmapool-bounce-buffer; imported live records keep the source labels from
the authoritative device_dma snapshot. After the proof-scoped buffer record
is actively freed and detached, a zero-live pool detach with mismatched
virtio-net / kernel scratch evidence fails closed as
dmapool-zero-live-evidence-invalid, and detach without authoritative
evidence fails closed as dmapool-zero-live-evidence-absent. Only scratch
authoritative-ledger-zero-live evidence carrying the same record source plus
both quiesce and scrub markers allows the manager-owned record to detach and
the revocation path to advance to DmaMappingsRemoved, Dead, and release.
This remains scratch/no-real-DMA evidence: it does not tear down live
virtio-net pages, program or remove IOMMU mappings, expose userspace
DMAPool mapping/descriptor authority, or claim production page cleanup
beyond the bounded manager-attached bounce page.
The current scratch proof set also covers stale DMA page handles without
touching real virtio-net pages: reusing the same synthetic physical page bumps
the DMA page generation, the old handle fails closed as stale-dma-handle,
wrong-queue and wrong-label frees preserve the active page record, and duplicate
free remains rejected. Production userspace DMAPool stale-handle smokes,
descriptor-abuse coverage, revoke/reset races, and real quiesce/scrub/release
remain open.
Implementation note, 2026-05-02 23:23 UTC: the kernel-owned virtio-net
device_dma page release path now validates the DeviceDmaAllocation against
the live ledger before any scrub or frame-allocator call, scrubs the frame
through the HHDM mapping, removes the ledger entry only after scrub succeeds,
and then returns the frame to the allocator. The frame scrub helper checks
frame alignment, HHDM/allocator initialization, range, and allocated state
before zeroing the page. make run-net emits a bounded
device-dma: release scrub proof line using a proof-only kernel-owned page:
stale generation, wrong queue, and wrong label release attempts fail before
scrub, frame free, or ledger mutation; the active release path records
scrub_before_frame_free=true and ledger_removed_after_scrub=true. This is
still no-userspace-handle/no-real-teardown evidence for the current
kernel-owned virtio-net path only; production DMAPool handles, real
device-manager teardown hooks, IOMMU/bounce-buffer mapping removal, hostile
stale-DMA smokes, and full page lifecycle cleanup remain open.
Implementation note, 2026-05-02 12:28 UTC: the same scratch proof family now
covers stale DMA completion ordering without touching real virtio-net pages. A
synthetic reused DMA page slot bumps generation, stale completion validation
checks the page generation before queue-completion accounting, and the old
completion fails closed as stale-dma-handle before completion counters can
underflow or the reused page/submission state can mutate. This is still
prerequisite evidence only: production userspace DMAPool hostile smokes,
reset/revoke races with outstanding descriptors, CQ notification publication,
real quiesce/scrub/release, and IOMMU or bounce-buffer teardown remain open.
Implementation note, 2026-05-02 14:03 UTC: that scratch stale-DMA-completion
proof now adapts the synthetic DMA buffer slot into
capos-lib::device_authority before completion accounting can mutate. The
QEMU line records current-handle validation as ok, stale same-slot reuse as
stale-slot-generation, and side-effect-blocked, then preserves the
existing stale-dma-handle completion outcome. This remains a
scratch/no-real-DMA validator adapter, not production userspace DMAPool
authority or S.11.2 hostile-smoke completion.
Implementation note, 2026-05-02 15:15 UTC: the same scratch proof family now
adds a paired stale-completion publication check. A synthetic reset bumps the
device owner generation so an old completion fails as
stale-owner-generation with side-effect-blocked before any CQ publication.
The same-slot reuse path then fails the old completion as stale-dma-handle,
preserves the new submission accounting, and records both
cq_publication_blocked=true and new_owner_exposure_blocked=true. This is
still scratch/no-real-DMA evidence; production userspace DMAPool completion
notification, real hardware stale-completion injection, reset/revoke races,
and IOMMU or bounce-buffer teardown remain open.
Implementation note, 2026-05-02 12:46 UTC: the device-manager interrupt
handoff proof now includes a bounded stale IRQ after-detach check. After an
attached interrupt route is detached, the proof delivers the old LAPIC vector
through the dispatch path and requires stale_irq_delivery_after_detach to be
unregistered with stale_irq_wake_blocked=true; the old route handle also
continues to fail as interrupt-stale-route. This is prerequisite evidence
for route teardown ordering only. It does not cover a pending hardware IRQ
across reset, userspace Interrupt waiter wakeup semantics, or reassignment
reuse by a new owner.
Implementation note, 2026-05-02 14:24 UTC: the same device-manager interrupt
handoff proof now adapts the attached source into
capos-lib::device_authority before exercising the active wait path. After
revoke begins, the old handle fails the pure validator as
stale-owner-generation with side-effect-blocked, then the proof preserves
the existing interrupt-stale-route, detached-vector unregistered, and
stale_irq_wake_blocked=true checks. This remains a proof adapter, not
production userspace Interrupt handles, real waiters, reset/reassignment
reuse proof, or S.11.2 hostile-smoke completion.
Implementation note, 2026-05-02 14:51 UTC: the interrupt handoff smoke now
adds two bounded stale IRQ ordering points. After revoke begins and while the
route is still attached, delivery to the old vector remains masked, matches
the attached route generation, reports wake blocking, and leaves the old route
delivery count unchanged. During the reset phase, a synthetic route-registry
same-vector reuse proof re-registers and claims the route with bumped source
and route generations, then shows delivery to that vector is still masked,
matches the new route generation, and leaves the reused route delivery count
unchanged.
This is route-manager prerequisite evidence only: it is not a true pending
hardware MSI/reset hostile smoke, does not involve userspace Interrupt
waiters, and does not prove DMA buffer reuse race closure.
Implementation note, 2026-05-03 18:48 UTC: the interrupt handoff smoke now
snapshots a bounded pending IRQ token from the old vector, source id, source
generation, and route generation before revocation. Checking that token after
revoke blocks as stale-pending-irq-masked with reason route-masked; after
detach it blocks as stale-pending-irq-unregistered; and after reset/reuse it
blocks as stale-pending-irq-generation with reason
source-route-generation-mismatch. Each check records
side-effect-blocked, wake blocking, and unchanged delivery counts, and the
reset/reuse check records that the new route did not receive a delivery count.
The same proof rejects a malformed pending token with a zero generation as
stale-pending-irq-invalid-state before any delivery-count mutation.
This was bounded token-generation evidence only and did not inject a real
pending MSI across reset; the real-int $vector injection added at
2026-05-05 18:17 UTC (see below) closes the S.11.2.7 stale IRQ
hostile-smoke gate row by exercising the production CPU exception entry path
across the same revoke, detach, and reset/reuse boundaries.
Implementation note, 2026-05-09 18:12 UTC: the pending IRQ token decision path
now has a pure capos-lib::device_authority validator. Host tests cover
current-route acceptance and the same fail-closed label space used by the
kernel/QEMU proof: stale source generation, stale route generation, both
generations changed after reuse, source mismatch, route masked, route
unregistered, invalid route state, invalid owner, malformed zero-generation or
unassigned source identity, and unsupported vector. The kernel still snapshots
the live dispatch slot and delivery count, but delegates the pending-token
identity/state decision to that shared helper before returning
stale-pending-irq-* labels. This is validator/adapter coverage only; it does
not expose production userspace Interrupt waiters or wait/ack/mask/unmask
authority.
Implementation note, 2026-05-05 18:17 UTC: the interrupt handoff smoke now
fires a real INT $vector instruction at the device MSI vector at three
points across the revoke/reset/reuse boundary, exercising the production
IDT entry, extern "x86-interrupt" stub, record_lapic_delivery dispatch
slot read, and LAPIC EOI write rather than the helper-call path the prior
proofs used. The new proof scope strings drop “no-real-msi” and read
stale-vector-after-detach-real-int-vector-injected,
manager-attached-claimed-masked-after-revoke-real-int-vector-injected,
route-registry-vector-reuse-during-reset-real-int-vector-injected, and
bounded-pending-irq-token-generation-real-int-vector-cross-reset-injection.
Each injection point requires the slot’s delivery count to remain zero
before and after the real INT, and the post-INT outcome to match
masked (after revoke, route still attached but masked), unregistered
(after detach, slot cleared), and masked (after reset/reuse, slot now
belongs to a freshly registered+claimed route with bumped source and route
generations). The proof emits a closure summary
s11_2_7_proof_scope=s11-2-7-stale-irq-after-reset-real-int-vector-cross-reset-injection-no-userspace-waiter,
s11_2_7_real_irq_injected_across_reset=ok,
s11_2_7_old_waiter_cannot_wake_new_owner=true, and
s11_2_7_stale_ack_blocked=true. This closes the S.11.2.7 row of the
hostile-smoke acceptance matrix at make run-net (which invokes
tools/qemu-net-smoke.sh). It does not yet create a userspace Interrupt
waiter object; the in-flight delivery is observed via the kernel-owned
dispatch slot atomic state machine that the production path consumes.
S.11.2.9 hostile-smoke gate-wiring closed 2026-05-05 20:49 UTC (see
the implementation note below).
Implementation note, 2026-05-05 19:37 UTC: the device-manager hostile-smoke
suite now closes the S.11.2.8 stale-DMA-completion-after-reset row.
prove_qemu_stale_dma_completion_handoff claims a fresh probe-then-driver
record on the virtio-net PCI BDF (separate from the live virtio-net driver
state) and walks it through the same revoke, detach, and reset/reuse
boundaries S.11.2.7 uses. At each boundary the proof allocates a real
virtio-net DMA page through the production
device_dma::allocate_virtio_net_page helper, frees the page through
device_dma::free_virtio_net_page, reallocates so the live ledger’s page
generation advances, and synthesizes a stale DeviceDmaAllocation keyed to
the live phys with a decremented generation. The synthesized stale handle
is then fed to the production
device_dma::record_virtio_net_completion_for_allocation path – the same
function the live virtio-net Virtqueue::record_used_completion_for_allocation
invokes after descriptor tracking validates a hardware used-ring entry.
The production validator rejects each stale injection as stale-dma-handle
before any queue accounting decrement, completion side effect, CQ
publication, or new-owner memory exposure. The bounded run-net proof
records real_completion_inject_after_revoke_result=stale-dma-handle,
real_completion_inject_after_detach_result=stale-dma-handle,
real_completion_inject_after_reset_reuse_result=stale-dma-handle, all
three with side-effect-blocked, queue_account_preserved=true,
live_page_preserved=true, cq_publication_blocked=true,
new_owner_exposure_blocked=true, freed_buffer_unchanged=true, and
generation_bumped=true, plus a closure summary
s11_2_8_proof_scope=s11-2-8-stale-dma-completion-after-reset-real-free-realloc-cross-revoke-detach-reset-reuse-no-userspace-dmapool,
s11_2_8_real_completion_injected_across_reset=ok,
s11_2_8_old_completion_cannot_publish_to_new_owner=true,
s11_2_8_freed_buffer_reuse_blocked=true, and
s11_2_8_accounting_underflow_blocked=true. The new shape is enforced in
tools/qemu-net-smoke.sh and runs from make run-net. This is the
production paired stale-DMA-completion proof showing old completions cannot
publish stale CQ notifications or expose new-owner memory after real
revoke, detach, and reset/reuse boundaries with real free + realloc page
generation advances on the live kernel-owned ledger; S.11.2.9 hostile-smoke
gate-wiring closed 2026-05-05 20:49 UTC (see the implementation note
below). Userspace DMAPool handles and real device-manager page
quiesce/scrub/release hooks remain open as separate follow-ups.
Implementation note, 2026-05-05 20:49 UTC: the S.11.2.9 hostile-smoke
coverage row of the acceptance matrix is closed by aggregating every
matrix-row proof line into the make run-net -> tools/qemu-net-smoke.sh
gate. Every proof line referenced by the matrix has at least one
assertion in the harness today; the assertion shape varies by row.
The two driver-crash lines wired by that gate slice, the existing
S.11.2.8 device-manager: dma completion handoff proof closure-summary
line, and the S.11.2.7 device-manager: interrupt handoff proof
closure-summary line (whose trailing anchor was added by this slice
for harness-strictness consistency with S.11.2.8) all use anchored
extended-regex assertions (field-by-field match plus
proof_result=ok[[:cntrl:]]?$ trailing anchor); other matrix-row rows
reuse the harness’s pre-existing mix of unanchored extended-regex and
fixed-string grep -Fq assertions on the emitted proof lines. The two previously unasserted lines wired by this
slice are device-manager: devicemmio driver crash hook proof source=devicemmio-driver-crash-hook ... trigger_path=trigger-driver-crash-for-devicemmio and
device-manager: interrupt driver crash hook proof source=interrupt-driver-crash-hook ... trigger_path=trigger-driver-crash-for-interrupt. Both proofs were
already emitted by the kernel on every boot (via
prove_qemu_devicemmio_driver_crash_hook and
prove_qemu_interrupt_driver_crash_hook in kernel/src/device_manager/proofs.rs)
and exercise the explicit driver-crash teardown trigger path with a stale
rerun noop, validate-live revoked cap state, and cap_release_after_crash
as noop. The chosen gate strategy keeps S.11.2.9 inside make run-net
rather than splitting into a separate make run-hostile-smokes target,
because all six matrix rows depend on the same virtio-net device bring-up
state (probe-then-driver records on the virtio-net BDF, real IDT vector
injection, real DMA page free + reallocate). A separate target would
duplicate the bring-up cost without adding coverage. Tightening the
remaining unanchored assertions to the same anchored shape is a
follow-up harness-hardening task and is not part of S.11.2.9 closure because
each affected proof line is still uniquely identified by its
emitted prefix and the asserted field set. Production userspace
DMAPool/DeviceMmio/Interrupt handles, real device-manager page
quiesce/scrub/release hooks, hardware-backed provider-driver Interrupt
wait/ack dispatch beyond the current bounded route-dispatch waiter proof,
durable/signed production audit consumption beyond the first volatile
HardwareAuditLog.snapshot cap, and IOMMU domain programming all remain open
as separate follow-ups tracked in
docs/backlog/hardware-boot-storage.md and the docs/tasks/README.md
userspace-driver-transition bullet.
Implementation note, 2026-05-08 09:44 UTC: the same make run-net gate now also
asserts cap-specific DMA driver-crash proofs. DmaBufferCap routes the
explicit trigger through the bounded FreeBuffer cleanup path and proves page
scrub/ledger/frame-free labels before stale rerun and post-trigger cap release
both return noop; DmaPoolCap routes the explicit trigger through the
zero-live evidence-gated detach path and proves authoritative zero-live,
quiesced, and scrubbed evidence labels before stale rerun and post-trigger cap
release return noop.
Implementation note, 2026-05-03 01:05 UTC: the schema and kernel now include a
result-only Interrupt.info skeleton that can wrap a manager-issued device
handle plus the attached DeviceInterruptRoute. The object validates the
live manager record, owner, claimed route, and attached route record through
validate_interrupt_record() before returning status labels such as
userspaceInterrupt=manager-issued-skeleton,
managerRecord=validated-active, routeRecord=manager-attached-route,
realInterruptDelivery=not-delivered, wait=admission-check-only,
acknowledge=admission-check-only, mask=route-state-control,
unmask=route-state-control, and bootstrapGrant=blocked. The interrupt handoff
QEMU proof constructs that cap object while the route record is active,
records interrupt_cap_info_result=ok, exercises the serialized
CapObject::call(0, &[]) path and decodes the returned Interrupt.info
Cap’n Proto result as interrupt_cap_serialized_call_result=ok, then verifies
the same cap fails closed after revoke begins as
interrupt_cap_stale_after_revoke_result=interrupt-stale-handle; the same
stale object also fails the serialized method-0 path as
interrupt_cap_serialized_stale_after_revoke_result=invoke-failed. A later
manifest-grant smoke explicitly releases the granted Interrupt cap through
CAP_OP_RELEASE and proves a subsequent typed Interrupt.info call fails
closed from userspace. A focused grant-cycle smoke now repeats that grant,
release, and stale-info proof twice in sequence and asserts the second
manager-grant-source acquire preserves the source generation and receives a
fresh route generation after the first release; the same smoke also decodes
both acquire/release cycles through the typed volatile
HardwareAuditLog.snapshot surface. The focused
hardware-audit interrupt-waiter smoke also decodes recent boot-time
DmaBuffer, DmaPool, and Interrupt driver-crash / reset-disable
lifecycle records from the current volatile 16-record snapshot window. The same
smoke now uses the startSequence cursor to decode older retained
DeviceMmio lifecycle rows that the default latest 16-record tail has skipped.
A 2026-05-09 19:18 UTC follow-up adds a bounded Interrupt.wait admission
method to that skeleton. The method validates the same manager-attached route,
snapshots a pending-token candidate, delegates to the shared
capos-lib::device_authority pending-IRQ validator, and returns typed labels
through capos-rt; the focused grant smoke asserts the current masked-route
result stale-pending-irq-masked, reason route-masked,
side-effect-blocked, matching token/current source and route generations,
unchanged delivery counts, no waiter wake, and fail-closed behavior after cap
release. This is bounded
manager-issued skeleton evidence only: there is no blocking wait,
real hardware acknowledgement, real hardware mask/unmask side effect,
interrupt delivery authority, real waiter object, production Interrupt completion,
durable/signed audit persistence, or concurrent sharing claim. A
2026-05-09 23:21 UTC follow-up adds bounded Interrupt.acknowledge
admission to the same skeleton. It validates the manager-attached route through
the existing Acknowledge authority path, returns
admission-check-only, interrupt-ack-not-attempted, and
side-effect-blocked, and proves delivery counts remain unchanged with no
waiter wake or hardware acknowledgement.
A 2026-05-09 23:52 UTC follow-up adds bounded Interrupt.mask and
Interrupt.unmask admission to the same skeleton. They validate the
manager-attached route through the existing Mask and Unmask authority
paths, return admission-check-only,
interrupt-mask-not-attempted / interrupt-unmask-not-attempted, and
side-effect-blocked, and prove route state and delivery counts remain
unchanged with no hardware mask/unmask, waiter wake, or IRQ delivery.
A 2026-05-10 04:01 UTC follow-up promotes those methods to bounded
route-state control over the manager-attached dispatch slot. Interrupt.unmask
now changes claimed-masked to driver-unmasked, Interrupt.mask changes it
back to claimed-masked, and both preserve delivery counts while still
avoiding hardware MSI/MSI-X table programming, waiter wakeup, hardware
acknowledgement, or real IRQ delivery.
A 2026-05-10 22:54 UTC follow-up wires real waiter completion to the
existing route-dispatch delivery counter from scheduler/poll context. The
poller observes matching delivered routes by vector, source generation, and
route generation without taking the waiter-table lock in the IRQ dispatch path,
then revalidates the manager-attached route under the route-post exclusion
before posting the deferred cap completion. The focused grant smoke proves the
first unmasked manifest-granted wait completes as interrupt-delivered /
waiter-completed-irq with real_interrupt_delivery=delivered and an
advanced delivery count, while a second unmasked wait still remains pending
until Interrupt.mask completes it through the existing
interrupt-waiter-cancelled / waiter-completed-no-irq path. Stale, masked,
released, reset, or reused routes do not wake as delivered IRQs. This remains a
bounded route-dispatch proof; it does not program hardware MSI/MSI-X tables,
acknowledge hardware, add provider-driver interrupt consumption, or claim
hostile hardware isolation.
Implementation note, 2026-05-03 13:49 UTC: the result-only
DMAPool.info, DMABuffer.info, DeviceMmio.info, and Interrupt.info
surfaces now return numeric identity fields alongside the conservative status
labels. The fields mirror the documented handle identity shape: deviceId,
BDF bus/device/function, owner generation, and the relevant pool id/generation,
buffer slot/generation, BAR/mapping id/generation, or interrupt
source/generation/route generation. The QEMU proof logs and net smoke assert
the active serialized method-0 decode for those fields, while stale method-0
calls still fail closed as invoke-failed. This remains a result-only
manager-issued skeleton surface; it does not add production DMA allocation,
free, map, submit, or completion authority, production MMIO mapping or
doorbell authority, production interrupt wait/ack/mask/unmask authority, real
DMA page cleanup/reuse, hostile hardware isolation, or S.11.2 completion.
Implementation note, 2026-05-03 16:37 UTC: the bounded interrupt route
identity skeleton now carries separate source and route generations end to end.
DeviceInterruptRoute, LegacyIoApicInterruptRoute, route records,
diagnostics summaries, dispatch-slot metadata, and the device-manager attached
interrupt bridge store both generations. Registration allocates both fields,
PCI MSI-X route reassignment preserves source generation while bumping only the
route generation, release/re-register allocates both generations fresh, and
Interrupt.info returns the independent values. The QEMU net smoke asserts the
split in PCI and legacy route logs, metadata proof logs, handoff proof logs,
serialized Interrupt.info, and cap-release proof logs. This closes only the
bounded identity proof gap; it does not expose production userspace
Interrupt authority, create real waiters, or complete the S.11.2 hostile IRQ
smoke matrix.
DeviceMmio Invariants
DeviceMmio is register authority, not memory authority.
- Authority: A holder may map only BARs or subranges recorded in the claimed device object. It may not map PCI config space globally, another function’s BAR, RAM, ROM, or synthetic kernel pages.
- Handle identity: Each call checks the claimed device id, owner generation, BAR or subrange mapping record, and mapping generation before mapping, unmapping, reading, or writing registers.
- Physical range: Each mapping is bounded to the BAR’s decoded physical range, page-rounded by the kernel, and tagged as device memory with cache attributes appropriate for MMIO. Partial BAR grants must preserve page-level isolation; otherwise the grant must cover the whole page-aligned register window and be treated as that much authority.
- Ownership: At most one mutable driver owner controls a device function’s
MMIO at a time. Management capabilities may inspect topology, but register
writes require the claimed
DeviceMmioobject. - No DMA implication: Mapping registers does not grant any DMA buffer,
frame allocation, interrupt, or config-space authority. Doorbell writes are
accepted only as effects of register access; descriptor validity is enforced
by
DMAPoolbefore queues are made visible to the device. - Revocation: Revocation unmaps the driver’s register pages, marks the device object unavailable for new calls, and invalidates outstanding MMIO handles. Stale mappings or calls fail closed.
- Reset: Revoking the final mutable
DeviceMmioowner resets or disables the device unless a higher-level device manager explicitly transfers ownership without exposing it to an untrusted holder.
Interrupt Invariants
Interrupt is event authority for one routed source.
- Authority: A holder may wait for, mask/unmask where supported, and acknowledge only its assigned vector, line, or MSI/MSI-X table entry. It may not reprogram arbitrary interrupt controllers or claim another source.
- Handle identity: Each wait, mask, unmask, and acknowledge checks the claimed device id, owner generation, source id, source generation, route generation, and any live waiter generation before affecting delivery state.
- Ownership: Each interrupt source has one delivery owner at a time. Shared legacy lines must be represented as a kernel-demultiplexed object with explicit device membership, not as ambient access to the whole line.
- Range: The capability records the hardware source, vector, trigger mode, polarity, and target CPU/routing state. User-visible operations are checked against that record.
- Revocation: Revocation masks or detaches the source, drains pending notifications for the old holder, invalidates waiters, and prevents stale acknowledgements from affecting a new owner.
- Reset: If the source cannot be detached cleanly, the owning device is reset or disabled before the interrupt is reassigned.
- No MMIO or DMA implication: Interrupt delivery does not grant register access, DMA buffers, or packet memory.
Revocation Ordering
Device revocation must follow a fixed order:
- Stop new submissions by invalidating the driver’s user-visible handles.
- Revoke MMIO write authority by write-blocking or unmapping BAR pages, or by disabling the device before any DMA teardown starts.
- Mask or detach interrupts.
- Quiesce virtqueues or device command queues.
- Reset or disable the device if in-flight DMA cannot be accounted for.
- Remove IOMMU mappings or invalidate bounce-buffer handles.
- Scrub and free DMA pages.
This order prevents a stale driver from racing revocation with doorbell writes, interrupt acknowledgement, or descriptor reuse. Logical handle invalidation is not sufficient while a BAR remains mapped; register-write authority must be removed or the device must be disabled before descriptor or DMA-buffer ownership is reclaimed.
Implementation should represent the order as an explicit device-owner state machine rather than as ad hoc booleans:
#![allow(unused)]
fn main() {
enum DeviceOwnerState {
Active,
RevokingHandles,
MmioRevoked,
InterruptsDetached,
QueuesQuiesced,
Resetting,
DmaMappingsRemoved,
Dead,
}
}
No path may free or reassign DMA pages until the state has reached
QueuesQuiesced with all in-flight descriptors accounted for, or Resetting
has completed and the device can no longer write old buffers. Dead means all
user-visible handles are invalid, interrupts are detached or masked, DMA
mappings are removed, and pages have been scrubbed or transferred to a trusted
owner.
Hard invariants:
- DMA pages cannot be freed before
QueuesQuiescedor a completedResettingtransition proves old DMA writes are stopped. - MMIO write authority must be revoked before DMA ownership teardown.
- Interrupt reassignment cannot happen before old pending notifications are drained or generation-invalidated.
- Device reset is mandatory if in-flight DMA cannot be proven stopped.
Future Userspace-Driver Transition Criteria
Moving NIC or block drivers out of the kernel is gated by Security Verification Track S.11.2. The gate is only open when all rows below are implemented and demonstrated. The S.11.2.N labels are local checklist row IDs for this gate.
The completed Device Driver Foundation selected milestone used this track as
the prerequisite for the DMAPool, accounting, and hostile-smoke sub-gate.
Future DDF follow-ups still use these rows as the userspace-driver transition
gate: generic MSI/MSI-X dispatch and second-device reuse may land first, but
userspace DeviceMmio and Interrupt exposure stays blocked until these rows
pass.
Production DMAPool Ledger Prerequisite
Before userspace NIC or block drivers receive DeviceMmio, Interrupt, or
DMAPool handles, the device manager must own one ledger of record for each
claimed device. That ledger is the authoritative source for every
device-visible hold, not a diagnostic mirror of separate subsystems.
The ledger records:
- DMA pool bytes reserved and live;
- DMA buffer count, slot generation, and owner generation;
- mapped userspace DMA VMAs, quiesce state, scrub state, and release eligibility for each attached DMA pool;
- descriptor and ring depth limits, including live in-flight submissions and completions;
- page-rounded MMIO mappings and their owning
DeviceMmiogenerations; - interrupt holds, waiter generations, and routed-source generations;
- budget and OOM policy for allocation, queue growth, mapping, and interrupt attachment;
- teardown state in the device-owner state machine.
Every operation that creates, consumes, or releases device-visible authority must update this ledger as part of the same ownership transaction that changes device-manager state. That includes DMA buffer allocation/free, descriptor submission, completion accounting, BAR mapping/unmapping, interrupt attach/detach, reset, revoke, process exit, and capability release.
Implementation note, 2026-05-03 13:18 UTC: the QEMU virtio-rng metadata path
now runs a bounded teardown-trigger proof for cap-release, process-exit,
driver-crash, reset-disable, interrupt-waiter, future-devicemmio, and
future-dmapool. Each trigger row sequentially claims and transfers the same
PCI function, begins revocation, walks the existing device-owner state machine
to Dead, releases only after Dead, and proves generation bumps, stale
handle rejection, direct state-skip rejection, pre-Dead release rejection,
and per-trigger coverage without duplicates. The cap-release row attaches a
bounded manager-owned DeviceMmio record to the active driver handle, removes a
DeviceMmioCap from a cap table, runs the CapOpRelease hook, and records
cap-table removal plus detached/stale manager validation before normal
revocation. The process-exit row attaches the same bounded DeviceMmio
record shape to a real proof Process, runs
Process::release_caps_for_exit(), and records cap-table removal plus
detached/stale manager validation before normal revocation. The driver-crash,
reset-disable, and interrupt-waiter rows register and claim bounded PCI
MSI-X lifecycle-probe routes, attach them to the device manager, prove
InterruptsDetached is blocked as interrupts-attached, detach and release
the routes while still in MmioRevoked, and then advance normally.
The future-devicemmio row attaches a bounded manager-owned DeviceMmio
record from the first decoded PCI memory BAR, proves MmioRevoked is blocked
as devicemmio-attached, detaches while still in RevokingHandles, and then
advances normally. The future-dmapool row attaches a bounded zero-live
DMAPool record, proves DmaMappingsRemoved is blocked as
dmapool-attached, detaches while still in Resetting, and then advances
normally. The generic teardown-trigger summary reports no label-only rows
and seven object-backed rows, while the route-aware interrupt handoff smoke
also labels the claimed MSI-X route as bounded interrupt-waiter blocker
evidence: interrupt_waiter_object=interrupt-route-record,
interrupt_waiter_block_state=InterruptsDetached,
interrupt_waiter_block_result=interrupts-attached,
interrupt_waiter_detach_result=ok, and
interrupt_waiter_route_generation_preserved=true. This bounded
route-record evidence is contract proof for the shared ownership transaction
only: it does not expose production userspace authority handles, real MMIO,
real DMA, a userspace waiter, or production crash/reset observers. Separate
first DeviceMmioCap,
InterruptCap, DmaPoolCap, and DmaBufferCap release-hook proofs now
exercise both the production ring CAP_OP_RELEASE dispatch path and a real
Process::release_caps_for_exit() path for those cap objects, validating
cap-table removal plus exact manager-owned DeviceMmio detach,
manager-attached interrupt-route release, bounded zero-live DMAPool detach,
or proof-owned DMA-buffer record cleanup. The generic route-record trigger rows
and remaining DMA production work do not yet implement production observers,
production interrupt-waiter objects, userspace DeviceMmio, production
userspace DMAPool/DMABuffer authority, full device authority, or true
pending hardware MSI/reset-hostile route teardown.
Implementation note, 2026-05-08 10:08 UTC: the first cap-specific
reset/disable trigger entry points now exist for DeviceMmioCap and
InterruptCap. trigger_reset_disable_for_devicemmio and
trigger_reset_disable_for_interrupt route through the same idempotent
stale-safe detach helpers as cap release and driver-crash cleanup, emit one
cap-audit: ... event=reset-disable detach=ok line on the first successful
trigger, and keep stale reruns silent. This is still bounded trigger plumbing:
the reset/disable observer, non-proof DMA cleanup integration, userspace
MMIO/interrupt operations, and IOMMU-backed remapping work remain future
requirements.
Implementation note, 2026-05-08 10:39 UTC: the DMA caps now have the matching
cap-specific reset/disable trigger plumbing. DmaPoolCap::on_reset_disable
uses the same authoritative zero-live/quiesced/scrubbed evidence-gated detach
as cap release and driver-crash cleanup. DmaBufferCap::on_reset_disable
reuses the bounded FreeBuffer authority validation and page-scrub/frame-free
cleanup path, then leaves the parent pool attached until staged zero-live
cleanup. make run-net asserts the dmabuffer-reset-disable-hook and
dmapool-reset-disable-hook proof lines, stale rerun noop, revoked
cap validation, post-trigger release noop, and exact-one
cap-audit: cap={dmabuffer,dmapool} event=reset-disable lines. This is still
proof-owned no-real-DMA cleanup; production userspace DMAPool/DMABuffer
authority and non-proof page lifecycle integration remain future work.
Budget or OOM failure is closed before the driver can observe a new handle, program a descriptor, map MMIO, attach an interrupt, or ring a doorbell. A failed submission must leave no live descriptor hold behind, or must leave an explicit in-flight record that teardown can drain or reset. A completed teardown must reconcile the ledger to zero live DMA buffers, zero live MMIO mappings, zero interrupt holds, and no in-flight descriptor submissions for the released device generation.
Implementation note, 2026-05-02 06:59 UTC, updated 2026-05-11 06:10 UTC: the
current kernel-owned virtio-net ledger now proves the closed budget/OOM cases
above with a scratch ledger and the live ledger validation described earlier.
Imported live device-manager DMAPool records still preserve the
device_dma:virtio-net source policy and prove imported live accounting stays
within its aggregate in-flight budget while preserving that policy’s per-queue
queue/submission depth limits. The manifest-granted manager-owned
bounce-buffer DMAPool path now attaches its own device-manager budget policy
to userspace DMAPool.allocateBuffer handle creation and the current
fixed-slot DMAPool/DMABuffer transfer, release, pending-release, drop,
rollback, teardown-detach, page-release, and descriptor-completion cleanup
paths. The full eight-slot pool fails as dmapool-budget-exceeded /
over-buffer-budget before allocation, cap minting, or ledger mutation, and
the selected release paths revalidate current or next accounting before
advancing manager-owned state. Production userspace DMAPool records must
still attach budget checks to broader provider-driver transfer/revoke/reset
transactions, IOMMU or direct-DMA mapping state, and non-fixed-slot
allocation before this row can be treated as the complete userspace-driver
transition gate.
Implementation note, 2026-05-02 08:33 UTC: the QEMU virtio-rng metadata path
now runs a bounded DMAPool record lifecycle proof on the device-manager
teardown state. The first slice keeps the record zero-live: it records a pool
slot, pool generation, and owner generation, rejects stale and
owner-mismatched attach attempts, rejects duplicate attachment, and proves that
begin_revocation invalidates the user-visible pool handle by bumping the
device owner generation. The ordered teardown path now fails closed with
dmapool-attached if it tries to enter DmaMappingsRemoved while the pool
record remains attached. The current continuation also proves that the revoke
handle cannot detach the zero-live pool without scratch authoritative
zero-live, quiesced, and scrubbed evidence bound to that record’s source; a
mismatched scratch source is rejected before detach. With matching
proof-scoped evidence, the record detaches after queues are quiesced/reset and
before DmaMappingsRemoved. Later bounded manifest grants expose conservative
DMAPool, DeviceMmio, and Interrupt surfaces; the current DMAPool grant
can mint only eight fixed manager-attached proof DMABuffer result caps. The
remaining gap is production userspace authority, allocation beyond those eight
fixed slots, real device-visible page allocation through the device manager,
non-proof DMA page lifecycle integration, IOMMU remapping, and the S.11.2
hostile smoke matrix.
Current QEMU evidence: the QEMU virtio-net path now adds
the corresponding imported live-accounting prerequisite proof. A
device-manager DMAPool record is attached with accounting derived from the
live device_dma ledger: live buffer/page count, live bytes, current in-flight
submissions, committed/resident/unswappable residency flags, and
scrub-before-release policy. DmaMappingsRemoved fails closed with
dmapool-attached while the record remains attached, direct teardown detach
fails closed with dmapool-live while the authoritative ledger remains live,
and the live proof consumes the device_dma teardown-evidence API, observes
authoritative-ledger-live with matching imported live accounting, and
explicitly defers completion with no real DMA teardown attempted. The same
proof path now validates the imported DMAPool record through
capos-lib::device_authority for the active handle and stale-after-revoke
failure labels.
This does not create production userspace handles, real page-release hooks,
IOMMU mapping invalidation, scrubbed release, terminal Dead, or
hostile-smoke coverage for the live virtio-net record. The companion
scratch-ledger proof covers the positive zero-live teardown-evidence result
without claiming that the live virtio-net record has been torn down. The
manager-owned zero-live lifecycle proof consumes matching-source
device-manager teardown evidence for the positive detach/DmaMappingsRemoved
path and separately proves mismatched-source and missing-evidence detach
attempts fail closed. The manifest-granted bounded DMAPool path now keeps
mapped userspace VMA count, in-flight descriptor holds, residency,
quiesce/scrub state, and release eligibility in that manager record. Borrowed
or device-visible pages remain committed, resident, unswappable,
generation-bound, and unavailable for reuse until the manager record is
zero-live, unmapped, quiesced, and scrubbed. Descriptor submission is refused
while a buffer is borrowed to userspace, and release consumes manager-owned
teardown evidence instead of proof-only device_dma zero-live evidence.
The corresponding DMAPool.info ABI reports mapped VMA count, quiesce state,
scrub state, and release eligibility for QEMU proof assertions. This is still
bounded bounce-buffer lifecycle authority only: direct DMA, host physical or
IOVA exposure, IOMMU/remapping, production provider-driver consumption,
durable audit, and broader transfer/revoke policy remain future work.
| Gate item | Required state | Must-have proof |
|---|---|---|
| S.11.2.0 DMA-owned buffers | DMAPool owns every driver-visible DMA mapping. | A driver receives opaque buffer handles or IOVA-only values; no path hands out raw host physical addresses. |
| S.11.2.1 Bound checks | Allocation, descriptor chain length, alignment, segment length, and ring depth are bounded and constant-time validated before ring submission. | Ring submissions fail closed on overflow, wrap, stale-handle, and freed-handle reuse attempts. |
| S.11.2.2 Explicit remap/ownership | DeviceMmio can only grant claimed BAR pages; cache attributes and write policy are enforced. | Driver cannot access unclaimed BARs, ROM, RAM pages, config-space globals, or stale mappings after revoke. |
| S.11.2.3 Interrupt correctness | Interrupt owns exactly one logical source at a time and drains/waits only for that source. | Reassigning an owner invalidates old waiters and masks or detaches the source first. |
| S.11.2.4 Quiesce + reset contract | Device manager can force reset/disable on failed revoke or teardown. | No in-flight descriptor may continue touching freed buffers after driver removal. |
| S.11.2.5 Process lifecycle | Capability release, process exit, and process-spawn cleanup paths cannot leak DMA pages/MMIO/intr ownership. | Crash-path teardown removes holds and invalidates user-visible handles before page free. |
| S.11.2.6 Isolation and accounting | Security Verification Track S.9 quota and authority ledgers include DMA, MMIO, and interrupt hold edges. | A malicious or buggy driver cannot consume more than its allocated authority budget. |
| S.11.2.7 Stale IRQ ordering | Stale interrupt delivery after revoke cannot wake, acknowledge, or signal a new owner. | Interrupt generation mismatch is ignored, or the source is masked/detached/reset before reassignment. Hostile smoke revokes a driver while an interrupt is pending, reassigns the source, and proves the old waiter cannot wake against the new owner. Closed 2026-05-05 18:17 UTC by make run-net’s device-manager: interrupt handoff proof line: real INT $vector injection across revoke, detach, and reset/reuse exercises the production IDT entry/handler/EOI path, asserts s11_2_7_real_irq_injected_across_reset=ok, s11_2_7_old_waiter_cannot_wake_new_owner=true, and s11_2_7_stale_ack_blocked=true, and is enforced by tools/qemu-net-smoke.sh. Userspace Interrupt waiter objects remain a future requirement for a full production-driver path. |
| S.11.2.8 Stale DMA completion ordering | Stale DMA completion after revoke cannot cause freed buffer reuse, stale CQ notification, or new-owner memory exposure. | Closed 2026-05-05 19:37 UTC by make run-net’s device-manager: dma completion handoff proof line: real virtio-net DMA page free + reallocate cycle bumps the live page generation, then the production device_dma::record_virtio_net_completion_for_allocation path (the same function the live Virtqueue::record_used_completion_for_allocation invokes) is fed a stale DeviceDmaAllocation keyed to the live phys with a decremented generation, at three boundaries (after revoke, after detach, after reset/reuse). All three reject as stale-dma-handle with side-effect-blocked, queue accounting unchanged, live new-owner page preserved, no CQ publication, no new-owner exposure, and the freed-buffer slot remaining unchanged. The closure summary asserts s11_2_8_real_completion_injected_across_reset=ok, s11_2_8_old_completion_cannot_publish_to_new_owner=true, s11_2_8_freed_buffer_reuse_blocked=true, and s11_2_8_accounting_underflow_blocked=true, and is enforced by tools/qemu-net-smoke.sh. Prior acceptance text: in-flight DMA is accounted for, or device reset/disable completes before buffer reuse; hostile smoke covers revoke/reset with outstanding descriptors and proves no old completion can publish new-owner memory. S.11.2.9 hostile-smoke gate-wiring also closed 2026-05-05 20:49 UTC (see the row below). Userspace DMAPool handles and real device-manager page quiesce/scrub/release hooks remain open as separate follow-ups. |
| S.11.2.9 Hostile-smoke coverage | QEMU/CI smokes cover stale handles, descriptor abuse, revoke races, stale IRQ after reset, stale DMA completion after reset, and exit-under-dma. | Smoke output has explicit closed-case proof lines for each above failure mode. Closed 2026-05-05 20:49 UTC by aggregating the existing per-row proof lines into the make run-net -> tools/qemu-net-smoke.sh gate. Every matrix-row proof line has at least one assertion in the harness; the original two driver-crash assertions, the existing S.11.2.8 device-manager: dma completion handoff proof closure-summary assertion, and the S.11.2.7 device-manager: interrupt handoff proof closure-summary assertion (whose trailing anchor was added by this slice for harness-strictness consistency with S.11.2.8) all use the anchored extended-regex shape (field-by-field match plus proof_result=ok[[:cntrl:]]?$ trailing anchor), and the other matrix-row rows reuse the harness’s pre-existing mix of unanchored extended-regex and fixed-string grep -Fq assertions. A 2026-05-08 09:44 UTC follow-up adds anchored assertions for the cap-specific dmabuffer-driver-crash-hook and dmapool-driver-crash-hook proof lines; a 2026-05-08 10:08 UTC follow-up adds anchored assertions and exact-one audit counts for the first cap-specific devicemmio-reset-disable-hook and interrupt-reset-disable-hook proof lines; a 2026-05-08 10:39 UTC follow-up does the same for dmabuffer-reset-disable-hook and dmapool-reset-disable-hook; a 2026-05-08 13:42 UTC follow-up (aeef8b41) adds the cap-specific device-manager: interrupt waiter hook proof source=interrupt-waiter-hook ... trigger_path=trigger-interrupt-waiter-for-interrupt assertion plus an exact-one cap-audit: cap=interrupt event=interrupt-waiter count. Per-row coverage: stale DMA handle (device-dma: stale dma handle proof, device-dma: live stale dma completion accounting proof); descriptor abuse (virtio-net: software descriptor generation model proof, virtio-net: invalid used descriptor id software-token proof, virtio-net: descriptor generation guard proof ok, virtio-net: invalid used descriptor id live software-token proof ok, plus device-dma: budget oom proof); revoke/reset race (device-manager: ownership proof, the seven device-manager: teardown trigger proof trigger=... variants plus the final aggregate, device-manager: dma completion handoff proof for S.11.2.8, device-manager: interrupt handoff proof for S.11.2.7, the device-manager: devicemmio driver crash hook proof source=devicemmio-driver-crash-hook ... trigger_path=trigger-driver-crash-for-devicemmio, device-manager: interrupt driver crash hook proof source=interrupt-driver-crash-hook ... trigger_path=trigger-driver-crash-for-interrupt, device-manager: dmabuffer driver crash hook proof source=dmabuffer-driver-crash-hook ... trigger_path=trigger-driver-crash-for-dmabuffer, device-manager: dmapool driver crash hook proof source=dmapool-driver-crash-hook ... trigger_path=trigger-driver-crash-for-dmapool, device-manager: devicemmio reset disable hook proof source=devicemmio-reset-disable-hook ... trigger_path=trigger-reset-disable-for-devicemmio, device-manager: interrupt reset disable hook proof source=interrupt-reset-disable-hook ... trigger_path=trigger-reset-disable-for-interrupt, device-manager: dmabuffer reset disable hook proof source=dmabuffer-reset-disable-hook ... trigger_path=trigger-reset-disable-for-dmabuffer, device-manager: dmapool reset disable hook proof source=dmapool-reset-disable-hook ... trigger_path=trigger-reset-disable-for-dmapool, and device-manager: interrupt waiter hook proof source=interrupt-waiter-hook ... trigger_path=trigger-interrupt-waiter-for-interrupt lines, all requiring first-trigger ok, stale rerun noop, cap validate_live=revoked, post-trigger release noop, and proof_result=ok with cap-specific cleanup/evidence labels); stale IRQ after reset (S.11.2.7 closure summary, see row above); stale DMA completion after reset (S.11.2.8 closure summary, see row above); exit-under-DMA (device-manager: teardown trigger proof trigger=process-exit owner=virtio-rng, the teardown-trigger aggregate triggers=cap-release,process-exit,driver-crash,reset-disable,interrupt-waiter,future-devicemmio,future-dmapool line, the four cap-release-hook proofs each containing process_exit_path=process-release-caps-for-exit, plus hardware-cap-release: ... reason=process-exit count assertions). A 2026-05-23 21:34 UTC follow-up adds the IOMMU production DMAPool hostile proof over the active mapped ledger, covering stale IOVA after revoke/reset, descriptor abuse, revoke/reset race ordering, stale completion after reset, teardown-under-DMA ordering, cross-domain stale-handle attempts, and the fail-closed teardown branch proof; process-exit/exit-under-DMA remains the existing run-net bounce-buffer evidence. Production userspace DeviceMmio/Interrupt handles, broader non-proof device-manager page quiesce/scrub/release hooks outside the selected IOMMU smoke, hardware-backed provider-driver Interrupt wait/ack dispatch beyond the bounded route-dispatch waiter proof, and durable/signed production audit consumption beyond the first volatile HardwareAuditLog.snapshot cap remain open as separate follow-ups. |
For each row, the transition requires an owner, implementation notes, and a CI-backed
verification path. Until all rows pass, Phase 4.2 NIC/block drivers remain in-kernel for
functionality, and only kernel-mapped bounce-buffer mode is allowed for prototype DMA.
Hostile-Smoke Acceptance Matrix
These smokes are the acceptance requirements for the userspace driver
transition. The S.11.2.7, S.11.2.8, and S.11.2.9 rows are now backed by
current make run-net QEMU evidence enforced by tools/qemu-net-smoke.sh
(see the per-row “Closed” notes for closure timestamps and the proof-line
shapes). The other matrix rows remain acceptance requirements for future
implementation work; their proof lines are emitted by the kernel today
and asserted by the same harness, but the production userspace handles,
real device-manager page quiesce/scrub/release hooks, real userspace
Interrupt waiter objects, IOMMU domain programming, and durable/signed
production audit consumption beyond the volatile HardwareAuditLog.snapshot
cap that complete each row’s full closure remain open as separate
follow-ups.
| Hostile case | Required setup | Closed-case proof expectation |
|---|---|---|
| Stale DMA handle | Allocate a DMA buffer, revoke or free it, advance the slot or pool generation, then attempt descriptor submission or buffer reuse through the old handle. | The operation fails closed on generation mismatch; no descriptor is made visible to the device, no DMA byte or buffer hold is restored, and any reused slot remains owned only by the new generation. |
| Descriptor abuse | Submit chains with out-of-pool addresses, stale or freed buffer slots, arithmetic wrap, misalignment, overlong segments, excessive chain length, or ring-depth overflow. | Validation rejects the chain before any doorbell write; the ledger shows no leaked descriptor hold, no in-flight increment without an owning buffer, and no access outside the pool range. |
| Revoke/reset race | Race revoke, reset, or process teardown against a driver that is submitting descriptors or ringing the device doorbell. | Revocation first invalidates handles and MMIO write authority; later submissions fail closed, existing in-flight records are either completed under the old generation or reset/disabled before page reuse, and teardown cannot skip to DmaMappingsRemoved while the ledger has live submissions. |
| Stale IRQ after reset | Leave an interrupt pending or a waiter blocked, reset or reassign the device/source, then deliver or acknowledge using the old generation. | The old waiter cannot wake against the new owner, stale acknowledgements do not affect the reassigned source, and the source is masked, detached, or generation-invalidated before reassignment. Closed 2026-05-05 18:17 UTC: make run-net injects a real INT $vector through the IDT/handler/EOI path at three points across revoke, detach, and reset/reuse and records s11_2_7_real_irq_injected_across_reset=ok, s11_2_7_old_waiter_cannot_wake_new_owner=true, s11_2_7_stale_ack_blocked=true, plus matching real_irq_inject_after_revoke_result=masked, real_irq_inject_after_detach_result=unregistered, real_irq_inject_after_reset_reuse_result=masked on the kernel proof line. |
| Stale DMA completion after reset | Reset with outstanding descriptors, reuse or prepare to reuse pool slots, then inject or observe a completion from the old device generation. | The stale completion cannot publish a CQE to a new owner, cannot expose new-owner memory, cannot underflow accounting, and cannot make a freed buffer eligible for reuse unless reset/disable has proven old DMA stopped. Closed 2026-05-05 19:37 UTC: make run-net walks a fresh device-manager record on the virtio-net BDF through the Active>RevokingHandles>MmioRevoked>InterruptsDetached>QueuesQuiesced>Resetting>DmaMappingsRemoved>Dead revocation path, exercises a real virtio-net DMA page free + reallocate cycle at three boundaries (after revoke, after detach, after reset/reuse), and feeds a synthesized stale DeviceDmaAllocation (live phys, decremented generation) to the production device_dma::record_virtio_net_completion_for_allocation path. Each boundary records real_completion_inject_after_*_result=stale-dma-handle, _side_effect=side-effect-blocked, _queue_account_preserved=true, _live_page_preserved=true, _cq_publication_blocked=true, _new_owner_exposure_blocked=true, _freed_buffer_unchanged=true, and _generation_bumped=true, plus a closure summary s11_2_8_real_completion_injected_across_reset=ok, s11_2_8_old_completion_cannot_publish_to_new_owner=true, s11_2_8_freed_buffer_reuse_blocked=true, s11_2_8_accounting_underflow_blocked=true. |
| Exit-under-DMA | Terminate or crash a driver process while it holds DMA buffers, MMIO mappings, interrupt waiters, and in-flight descriptors. | Process exit enters the device-manager teardown path, invalidates all user-visible handles, revokes MMIO, detaches interrupts, quiesces or resets queues, scrubs DMA pages before release, and reports a terminal ledger with no live holds for the old owner generation. |
Security Verification Track S.11.2 Decision Record
Security Verification Track S.11.2 is backend-scoped. The current brokered-bounce userspace-provider path has enough reviewed evidence to close the retained DDF production-authority finding, but that closeout is not a general direct-DMA, hostile-hardware, or device-autonomous interrupt claim.
Current status: the brokered-bounce transition path is represented by done task
evidence for DMAPool, DeviceMmio, and Interrupt lifecycle ownership,
provider virtio-net/NVMe chains, and hardware-audit consumption of abort-held
DMA mappings. The broader S.11.2 matrix remains the canonical gate for future
direct-remapping/vIOMMU, trusted-sharing-group, hostile-hardware-isolation, or
provider-written-address work. This document fixes the production handle epoch
invariants, DMAPool ledger of record, and hostile-smoke acceptance criteria
used by the completed Device Driver Foundation documentation gate. The
current QEMU virtio-net path has a kernel-owned DMA pool ledger for page,
descriptor, MMIO mapping, and interrupt-hold accounting proof coverage plus
static IOMMU attachment-policy reporting for retained DMA-capable PCI functions
and the bounded teardown trigger contract proof, bounded kernel-owned
budget/OOM proof, manager-bound DMAPool budget-profile proof plus bounded
budget-policy tamper and accounting-over-budget fail-closed proofs,
bounded manager-owned DeviceMmio proof adapter bound to decoded PCI
memory-BAR metadata plus future cache/write-policy metadata, bounded zero-live
device-manager DMAPool record lifecycle proof, and imported live-accounting
block/defer proof plus zero-live teardown-evidence scratch proof, stale DMA
handle scratch proof, stale DMA completion scratch proof, paired scratch
CQ-publication/new-owner-exposure proof, live software descriptor-generation
guard proof, bounded invalid used-descriptor-id proof, and bounded stale IRQ
after-detach, counter-backed after-revoke, counter-backed route-registry
reset-reuse, and pending IRQ token checks described above. The
bounded pure capos-lib::device_authority
validator and host tests cover the documented identity, state,
side-effect-blocking, non-wrapping epoch cases, and every current operation
variant’s exact blocked side-effect label for stale owner/subrecord, freed,
revoked, and retired failures. The zero-live
device-manager DMAPool lifecycle proof now validates a proof-scoped
tampered budget-policy record through the manager policy helper and records
fail-closed, no fake allocation, no ledger mutation, no teardown advancement,
and side-effect blocking while preserving the positive
budget_policy_result=ok path. The positive zero-live and imported-live
budget-accounting labels now go through the manager-owned active-record helper,
and synthetic over-budget attached-accounting candidates fail closed with exact
reasons while preserving the active manager record and blocking allocation,
ledger, teardown, and side effects; an over-budget attach candidate fails
before pool generation allocation. It also records a bounded
manager-attached DMA buffer handle under the attached pool, validates active
SubmitDescriptor and manager-record CompleteDescriptor through the pure
DMA-buffer validator, and records stale-after-revoke, freed-buffer, and
reused-slot rejection with exact reasons and side-effect-blocked; it now
also blocks pool teardown as
dmapool-buffer-attached, rejects a stale same-slot proof-scoped FreeBuffer
as dmabuffer-stale-handle with stale-slot-generation and
side-effect-blocked, rejects wrong-owner-generation, wrong-pool, wrong-pool
generation, and wrong-buffer-slot FreeBuffer attempts with exact pure
validator reasons and side-effect-blocked, preserves that manager-owned
buffer record after each failed free, and clears the record only after a
proof-scoped active FreeBuffer validation, proof-page scrub/free, and
manager-owned buffer-record detach. The completion proof does not publish a CQ
entry, complete a real descriptor, grant userspace authority, or clean up or
reuse production userspace DMA pages. The live virtio-net queue-completion
path now gates completion accounting on the completed descriptor’s
DeviceDmaAllocation rather than the queue id alone: callers must validate
the used descriptor id, recover the matching DmaPage, and pass its physical
address, queue, label, and generation to the kernel-owned ledger before
in-flight accounting is decremented. The paired run-net proof records that a
stale generation for a live kernel-owned page fails as stale-dma-handle,
leaves queue accounting and the live page unchanged, and blocks CQ publication
plus new-owner exposure. This closes a live accounting prerequisite only; it
does not inject a real post-reset device completion or expose userspace DMA
authority. The live virtio-net used-ring path also carries bounded software
descriptor generations: submissions reject invalid or already-active descriptor
ids before accounting, completions must consume the active software token
exactly once, and the run-net proof records side-effect blocking for active
reuse, double completion, and an old software token after descriptor-id reuse.
That guard does not make a stale hardware used-ring id distinguishable after
deliberate id reuse because virtio used entries carry no device generation. The
same gate now also covers invalid used-descriptor ids without corrupting the
hardware ring: an out-of-range id fails as descriptor-id-out-of-range before
completion observation, completion accounting, used_seen_idx, CQ publication,
or new-owner exposure can change. This is still a software-token and
constructed-token prerequisite, not a real malformed-device or post-reset
completion injection. The
same zero-live proof now also constructs the result-only
DMAPool.info cap skeleton from the manager-issued DmaPoolHandle, validates
the active manager record before returning conservative status labels plus
numeric device/BDF/owner/pool identity fields, proves the serialized cap call
path decodes to those labels and identity fields with host physical exposure
off and direct DMA blocked, and proves the cap’s info path fails closed as
dmapool-stale-handle after revoke begins. It also exercises
DMAPool.allocateBuffer through call_with_table() on a real cap-table entry,
returns zero-indexed DMABuffer result caps for eight fixed manager-owned
bounce-buffer slots, validates those result caps’ DMABuffer.info, and proves
a ninth allocation fails through the manager-owned budget policy as
dmapool-budget-exceeded / over-buffer-budget before publishing another
result cap or corrupting live slot state; full-pool allocation also preserves
manager generation counters. Stale-after-revoke allocations still fail closed
without publishing another result cap. The same zero-live proof
constructs the result-only
DMABuffer.info cap skeleton from the manager-attached DmaBufferHandle,
validates the active manager-owned buffer record through the pure DMA-buffer
validator before returning conservative no-authority labels plus numeric
device/BDF, owner/pool/slot identity fields, proves the serialized cap call
path decodes to those labels and identity fields with host physical exposure
off and direct DMA blocked, and proves the cap’s info path fails closed as
dmabuffer-stale-handle after revoke begins; the same stale cap’s serialized
method-0 path fails as invoke-failed. The first DmaBufferCap release hook
now reuses the bounded FreeBuffer validation shape to clear only the
manager-attached proof_buffer record during cap-table removal, production
ring CAP_OP_RELEASE, and real Process::release_caps_for_exit() paths. It
proves stale same-slot release is side-effect-blocked, proves the parent
DMAPool remains attached after buffer release, proves the bounded manifest
grant can allocate the slot again after explicit freeBuffer with a fresh slot
generation, and still requires staged zero-live evidence before the parent pool
can detach. The selected provider-TX path now adds a bounded exception to the
default manager-accounting descriptor contract: queue 1 submits may publish
the selected eight-entry TX queue depth, descriptors 0..7, into the existing
kernel-owned virtio-net TX ring before the first completion, ring one selected
notify doorbell per accepted provider descriptor, and then complete each
descriptor only after DMABuffer.completeDescriptor observes the matching
used-ring entry for the stored software descriptor generation. Those handoffs
clear the matching manager in-flight records, record bounded provider CQ
completion and acknowledgement counts, and can deliver ordered bounded
completion events to live tx_interrupt.wait calls for the same selected
route. The selected provider-TX path also proves a teardown-only drain when one
descriptor has completed and seven provider-published descriptors remain
incomplete: direct DMABuffer.freeBuffer remains blocked while in flight,
release explicitly drains only the incomplete matching used-ring entries and
retires those allocation-backed TX DMA queue ledgers without
DMABuffer.completeDescriptor results, no provider CQ/IRQ event is published
for the quiesced descriptors, release retires seven delivered-but-unacked
completion events, and later slot reuse requires a fresh generation plus normal
completion. Wrong-queue, stale-buffer, stale-notify, inflight-publication,
wrong-descriptor, duplicate-completion, and stale-tx_interrupt issue paths
remain side-effect-blocked before their guarded effects. This does not grant
direct DMA, arbitrary doorbells, arbitrary CQ ownership outside the selected TX
route, full virtio-net ownership, production NIC/storage migration, IOMMU
programming, hardware IRQ ownership, hardware acknowledgement, or broad
interrupt ownership beyond the bounded selected TX MSI-X mask/unmask proof.
The bounded DeviceMmio proof also records the manager-attached policy
metadata listed above, fails closed on a tampered cache/write-policy record
before creating any mapping, and validates active hostile handle identities for
wrong owner generation, wrong mapping generation, wrong mapping id, wrong BAR,
and wrong BDF/device with exact pure-validator reasons while preserving the
attached record and blocking mapping/doorbell side effects. Its serialized cap
call path also decodes to the direct DeviceMmio.info no-authority labels plus
numeric device/BDF, owner, BAR, mapping id, and mapping generation identity
fields with host physical exposure off and direct MMIO blocked, and its stale
serialized method-0 path fails as invoke-failed. The DMAPool.info skeleton
has the same kernel-side serialized stale failure evidence. The
interrupt handoff proof now also constructs a result-only Interrupt.info
cap skeleton from the manager-issued device handle and attached route record,
records active info success, proves the serialized cap call path decodes to the
direct no-authority labels plus numeric device/BDF, owner, source, source
generation, and route generation identity fields, proves those source and
route generations are distinct in the bounded route record, and proves
stale-after-revoke info fails closed as
interrupt-stale-handle plus stale serialized method-0 failure as
invoke-failed before any acknowledgement, mask, unmask, blocking wait, or
delivery authority exists. The manifest-granted skeleton now also exposes an
admission-only Interrupt.wait method that returns the pending-token
validator’s fail-closed labels without waking a waiter or changing delivery
counts, and an admission-only Interrupt.acknowledge method that validates the
active route while blocking hardware acknowledgement and preserving delivery
counts. It also exposes route-state-control Interrupt.mask and
Interrupt.unmask methods that validate the active route before changing the
manager-attached dispatch slot between claimed-masked and
driver-unmasked, while preserving delivery counts. A bounded
Interrupt.wait call observed after unmask installs a fixed-table userspace
waiter object for the current manager-granted route; the existing
route-dispatch delivery counter can now complete that waiter as
interrupt-delivered / waiter-completed-irq with
real_interrupt_delivery=delivered and an advanced delivery count. The same
focused smoke then submits a second unmasked wait, observes it remains pending,
calls Interrupt.mask, and finishes that wait as
interrupt-waiter-cancelled / route-masked /
waiter-completed-no-irq with wake_blocked=false, preserved source/route
generations, and unchanged delivery counts. The selected provider TX
tx_interrupt cap can now observe the bounded used-ring completion event
described above and account the already observed selected TX dispatch token
paired with that delivered provider CQ event, but hardware MSI/MSI-X programming
beyond the selected vector-control proof, full hardware IRQ ownership, deferred
EOI, LAPIC/MSI-X acknowledgement, and broader production interrupt dispatch
remain blocked. Provider TX MSI-X mask/unmask is limited to the selected-route
vector-control proof described earlier. Provider RX MSI-X mask/unmask remains
bounded to the selected RX route as well; release while masked restores that
selected vector-control bit and route state before clearing the live issue gate.
RX unmask admits the route transition before exposing the MSI-X vector-control
bit, and the focused QEMU proof shows a failed route unmask leaves the selected
vector masked with the route ledger preserved. Cleanup failure still leaves the
issue uncleared so future RX cap issuance stays blocked on uncertain route
state. RX wait/ack is now bounded to one selected-route zero-CQ dispatch token;
RX descriptors and CQ ownership remain blocked.
This is manager-record skeleton/no-production-DMA, no-real-MMIO-mapping, and
bounded route-dispatch interrupt-waiter prerequisite evidence only. Production
DMAPool, DeviceMmio, and Interrupt capability handles, production
userspace DMAPool buffer handles, real DeviceMmio BAR mapping objects,
real cache attributes/write policy enforcement, production kernel device-path
wiring beyond the current proof adapters, real device-manager page
quiesce/scrub/release hooks and real page cleanup/reuse beyond the bounded
kernel-owned proof pages, production
handle-attached budget/OOM enforcement beyond the current manager-owned
DMAPool.allocateBuffer budget slice, IOMMU remapping domains, production
handle-attached host tests, QEMU stale-handle smokes, broader userspace
exposure, production NIC/storage migration, cloud readiness, and S.11.2
hostile smokes remain open.
Do not weaken the short-term virtio-net bounce-buffer path until DMAPool,
DeviceMmio, Interrupt, device-manager ownership transactions, lifecycle
teardown, accounting, and hostile smokes all exist.
Design Risks and Open Questions Register
Consolidated index of known design risks and open architectural questions for capOS. Every entry routes to the file that owns the long-form design or the remediation backlog for that risk; this register itself is a pointer document, not a place to put new design.
Use this document to answer “is this risk already tracked, and where?” without re-deriving the state from the proposal tree on each review.
Last refresh: 2026-06-07 08:02 UTC.
How To Use
- Each design-risk row records the current observable state (what the code and docs say today), the owning tracker (the proposal/backlog/design file to update when the state changes), and the remaining gap (what is still open).
- Each open-question row records a current answer if one exists in the tree, plus a pointer to the canonical tracker. Questions that are genuinely unanswered are marked Open; those should not be closed by guessing here – update the relevant proposal, then update this register.
- When a risk is closed by code or by an explicit design decision, move the
short closure summary into
docs/changelog.mdand remove the row. - New review findings go into task records under
docs/tasks/; this register is about long-horizon design risks, not concrete unresolved review issues.
Design Risks
R1 – Process-wide ring vs multi-threaded userspace and full SMP
- State. The capability ring is one per process.
capos-rtenforces a single-ownerRuntimeRingClient. After in-process threading, at most one process-ring waiter is allowed. The first SMP Phase C AP scheduler-owner proof deliberately keeps process-wide ring execution on a single CPU at a time behind a scheduler-owner latch. - Owner.
docs/proposals/ring-v2-smp-proposal.md,docs/research/completion-ring-threading.md,docs/backlog/smp-phase-c.md,docs/architecture/threading.md. - Gap. Per-thread capability rings, per-thread completion routing, and the
Multi-Process / In-Process Threading Scalability milestones in
docs/roadmap.mdremain future work. Userspace threading scales only as far as the single ring waiter allows.
R2 – “Interface IS the permission” pushes safety into wrapper TCB
- State. capOS deliberately has no parallel rights bitmask: attenuation is
done by handing out a narrower
CapObjectwrapper, not a flag-reduced copy of the same cap. Wrapper correctness is therefore part of the trust base. - Owner.
docs/capability-model.md,docs/proposals/session-bound-invocation-context-proposal.md,docs/security/trust-boundaries.md,docs/backlog/stage-6-capability-semantics.md. - Gap. The completed Session-Bound Invocation Context migration has the one-session-per-process proof, privacy-preserving endpoint caller-session metadata, explicit subject-disclosure coverage, chat session-keyed state, Adventure service grants, terminal/stdio bridge liveness guards, and final Gate 4 verification. The first Tier-1 paper claim, covering session-bound invocation context evidence for implementation review, is closed. Remaining non-gating cleanup is stable service-audit identity across service replacement and legacy internal receiver-selector naming.
R3 – Legacy endpoint metadata as transitional service identity
- State. Legacy endpoint receiver metadata is contained as internal transport/debug state for normal paths. Chat uses session-keyed membership, terminal/stdio bridges enforce live caller-session guards, and delegated relabeling containment plus the historical service-object routing/lifecycle proof have landed. Adventure/shared-service cleanup is landed for normal workload paths.
- Owner.
docs/proposals/session-bound-invocation-context-proposal.md,docs/backlog/stage-6-capability-semantics.md. - Gap. Finish final legacy cleanup. Receiver metadata must remain internal transport state or hostile-test fixture, not subject identity or disclosure.
R4 – Resource accounting is fragmented
- State. Per-process memory, cap-table, and thread quotas exist;
ResourceProfile, session quotas, scheduling-context donation, and cross-service donation/fairness are still proposal-shaped. - Owner.
docs/proposals/resource-accounting-proposal.md,docs/proposals/memory-authority-model-proposal.md,docs/proposals/oom-and-swap-proposal.md,docs/proposals/user-identity-and-policy-proposal.md,docs/proposals/system-monitoring-proposal.md,docs/proposals/scheduler-evolution-proposal.md,docs/backlog/scheduler-evolution.md. - Gap. Phase D WFQ has landed; Phase E
SchedulingContextbind/revoke, budget, donation/return, and depletion notification are closed at the scheduler-cap layer, but cross-service donation semantics, per-service fairness beyond thread weights, log volume accounting, memory authority/residency proof obligations, unified resource bundles for guest/anonymous/external/service principals, and the scratch-bytes / outstanding-calls / endpoint-queue / in-flight-call quota fields tracked in review-finding task records remain open.
R5 – Copy-transfer SQE replay is repeatable by design
- State.
docs/authority-accounting-transfer-design.mddocuments that userspace replay of a copy-transfer SQE is repeatable per dispatch attempt, with move-transfer replay failing closed once the source slot is removed/reserved. Exactly-once replay suppression is explicitly future work (security invariant T3). - Owner.
docs/authority-accounting-transfer-design.md,docs/proposals/security-and-verification-proposal.md. - Gap. The
(sender_pid, call_id, sqe_seq)plus monotonic transfer-epoch identity needed for exactly-once replay across dispatch attempts is not implemented. Each transferable interface must continue to acknowledge this in its threat model.
R6 – CAP_OP_RELEASE is deferred / queued, not synchronous
- State. Owned-handle drop in
capos-rtqueues one localCAP_OP_RELEASEon the ring; process exit performs fallback cleanup. Release does not run before the next ring flush (cap_enteror process exit). - Owner.
docs/authority-accounting-transfer-design.md,docs/proposals/error-handling-proposal.md,docs/capability-model.md. - Gap. Resource-pressure or revocation-sensitive flows must not assume a
Drop call has already taken effect at the kernel layer. Time-critical
revocation should use
CapabilityManager.revokeor epoch revocation rather than relying on Drop.
R7 – Shared memory / zero-copy / shared park are incomplete
- State.
MemoryObjectsubstrate exists;SharedBufferprovenance, file/network/DMA zero-copy paths, and shared park/SharedParkSpaceare blocked on mapping provenance / object pinning work. - Owner.
docs/proposals/storage-and-naming-proposal.md,docs/proposals/memory-authority-model-proposal.md,docs/proposals/networking-proposal.md,docs/architecture/park.md,docs/backlog/runtime-network-shell.md. - Gap. Workloads that need true zero-copy IPC, storage, or network
pipelines pay a copy/serialization cost until provenance/pinning lands.
ParkSpace private cleanup now covers anonymous
VirtualMemory.unmap,VirtualMemory.decommit, and explicitMemoryObject.unmapfor borrowed mappings; shared park keys and address-space generation cleanup remain open.
R8 – Networking lives inside the kernel TCB
- State. Largely resolved: the Phase C userspace NIC driver and smoltcp
network-stack process own the production socket path, the kernel no longer
depends on
smoltcp, and the kernel socketCapObjects are qemu-only fixtures that fail closed without a kernel socket owner. The Telnet and SSH terminal-host proofs that sat on the kernel path are retired. - Owner.
docs/proposals/networking-proposal.md,docs/dma-isolation-design.md,docs/backlog/runtime-network-shell.md. - Gap. The remaining qemu-only kernel virtio-net fixture and socket
CapObjectsurface is fixture code, not production authority. The kernel-sideSocketTerminalSessiontransitional shim is retired (2026-06-10):TcpSocket.intoTerminalSessionfails closed, and a network-backedTerminalSessionmust be re-built as a userspace terminal-session service over the userspace TCP stack if byte-stream terminal transport is needed again.
R9 – DMA isolation is backend-scoped, not a hostile-hardware blanket
- State.
docs/dma-isolation-design.mdnow records runtime fail-closed DMA backend selection. The current no-IOMMU cloud/DDF path uses manager-owned, brokered bounce buffers for userspace provider authority and hides host physical addresses and IOVAs from the driver. The selected QEMU Intel VT-d path has bounded per-device remapping evidence, but that remains emulator evidence rather than a general hardware-isolation claim. Without trusted remapping, hostile bus-mastering hardware remains out of scope. - Owner.
docs/dma-isolation-design.md,docs/proposals/networking-proposal.md,docs/proposals/cloud-deployment-proposal.md,docs/backlog/hardware-boot-storage.md. - Gap. The retained DDF production-authority finding is closed in
docs/tasks/done/2026-06-07/ddf-production-authority-closeout.md. Remaining work is explicit task or proposal scope: direct-remapping/vIOMMU production hardware support, broader provider/device variants, and device-autonomous MSI-X delivery rather than the current polled or kernel-injected waiter proofs.
R10 – Boot package model embeds all binaries
- State.
tools/mkmanifestembeds every declared binary as aNamedBlobinsidemanifest.bin. The kernel loads onlyinit; everything else is fetched byinitfrom the in-memoryBootPackage. - Owner.
docs/backlog/hardware-boot-storage.md,docs/proposals/storage-and-naming-proposal.md,docs/trusted-build-inputs.md. - Gap. Boot binary ISO layout (separate ELF payloads), package/storage update model, and persistent storage-backed delivery are not yet designed as code; the current scheme is an explicit prototype compromise.
R11 – Pre-auth and post-auth share a shell process
- State. The shell-led boot flow folds
console-loginintocapos-shelland uses an anonymous-first session that escalates vialogin/setup. The pre-auth and post-auth code paths run in one userspace process and address space. - Owner.
docs/proposals/boot-to-shell-proposal.md,docs/proposals/shell-proposal.md,docs/security/trust-boundaries.md,docs/proposals/user-identity-and-policy-proposal.md. - Gap. Separation depends on shell/auth implementation quality, not on a process boundary. The future direction (separate login service with minimal authority, restricted launchers, WebShell/SshGateway) is proposal-shaped. Remote and non-loopback shells must remain blocked until pre-auth and post-auth authority are process-isolated or a shared-process proof is accepted.
R16 – Remote shell ingress is demo/prototype only
- State. Telnet is a plaintext loopback-only QEMU demo. SSH has SSH-shaped prerequisites, fixture authentication proofs, dev key material, policy classification, and restricted-shell launcher coverage, but no production encrypted SSH transport, durable key/account storage, full OpenSSH-compatible userauth/channel handling, channel binding, or complete audit/storage gates.
- Owner.
docs/proposals/ssh-shell-proposal.md,docs/proposals/telnet-tls-shell-proposal.md,docs/backlog/runtime-network-shell.md,docs/tasks/README.md,docs/build-run-test.md. - Gap. Production/non-loopback shell exposure is blocked on SSH transport, key, account, audit, storage, session-bound delegation, and pre-auth/post-auth isolation gates.
R17 – Remote-session UI bridge and Tauri wrapper are research-only
- State. The Linux remote-session-ui bridge and the repo-local Tauri wrapper run as trusted local backends that hold the upstream capOS session and project view models / call results to the browser/webview. A policy preflight now proves the wrapper remains check/dev only; distributable packaging and desktop automation modes are intentionally blocked.
- Owner.
docs/proposals/remote-session-ui-security-proposal.md,docs/proposals/remote-session-capset-client-proposal.md,docs/backlog/remote-session-capset-client.md. - Gap. Distributable packaging, desktop automation, and a reviewed production posture for the remote-session UI surface remain unreviewed in the relevant remote-session proposal/backlog task records. Non-loopback remote-session UI exposure must stay blocked until that posture is accepted.
R12 – Verification coverage is partial, not full proof
- State. Bounded Kani gate (
make kani-lib/make kani-lib-full), Loom ring model, Miri lib tests, proptest, fuzz harnesses, panic-surface inventory, and CI dependency policy exist. Coverage is not whole-system and not seL4-style functional refinement. - Owner.
docs/proposals/security-and-verification-proposal.md,docs/security/verification-workflow.md,docs/panic-surface-inventory.md,docs/backlog/security-verification.md. - Gap. Public/external claims must distinguish “bounded model checked”
from “fully verified”. Promote new properties into Kani/Loom only when the
invariant is concrete and bounded. IPC/scheduler panic-surface hardening
also remains open around guarded unwraps, rollback restoration, stale
queues, blocking waits, process/thread exit, endpoint cancellation, TLB
shootdown send failures, and scheduler hot-path expects. Kernel upper-half
page-table mutation after AP startup is closed for the current
MMIO/firmware helper path by
docs/tasks/done/2026-06-07/kernel-upper-half-pml4-propagation-hardening.md; future helper windows or allocator-growth paths that need a new kernel-half PML4 slot still require boot preseed or synchronized live-root propagation.
R13 – Trusted build inputs are partly pinned
- State. Limine (commit + artifact SHA-256),
capnp1.2.0 source tarball, CUE 0.16.0, mdBook/mdbook-mermaid, Typst 0.14.2, Cargo lockfiles, the Rust nightly date policy, the Kani toolchain bundle, OVMF firmware hash, and the CI apt package versions forqemu-system-x86,xorriso,make,git, andovmfare pinned or policy-pinned.make build-provenancerecords local runner identity, GitHub-hosted image identity when present, selected host-tool paths, package identities and normalized apt source pockets when discoverable, and OVMF path/package/hash or absence. CI pull requests run a blocking environment provenance comparison against the latest successful main-branchqemu-smokeprovenance artifact. - Owner.
docs/trusted-build-inputs.md,docs/proposals/cloud-deployment-proposal.md. - Gap. The PR-blocking environment comparison and qemu-smoke package pins
close the previous
make/gitidentity and advisory-compare gap for CI proof branches, butubuntu-24.04is still a GitHub-managed mutable runner label, not an immutable production image digest. Full production reproducibility still needs a self-built runner image referenced by digest, repo-managed download-and-verify tool digests for the apt-pinned build/boot tools, or both.
R14 – User identity / policy is proposal-shaped
- State. Anonymous/operator sessions, password setup/login, broker-issued shell bundles, and redacted audit records exist. Durable accounts, ABAC/MAC context, OIDC/passkeys, disk-backed account stores, and resource bundles are proposal-shaped. Stale-session calls and retained shell-bundle caps fail closed for current proof paths, but session liveness is still represented by immutable metadata plus expiry timestamps rather than a mutable session-manager cell with logout, revocation, recovery-only, and renewal state.
- Owner.
docs/proposals/user-identity-and-policy-proposal.md,docs/backlog/local-users-management.md,docs/backlog/session-bound-invocation-context.md,docs/proposals/oidc-and-oauth2-proposal.md,docs/proposals/certificates-and-tls-proposal.md,docs/proposals/cryptography-and-key-management-proposal.md. - Gap. Until durable identity / persistence / passkey paths land, capOS
is not a complete multi-user OS. Demo claims must scope to the proven
anonymous + operator + manifest-seeded local accounts model. Before treating
fixed short session expiry as production interactive UX, capOS needs
explicit
logout, owner-shell/gateway close propagation, and renewal paths that mint fresh grant leases without reviving stale ordinary grants.
R15 – App exception serialization depends on result-buffer capacity
- State. Application-level exceptions are serialized into the caller’s result buffer; if the target cannot be identified, invocation fails earlier with transport errors. Truncation/transport failures are documented.
- Owner.
docs/proposals/error-handling-proposal.md,docs/capability-model.md. - Gap. Service UX/debuggability can degrade for malformed or small-buffer clients. No remediation is required in code today, but each service contract should document its expected result-buffer capacity.
Open Design Questions
The following questions came up in external review. Each row gives the current best answer observed in the tree, the canonical tracker to update, and an explicit status.
Q1 – Cap’n Proto ABI compatibility policy
- Current answer.
docs/abi-evolution-policy.mddefines compatibility classes, stable schema ordinals, reserved-field rules, ring layout rules, version negotiation, deprecation windows, and review gates. Generated-code drift is still checked throughmake generated-code-checkandtools/check-generated-capnp.sh. - Tracker.
docs/abi-evolution-policy.md,docs/trusted-build-inputs.md,schema/capos.capnp,capos-config/src/ring.rs. - Status. Answered for the current research tree. Ring v2 compatibility remains a separate open question below.
Q2 – Ring v2 backward compatibility
- Current answer.
docs/proposals/ring-v2-smp-proposal.mdtreats per-thread ring ownership as the full-SMP target and frames it as an evolution that may need ABI changes;docs/tasks/README.mdcalls runtime ring reactor work the compatibility bridge. - Tracker.
docs/proposals/ring-v2-smp-proposal.md,docs/backlog/smp-phase-c.md. - Status. Open. Whether Ring v2 is backward-compatible with the process-wide ring or an explicit ABI break has not been decided.
Q3 – Which capabilities are copy-transferable vs move-only vs non-transferable
- Current answer.
docs/authority-accounting-transfer-design.mddefines copy/move/none transfer modes and the accounting/rollback rules. Per-interface transfer mode is encoded on the schema-definedCapObject. - Tracker.
docs/authority-accounting-transfer-design.md,schema/capos.capnp. - Status. Partial. The mode is enforced per object, but the user-visible matrix (which named caps are copy/move/none) is not consolidated in one document.
Q4 – Copy-transfer replay: feature or compromise
- Current answer. Repeatable copy-transfer replay is documented as the current accepted semantics. Exactly-once replay suppression is future work. See R5.
- Tracker.
docs/authority-accounting-transfer-design.md. - Status. Decided as “current semantics, future tightening optional”.
Q5 – When legacy endpoint identity is replaced and what migrates
- Current answer.
docs/backlog/session-bound-invocation-context.mddecomposes the selected migration: one immutable session context per process, privacy-preserving endpoint caller-session metadata, chat/adventure/stdio session-keyed migration, and legacy endpoint-identity cleanup. The old service-object identity plan is superseded. - Tracker.
docs/proposals/session-bound-invocation-context-proposal.md,docs/backlog/session-bound-invocation-context.md,docs/backlog/stage-6-capability-semantics.md. - Status. Selected milestone. See R3.
Q6 – Minimum production TCB target
- Current answer.
docs/proposals/security-and-verification-proposal.mdnow enumerates the current demo/proof TCB and the target production TCB. Current proofs still trust kernel networking, init/supervisors, broker/session services, harnesses, and QEMU virtio. The target production TCB removes ordinary apps and shell children but still includes minimal init/supervisor, credential/session/broker/key/audit services, production device managers, and ABI/schema/build-signature inputs. - Tracker.
docs/security/trust-boundaries.md,docs/proposals/userspace-authority-broker-proposal.md,docs/proposals/boot-to-shell-proposal.md. - Status. Partially answered. The TCB statement exists; reducing the actual implementation to that target and proving the non-loopback shell gates remains open.
Q7 – Revocation strategy
- Current answer. Generation/epoch revocation exists for endpoint-backed
caps;
CapabilityManager.revokecleans up endpoint-backed service objects by object behavior. Session-bound dispatch now fails closed for stale proof paths, but the target lifecycle splits revocation into session liveness cells, grant leases, and object/facet epochs. Revocation trees, leases, supervisor-owned-cap patterns, and session renewal/close propagation are proposal-shaped. - Tracker.
docs/proposals/service-architecture-proposal.md,docs/proposals/session-bound-invocation-context-proposal.md,docs/proposals/user-identity-and-policy-proposal.md,docs/capability-model.md. - Status. Open. The chosen revocation primitive set (epochs vs trees vs leases vs explicit-revoke methods per object) needs an explicit decision, and interactive session lifecycle needs a concrete liveness-cell plus renewal protocol.
Q8 – Boundary between kernel and service-level resource accounting
- Current answer. Memory frame grants and cap-table slots are kernel accounting; storage/network buffer accounting is proposed at the service layer. The boundary is not yet implementation-driven.
- Tracker.
docs/proposals/resource-accounting-proposal.md,docs/proposals/storage-and-naming-proposal.md,docs/proposals/networking-proposal.md. - Status. Open.
Q9 – CPU accounting and scheduling contexts
- Current answer. Per-CPU WFQ run queues, per-thread weighted vruntime,
SchedulingPolicyCapweight/latency-class authority, and Phase ESchedulingContextbind/revoke, budget, donation/return, and depletion notification are implemented perdocs/changelog.md(Phase D closed 2026-05-10) anddocs/proposals/scheduler-evolution-proposal.md. Cross-service donation policy, priority inheritance broader than scheduling contexts, explicit scheduling-cap fairness across principals, and full nohz activation remain proposal-shaped. - Tracker.
docs/proposals/smp-proposal.md,docs/proposals/scheduler-evolution-proposal.md,docs/backlog/scheduler-evolution.md,docs/proposals/resource-accounting-proposal.md,docs/architecture/scheduling.md. - Status. Partial. The base CPU accounting and scheduling-context model is implemented through Phase E; the surrounding policy (cross-service donation, full nohz activation, isolation leases, fairness across principals) is the remaining decision.
Q10 – IOMMU requirement for userspace networking
- Current answer.
docs/dma-isolation-design.mdselects a runtime fail-closed backend: direct remapping only when capOS can discover and program trusted translation authority, otherwise a labeled brokered bounce-buffer fallback or unsupported. The current GCP/no-IOMMU userspace-driver evidence uses the brokered bounce path. - Tracker.
docs/dma-isolation-design.md,docs/proposals/networking-proposal.md,docs/proposals/cloud-deployment-proposal.md. - Status. Answered for the current no-IOMMU cloud path. Future direct-remapping, vIOMMU, or hostile-hardware isolation claims require their own evidence and remain outside the brokered-bounce production authority closeout.
Q11 – Capability persistence model
- Current answer. All capabilities are runtime-only today; sealed/stored caps and namespace-mediated reconstitution are storage-proposal scope.
- Tracker.
docs/proposals/storage-and-naming-proposal.md,docs/proposals/volume-encryption-proposal.md,docs/paper/plan.md(paper-scoped persistence Tier-1 prerequisite). - Status. Open.
Q12 – Least-privilege shell command invocation
- Current answer.
capos-shellruns commands using broker-issued bundles; the broker, not the shell, is the policy decision point.RestrictedShellLauncherkeeps remote shell launches off raw spawn authority. - Tracker.
docs/proposals/shell-proposal.md,docs/proposals/userspace-authority-broker-proposal.md,docs/proposals/boot-to-shell-proposal.md. - Status. Direction agreed, complete migration to broker-only authority for every shell-driven invocation is open.
Q13 – Formal properties to prove
- Current answer. Existing bounded proofs cover cap-table non-forgery, frame-bitmap invariants, transfer rollback, and ring producer-consumer invariants. seL4-style full functional refinement is explicitly out of scope.
- Tracker.
docs/proposals/security-and-verification-proposal.md,docs/security/verification-workflow.md,docs/proposals/formal-mac-mic-proposal.md. - Status. Partially answered. A definitive list of “what we will keep proving” vs “what we will keep testing” should be added when the next Kani/Loom obligation set is concrete.
Q14 – Threat model coverage
- Current answer.
docs/proposals/security-and-verification-proposal.mdnow contains a threat actor matrix for local physical attackers, malicious DMA devices, malicious boot manifests, compromised init/supervisors, compromised narrow services, hostile network peers, and malicious build dependencies. - Tracker.
docs/security/trust-boundaries.md,docs/proposals/security-and-verification-proposal.md,docs/dma-isolation-design.md,docs/trusted-build-inputs.md. - Status. Answered at design level. Remaining work is implementation/proof through the relevant task records.
Q15 – Language runtimes integration model
- Current answer.
capos-rtis the canonical no_std Rust runtime. Go, Python, Lua, JavaScript/TypeScript, WASI, C/C++, and POSIX-shaped software are future tracks. The current documentation separates native runtime adapters, capability-native bindings, POSIX compatibility adapters, and WASI host adapters instead of treating “compatibility layer” as one shared ABI. - Tracker.
docs/programming-languages.md,docs/proposals/userspace-binaries-proposal.md,docs/proposals/go-runtime-proposal.md,docs/proposals/lua-scripting-proposal.md. - Status. Open. A common ABI layer vs per-runtime generated clients has not been decided; the current default is per-runtime or adapter-specific clients backed by explicit capabilities.
Device Driver Specifications
The pages under docs/devices/ are per-device driver references. Each one
captures the authoritative hardware/protocol specification a capOS device
driver is built from, the subset of that specification the driver actually
implements, and how the device binds onto capOS’s reviewed userspace
hardware-authority gate.
A device page is a navigational / provenance document, not a re-spec. It cites the spec (name, version, source), summarizes only the wire-format subset the driver actually implements, and points into the implementation with file + symbol references (the function, type, or constant name – not line numbers, which drift) so the doc maps to the code. Do not copy the full spec or dump exhaustive register tables: if something is in the spec and not specially handled by the driver, link to it rather than transcribing it.
Depth scales to maturity and risk. Transitional or stable in-kernel drivers get a concise provenance map – do not over-document stable code. Actively developed or higher-risk drivers (new DMA paths, cloud NICs/storage behind the userspace-authority gate) get fuller treatment.
These pages are the provenance map of record for device-driver work. Landing or
modifying a device driver requires creating or updating the matching
docs/devices/<device>.md page as part of the same change; it is part of that
change’s acceptance, not an afterthought. Each page reads as a reader-facing
capability map – the driver’s currently-implemented subset in present tense and
what is future or not yet implemented – not a per-slice development log.
docs/devices/ is distinct from the other device-adjacent doc areas:
docs/research/holds OS-design research deep-dives (capability models, IPC, scheduling, IOMMU prior art). It informs architecture; it does not specify a concrete device.docs/*-design.mdanddocs/proposals/describe capOS subsystem designs (the DMA isolation model, the device-manager refactor, the userspace-driver authority gate). They define the framework a driver binds into; a device page maps one device onto that framework.docs/devices/<device>.mdis the narrow, per-device contract: which external spec, which wire-format fields, and which capOS grants and fail-closed rules the driver depends on.
Three-part structure
Every device page follows the same three sections. See Device Spec Template for the blank form and virtio-net for a worked example.
- Spec basis – the authoritative specification(s) the driver is built from: name, version, and source (URL or ref). For open vendor devices without a freely published register spec (for example AWS ENA or Azure MANA), cite the upstream open-source driver and any published datasheet as the basis of record.
- Wire format (relevant subset) – the registers / BARs, queues / rings, descriptor and completion formats, and admin / management commands that the driver actually implements. Document the subset, not the whole spec.
- capOS mapping – how the device binds (note transitional in-kernel
status and any pending userspace move where applicable); its
DeviceMmio/Interrupt/DMAPoolusage; the fail-closed and validation rules it relies on (stale-generation rejection, bounds checks, doorbell scoping); and what is QEMU-emulable versus hardware-only. The last point drives whether the driver carries a QEMU proof or a host-side conformance gate plus a deferred live proof.
Pages
- Device Spec Template – the blank three-part form for a new device page.
- virtio-net – the in-tree modern virtio-net PCI NIC: the worked first example, sourced from the kernel virtio transport and the public virtio specification.
- NVMe – the queue-base/PRP register and descriptor subset the conditional kernel on-notify DMA validator scans on the NVMe doorbell path, plus the no-IOMMU brokered-DMA correction (validator mechanism + bounded hostile-scan proof + brokered controller bring-up).
- AWS Nitro EBS (NVMe storage) – the AWS cloud-shape
classification on top of the shared NVMe foundation: EBS exposed as NVMe
namespaces, the Nitro IOMMU-availability DMA-backend policy, and the local
make run-pci-nvmeprecursor proof. - Azure managed disk (NVMe storage) – the Azure cloud-shape
classification on the same shared NVMe foundation: Azure Boost managed disks
exposed as NVMe namespaces, the Azure IOMMU-availability DMA-backend policy,
why the older-family Hyper-V/SCSI path is out of scope, and the local
make run-pci-nvmeprecursor proof. - GCP Persistent Disk (storage) – the GCP cloud-shape
classification on the same shared NVMe foundation: PD exposed as NVMe
namespaces on current GCE generations, the GCE IOMMU-availability DMA-backend
policy, why the older-family
virtio-scsiPD path is out of scope, and the production storage-bind proof (cloud-prod-storage-bound-local-proof) that precedes a billable live-GCE storage driver bind. - GCE gVNIC – a grounding map for the Google Virtual NIC: spec
basis from the public gVNIC docs and the GVE Linux driver, the wire-format
subset (BARs, admin queue, MSI-X interrupt classes, GQI/DQO formats, QPL/RDA
addressing, reset) a future reusable capOS driver would implement, and the
DDF authority mapping. capOS has live-GCE inventory, admin-queue/register,
bounded GQI/QPL raw-frame TX/RX, and typed
Nic-adaptation proofs for the1ae0:0042PCI function, but no reusable gVNIC provider service, QEMU model, DQO/RDA path, or host conformance suite; it is a separate GCE portability lane, not a blocker for the virtio-net Web UI proof.
<Device> Driver Specification
Copy this file to docs/devices/<device>.md, set the front matter
(status, description, last_reviewed, topics), add the page to
docs/SUMMARY.md, and fill in the three sections below. Document only the
subset the driver actually implements; cite, do not transcribe, the full spec.
1. Spec basis
- Device: name, PCI/MMIO class and IDs (vendor/device), instance shapes.
- Authoritative spec(s): name, version, and source (URL or ref). For open vendor devices without a published register spec, cite the upstream open-source driver and any datasheet as the basis of record, and say so explicitly.
- Reference driver(s) (optional): upstream implementations cross-checked for behavior.
2. Wire format (relevant subset)
- Registers / BARs: BAR layout, register map offsets, doorbell offsets the driver reads or writes.
- Queues / rings: queue kinds (admin/management vs I/O), ring layout, sizes.
- Descriptor + completion formats: the descriptor and completion entry fields the driver encodes/decodes, including flags and status codes.
- Admin / management commands: feature negotiation, identify/configure, and lifecycle commands the driver issues.
3. capOS mapping
- Authority gate: how the device is enumerated, claimed, and bound through the reviewed userspace-driver hardware-authority gate and the device-manager ownership ledger.
DeviceMmio: which BAR ranges are mapped, with what page attributes (device-uncacheable, NX), and how register/doorbell writes are scoped.Interrupt: MSI/MSI-X vector binding, completion-IRQ waiter model.DMAPool: queue/buffer DMA allocation, the selected DMA backend (direct IOMMU vs labeled bounce buffer), quiesce/scrub-before-reuse rules, and the host-physical-address / IOVA non-exposure policy.- Fail-closed / validation rules: stale-generation rejection, BAR bounds, doorbell scoping, malformed-descriptor handling, release/reset/driver-death teardown.
- QEMU-emulable vs hardware-only: which parts are end-to-end provable in
QEMU (and the
make run-*target) versus hardware-only (host-conformance gate now, deferred live proof when the hardware is provisioned).
virtio-net (modern PCI NIC)
This is a provenance map for the in-tree virtio-net driver: it cites the spec, summarizes only the wire-format subset the code actually implements, and points into the implementation. It is not a re-spec – where the spec is implemented unchanged, it links rather than transcribes. The driver is mature and transitional (in-kernel today, slated to move to a userspace network-stack process), so the treatment is a concise map rather than exhaustive register tables.
1. Spec basis
- Device: virtio network device, modern (virtio 1.x) PCI transport.
PCI vendor
0x1af4; device0x1041(modern) /0x1000(transitional). IDs atkernel/src/pci.rs(VIRTIO_VENDOR_ID,VIRTIO_NET_MODERN_DEVICE_ID,VIRTIO_NET_TRANSITIONAL_DEVICE_ID; matched byPciDevice::is_virtio_net). - Authoritative spec: Virtual I/O Device (VIRTIO) Version 1.2, OASIS Committee Specification 01 (2022-07-01). Source: https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html. Relevant sections: 4.1 (virtio over PCI bus), 2.7 (split virtqueues), 5.1 (network device).
- Reference: cross-checked against the Linux
virtio_netandvirtio_pci_moderndrivers for the modern-transport handshake and split-ring layout.
2. Wire format (implemented subset)
Only the modern split-ring subset the driver uses is summarized here; feature bits and structures the spec defines but the driver does not specially handle are linked, not transcribed.
- PCI capabilities / BAR layout: virtio modern PCI vendor capabilities
(common / notify / ISR / device / PCI cfg) parsed from the capability list;
type constants
VIRTIO_PCI_CAP_COMMON_CFG/..._NOTIFY_CFG/..._ISR_CFG/..._DEVICE_CFG/..._PCI_CFGand length floorsVIRTIO_PCI_CAP_MIN_LEN/VIRTIO_COMMON_CFG_MIN_LENinkernel/src/virtio.rs; common-config register offsets are thetransport::COMMON_*constants (COMMON_DEVICE_FEATURE,COMMON_QUEUE_SELECT,COMMON_QUEUE_NOTIFY_OFF, …). The notify capability carriesnotify_off_multiplier(ModernTransport::notify_off_multiplier) used to compute per-queue notify addresses. - Split-ring layout: 16-byte descriptors (
transport::VIRTQ_DESC_SIZE), available and used ring offsets, and thetransport::VIRTQ_DESC_F_NEXT/transport::VIRTQ_DESC_F_WRITEflags. Descriptor lifecycle is generation-tracked through a boundedtransport::VIRTQ_DESCRIPTOR_TRACKING_SLOTSslot array (DescriptorTrackingSlot). - Queues: RX queue (
VIRTIO_NET_RX_QUEUE), TX queue (VIRTIO_NET_TX_QUEUE), negotiated to a bounded target size (VIRTIO_NET_QUEUE_TARGET_SIZE); the target size must not exceed the tracking slot count (compile-timeassert!againsttransport::VIRTQ_DESCRIPTOR_TRACKING_SLOTS). - Net header / framing: 12-byte
virtio_netheader prepended to frames (VIRTIO_NET_HDR_LEN); proof TX buffers carry the header plus a minimum Ethernet frame (TX_PROOF_BUFFER_LEN,TX_PROOF_ETHERNET_OFFSET). - Feature negotiation: device/driver feature select/read registers in the
common config; the driver negotiates
transport::VIRTIO_F_VERSION_1/transport::VIRTIO_F_ACCESS_PLATFORM(generic, intransport) plus the net-specificVIRTIO_NET_F_MAC(1 << 5) and acknowledgesVIRTIO_NET_F_MRG_RXBUF(1 << 15).
3. capOS mapping
-
Binding (transitional): virtio-net is currently driven in the kernel. PCI/MSI-X transport discovery, the split-ring transport, smoltcp, TCP listeners, the line discipline, and the Telnet IAC filter live in
kernel/src/virtio.rsandkernel/src/cap/network.rs. This is explicitly transitional: Phase C of the networking proposal (docs/proposals/networking-proposal.md) moves the NIC driver and stack into a userspace network-stack process once the userspace-driver authority gate applies to it. Until then it does not bind through theDeviceMmio/Interrupt/DMAPoolprovider grants the DDF cloud-NIC drivers use; the sections below describe its kernel-owned equivalents. -
MMIO: modern-transport common/notify/ISR/device config regions are mapped from the device BARs and accessed through the
transportMMIO helpers (kernel/src/virtio.rstransportmodule). Doorbell (queue-notify) writes are scoped to the per-queue notify address computed fromnotify_off_multiplier; the DDFDeviceMmiocap (kernel/src/cap/device_mmio.rs) is the userspace successor surface. -
Interrupt: MSI-X vectors are programmed for config and per-queue interrupts; route records and vector dispatch are tracked by the kernel-owned device-interrupt ledger (
kernel/src/device_interrupt.rs). Themake run-netsmoke asserts MSI-X metadata selection, vector-pool/exhaustion policy, masked route lifecycle, queue vector assignment, descriptor guards, ARP, and ICMP. Device-autonomous delivery proofs live in the dedicated userspace-provider MSI-X gates, not in the retired kernel L4 owner. -
DMA: ring pages and TX/RX buffers are allocated and accounted through the net-keyed kernel DMA ledger (
kernel/src/device_dma.rs).make run-netruns without an emulated IOMMU, so DMA uses the intended bounce-buffer fallback; no host physical address or IOVA is exposed beyond the kernel boundary. -
Production cloud build cfg surgery (DMA ledger + DDF caps):
kernel/src/device_dma.rsand the cap surfaceskernel/src/cap/dma_pool.rs(DmaPoolCap/DmaPoolCapInfo) andkernel/src/cap/dma_buffer.rs(DmaBufferCap/DmaBufferCapInfo) compile in the non-qemubuild. Thecloud-prod-dmapool-bounce-buffer-grant-proofwires the first production caller throughkernel/src/cap/dmapool_bounce_buffer_grant_proof.rs: it stages a parked manager-attachedDMAPoolrecord over one DMA-capable PCI function from the inventory (stage_bounce_buffer_dmapool_recordinkernel/src/device_manager/stub.rs), builds aDmaPoolCapover the parked handle, allocates one bounded bounce-bufferDMABufferthroughdevice_manager::issue_manager_attached_dmabuffer_handle_with_request(which routes todevice_dma::allocate_manager_attached_dmapool_bounce_buffer_page), asserts cap-info labels (userspace_dmapool=manager-issued-bounce-buffer,allocation=single-bounce-buffer-page,real_dma=not-attempted,direct_dma=blocked,host_physical_user_visible=0,iova_export=disabled-future-only), thedma_backend::select_and_reportbounce-bufferverdict, quiesce-before- release (release_dmapool_record_for_cap_releasereturnspending-buffer-releasewhile the buffer is live), scrub-before-reuse (the released bounce-buffer frame is zeroed in place before the frame returns to the allocator), and stale-handle-after-detach, then emitscloudboot-evidence: dma-pool-grant <token>for the cloudboot harness. The qemu-only surface that stays gated includes thecap::dmapool_grant_sourcebootstrap source (kernel/src/cap/dmapool_grant_source.rs), theKernelCapSource::DmaPoolgrant arms inkernel/src/cap/mod.rsandkernel/src/cap/process_spawner.rs, theDmaBufferCompleteDescriptorAdmission::provider_cq_eventfield that carriescap::interrupt_grant_source::ProviderCompletionCqEventIdentity, and the entirekernel/src/device_manager/qemu_full.rsDDF backend (includingdevice_dma::{begin_virtio_net_pool, allocate_virtio_net_page, ...}). The proof maps no userspace VMA, programs no real DMA, attaches no queue, programs no interrupt, and emits noprovider-nic-bound/storage-bound; descendants indocs/backlog/hardware-boot-storage.md#cloud-device-trackscover those. -
Fail-closed rules: requested ranges are validated against device-reported geometry and destination buffer length before any device access; descriptor reuse is generation-tracked; the bounded tracking-slot array (
transport::VIRTQ_DESCRIPTOR_TRACKING_SLOTS,DescriptorTrackingSlot) caps in-flight descriptors. Stale/over-range requests fail closed. -
QEMU-emulable vs hardware-only: fully QEMU-emulable. QEMU provides virtio-net-pci;
make run-netis the end-to-end proof. No hardware-only path – this is the local-binding reference the cloud NIC drivers (ENA, MANA, GCP virtio-net) mirror for their QEMU-provable halves. -
GCP cloud-shape classification: GCP 1st/2nd-gen x86 non-Confidential machine families (e.g.
n1-*,e2-*) present the virtual NIC as exactly this standard virtio-net device (vendor0x1af4) under a no-IOMMU / SWIOTLB bounce-buffer DMA backend, so the QEMUvirtio-net-pcibinding is the local precursor for the GCP NIC path. The enumeration path emits avirtio-net: cloud shape classificationproof line (kernel/src/pci.rsreport_cloud_virtio_net_shape) classifying the enumerated function against that documented GCP surface; bothmake run-netandmake run-ddf-provider-consumerassert it conjunctively with the GCP-mapped bounce-bufferdma: backend selectionline (kernel/src/dma_backend.rsselect_and_report). The GCP→bounce-buffer mapping itself is the support-policy expectation recorded indocs/research/cloud-dma-provider-evidence.md. The proof carries explicit scope flags (local_qemu_precursor=true,real_gcp_enumeration=not-claimed,gvnic=separate-driver-out-of-scope); live GCP enumeration and cloud used-ring ownership remaincloud-gcp-virtio-net-nic-driver. -
Production cloud-boot evidence marker (
dma-backend): the production boot path (the kernel built without theqemufeature, which is whatmake capos-cloudboot-imagepackages) emits the parseablecloudboot-evidence: dma-backend <token>serial marker thetools/cloudboot/harness reads (serial_marker_tokens; “Serial evidence-marker contract” intools/cloudboot/README.md). It is emitted bykernel/src/dma_backend.rsselect_and_report(always-compiled, so it fires on the production cloud image, not just theqemusmoke build) alongside the human-readabledma: backend selectionline. The marker uses the harness token namespace (direct_dma/trusted_domain/bounce_buffer), mapped from the resolvedDmaBackendbycloudboot_evidence_token– deliberately not theDmaBackendDisplay string (direct-remapping/bounce-buffer). The current two-variant resolved backend maps todirect_dma/bounce_buffer; on the probed GCE shapes (IOMMU disabled) the value isbounce_buffer. Thetrusted_domainslot has no current producer and is reserved. This marker is honest read-side evidence of the boot-time DMA-backend selection; it asserts no device bind and is independent of the bound-through-authorityprovider-nic-boundmarker, which remains thecloud-gcp-virtio-net-nic-driverclaim. -
Production cloud-boot evidence marker (
device-class): the production boot path also emits the companioncloudboot-evidence: device-class <token>serial marker (one per distinct enumerated PCI base class, harness-deduped viasort -u; “Serial evidence-marker contract” intools/cloudboot/README.md).- Spec basis: PCI base-class codes from the PCI Code and ID Assignment
Specification (PCI-SIG); the base class is the high byte of the
class-code/revision dword at config-space offset
0x08(kernel/src/pci.rsPCI_CLASS_REVISION). - Implemented wire-format subset: a genuinely read-only config-space scan
over the production source resolved from the boot-time MCFG probe: ECAM when
MCFG validates, otherwise legacy I/O.
report_cloudboot_device_class_evidence(kernel/src/pci.rs) walks each bus/device/function viafor_each_enumerated_functionand the read-onlyfunctions_to_scanhelper, reading only the vendor-id, header-type, and class-code (PCI_CLASS_REVISION) words. The base class is the high byte ofPCI_CLASS_REVISION. It deliberately does not callread_device/read_bars, which would perform transient BAR-sizing config writes. Each distinct base class is emitted once in ascending order, formatted{:#04x}(e.g.0x02). The marker is emitted fromkernel/src/main.rsrun_init, so it fires on every build configuration (not only theqemu/diagnosticsPCI-diagnostics path), including the non-qemuproduction cloud image. - capOS mapping: enumeration evidence only – it allocates no
DeviceMmio/Interrupt/DMAPool, claims no device ownership, performs no bus-master enable, BAR mapping, BAR-sizing write, or DMA, and never emitsprovider-nic-bound. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by a
QEMU boot of
target/disk.raw(themake capos-cloudboot-imageproduction image; README “Local boot test”), which shows base classes0x01(storage),0x02(network),0x03(display), and0x06(bridge). No GCE resources are created and nomake cloudboot-testrun is required.
- Spec basis: PCI base-class codes from the PCI Code and ID Assignment
Specification (PCI-SIG); the base class is the high byte of the
class-code/revision dword at config-space offset
-
Production cloud-boot evidence marker (
device-inventory): the production boot path also emits a per-function PCI claim-identity inventory so later bind children discover the real device identity instead of assuming the QEMU-fixed BDF layout the--features qemupath hard-codes. It emits a human-readablepci-inventory:detail line per enumerated function plus the parseablecloudboot-evidence: device-inventory <token>marker (one per function, harness-deduped viasort -u; “Serial evidence-marker contract” intools/cloudboot/README.md).- Spec basis: the PCI Local Bus Specification (PCI-SIG) Type 0 configuration
header — vendor/device ids at offsets
0x00/0x02, the class-code triple (base class / subclass / prog-if) in the high three bytes of the class-code/revision dword at offset0x08, header type at offset0x0e, and interrupt line / pin at offset0x3c(§6.1 “Configuration Space Organization”). BAR registers are not part of this production marker. - Implemented wire-format subset:
report_cloudboot_device_inventory_evidence(kernel/src/pci.rs) walks each bus/device/function viafor_each_enumerated_functionand the read-onlyfunctions_to_scanhelper. For each present function,read_cloudboot_inventory_recordreads only vendor/device, class/subclass/prog-if, revision, header type, interrupt pin, and interrupt line.report_cloudboot_inventory_recordformats one identity token:<seg>.<bus>.<dev>.<fn>-<vendor>.<device>-<class>.<subclass>.<progif>-rev.<rev>-hdr.<hdr>-irq.<pin>.<line>. It is emitted fromkernel/src/main.rsrun_initright after thedevice-classmarkers, on every build configuration including the non-qemuproduction cloud image. - capOS mapping: read-only enumeration evidence. The production marker
performs no BAR-size probe, config-space write, BAR mapping, bus-master /
memory-space / IO-space command-bit enable, doorbell write, DMA, or device
ownership claim, and never emits
provider-nic-bound. The later cloud-NIC bind children consume this inventory to resolve the real PCI function identity instead of the QEMU-fixed BDF fixtures; BAR/MMIO authority is proven by separateDeviceMmioevidence paths. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by a
QEMU boot of
target/disk.raw(themake capos-cloudboot-imageproduction image; README “Local boot test”), which shows the per-functionpci-inventory:lines anddevice-inventorymarkers for the emulated functions (virtio-net1af4, storage, display, bridge). No GCE resources are created and nomake cloudboot-testrun is required.
- Spec basis: the PCI Local Bus Specification (PCI-SIG) Type 0 configuration
header — vendor/device ids at offsets
-
Production-build
device_manager/DeviceMmiocompile surface:kernel/src/device_manager/mod.rsis now always compiled, but it is a thin orchestrator that re-exports a shared subset (error.rs,handles.rs,mmio.rs,types.rs—DeviceManagerError,DeviceMmioHandle/DeviceOwner/PciBdf/DeviceMmioRegion, the MMIO record / map / unmap / read32 / write32 admission types,DeviceMmioCapReleaseOutcome,ProviderNotifyDoorbellWrite) plus a feature-gated implementation: undercfg(feature = "qemu")it routes throughqemu_full.rs(the full DDF surface —dma_buffer.rs/dma_pool.rs/interrupt.rs/proofs.rs, NVMe brokered controller registers, IOMMU domain ledgers, virtio TX/RX ring publication); undercfg(not(feature = "qemu"))it routes throughstub.rs, which now carries a bounded one-slot parked-region path used by the production bar-readback proof (stage_bar_readback_region,validate_devicemmio_record,read_devicemmio_u32,detach_devicemmio_record_for_cap_release,trigger_*_for_devicemmio). The DMA/write/notify/map shims still reportDeviceMmioStaleHandlebecause no production caller exists yet for those; the descendant slices indocs/backlog/hardware-boot-storage.mdun-gate them through the reviewed grant path.kernel/src/cap/device_mmio.rsand itssuper::hardware_audit/super::hardware_release_logaudit hooks are likewise always compiled. TheKernelCapSource::DeviceMmiouser-facing grant arm inkernel/src/cap/mod.rsstayscfg(feature = "qemu")-gated; the production bar-readback proof builds itsDeviceMmioCapfrom boot (cap::devicemmio_bar_readback, see below) without going through that user-facing grant arm. Thecrate::iommumodule and the realkernel/src/virtio.rsstaycfg(feature = "qemu")-gated. Thecrate::device_dmamodule compiles in both builds for the dmapool-grant proof, and thecrate::device_interruptmodule compiles in both builds for the interrupt route/source allocation proof below; theirKernelCapSource::Interruptuser-facing grant arm andinterrupt_grant_sourcebootstrap-grant module inkernel/src/cap/mod.rsstaycfg(feature = "qemu")-gated. -
Production cloud-boot evidence marker (
device-mmio-bar-read): the production boot path also exercises one PCI function’s first memory BAR through the reviewedDeviceMmioCapread32surface and emits a parseablecloudboot-evidence: device-mmio-bar-read <token>marker.- Spec basis: PCI Local Bus Specification (PCI-SIG) Type 0 memory BAR
semantics. The marker carries the function’s BDF, the BAR index, the
32-bit value read at offset 0, and the kernel-mapped window length. The
kernel-side cache policy is device-uncacheable (UC) + NX + GLOBAL +
WRITABLE, matching the existing
mem::paging::map_kernel_mmio_rangecontract for MMIO windows. - Implemented wire-format subset:
cap::devicemmio_bar_readback::report(kernel/src/cap/devicemmio_bar_readback.rs) enumerates PCI functions viapci::enumerate(), picks the first with a memory BAR of at least 4 KiB at a non-zero base, maps the first 4 KiB of that BAR throughmem::paging::map_kernel_mmio_range, stages a parked region throughdevice_manager::stage_bar_readback_region(one slot, mapping generation monotonic), constructs aDeviceMmioCapover the resultingDeviceMmioHandle, and callscap.read32(0). The read goes through the samevalidate_devicemmio_record→ range/alignment check →read_volatilepath as the qemu DDF surface; on the production path the parked region’s recorded kernel virtual address backs the read. The marker token shape is<seg>.<bus>.<dev>.<fn>-b<bar>-<value>-len.<len>(value in 32-bit hex with0xprefix, length in hex bytes), inside the harness grammar[A-Za-z0-9._-]+. - Fail-closed assertions: the proof immediately retries
read32at exactlylengthand assertsrange_result != "ok"(out-of-range read is rejected with no MMIO side effect), then detaches the parked record throughdetach_devicemmio_record_for_cap_releaseand asserts the nextread32(0)fails closed at the device manager (DeviceMmioStaleHandle). Both outcomes are logged on adevicemmio-bar-readback: range_bounding .../stale_generation ...line so a regression trips the boot log alongside the missing marker. - capOS mapping: the mapping is boot-only kernel-half (no userspace VMA
is exposed by this proof); revocation drops the parked slot, which
invalidates the cap-side identity without removing the kernel mapping
itself (the boot-only mapping stays installed for the rest of the boot).
The descendant userspace-driver slices in
docs/backlog/hardware-boot-storage.md#cloud-device-tracksadd the userspace VMA path with TLB shootdown on revoke. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
a QEMU boot of
target/disk.raw(themake capos-cloudboot-imageproduction image; README “Local boot test”), which shows the marker for the emulated virtio function. No GCE resources are created and nomake cloudboot-testrun is required. The qemu build keeps the existingmake run-devicemmio-grantsmoke as the end-to-end DDF proof; the bar-readback caller incap::devicemmio_bar_readbackis gated to the production (non-qemu) build so it does not collide with the qemu DDF surface’s ownDeviceMmioclaim path.
- Spec basis: PCI Local Bus Specification (PCI-SIG) Type 0 memory BAR
semantics. The marker carries the function’s BDF, the BAR index, the
32-bit value read at offset 0, and the kernel-mapped window length. The
kernel-side cache policy is device-uncacheable (UC) + NX + GLOBAL +
WRITABLE, matching the existing
-
Production cloud-boot evidence marker (
interrupt-route-allocated): the production boot path also exercises one PCI function’s MSI-X capability through the revieweddevice_interruptvector pool and emits a parseablecloudboot-evidence: interrupt-route-allocated <token>marker.- Spec basis: PCI Local Bus Specification (PCI-SIG) 3.0 §6.8.2 / PCI Express Base Specification 4.0 §7.7.2.2 MSI-X capability structure. The capability header dword exposes Control (function-mask, table-size-1, enable), the Table BIR/Offset dword exposes the BAR index in the low 3 bits and the byte offset in the upper bits (each table entry is 16 bytes), and the PBA BIR/Offset dword exposes the Pending Bit Array location. The marker carries the function’s BDF, the selected MSI-X table entry, the kernel-pool MSI vector, and the route/source generation pair allocated for the entry. No live MSI-X table write or device interrupt is performed on this path.
- Implemented wire-format subset:
cap::interrupt_route_alloc::report(kernel/src/cap/interrupt_route_alloc.rs) enumerates PCI functions viapci::enumerate(), walks each function’s capability list throughpci::capabilities, parses MSI-X capability fields throughpci::interrupt_capabilities/parse_msix_capability(offset,control,table_size,table_bir,table_offset,pba_bir,pba_offset, both validated through the existing MSI-X region BAR checks), picks the first MSI-X capability withtable_size >= 1, and allocates a kernel-owned MSI vector + interrupt source/route record over its first table entry (SELECTED_TABLE_ENTRY = 0) through the productiondevice_interrupt::register_pci_msix_route_by_bdfvector pool (kernel/src/device_interrupt.rs,lapic::DEVICE_MSI_VECTOR_BASE = 0x50, 16 device-MSI vectors). It thendevice_interrupt::claim_routes the route underDeviceInterruptDriver::ManagerGrantSource. The marker token shape is<seg>.<bus>.<dev>.<fn>-entry.<n>-vector.<hex>-src.<id>.gen.<g>-route.gen.<g>(vector in 2-digit hex, source-id and generations decimal), inside the harness grammar[A-Za-z0-9._-]+. - Fail-closed assertions: the proof asserts three invariants inline
before emitting the marker. (1) Claimed-state visibility:
validate_claimed_routesucceeds for the correctManagerGrantSourceowner and fails closed withWrongOwnerfor a distinctKernelIoApicProofowner – the route is owner-scoped. (2) Duplicate-source rejection: a secondregister_pci_msix_route_by_bdfagainst the same(bdf, table_entry)while the original route is live is rejected withDuplicateSource– the source identity is unique. (3) Stale-after-release:release_claimed_routeclears the slot and a subsequentvalidate_claimed_routeon the same handle fails closed withStaleRoute– no stale handle can re-enter the route table. Each outcome is logged on aninterrupt-route-alloc: claimed_state .../duplicate_source .../stale_after_release ...line so a regression trips the boot log alongside the missing marker. - capOS mapping: route/source-allocation evidence only. The proof
parses the MSI-X capability and consumes one slot from the kernel-owned
device-MSI vector pool, then returns it on release; it does NOT map
the MSI-X table or PBA BAR window, write a table entry, program a
LAPIC dispatch slot for live delivery, raise/handle a device
interrupt, install a waiter, acknowledge an EOI, or exercise
mask/unmask/reset on the live vector. No
provider-nic-boundorstorage-boundmarker. The follow-on live-delivery proof (interrupt-route-delivered) extends this surface; see the next section. Thecap::interrupt_route_alloccaller is gated to the production (non-qemu) build so it does not collide with the qemu DDF surface’s ownInterruptclaim path. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
a QEMU boot of
target/disk.raw(themake capos-cloudboot-imageproduction image; README “Local boot test”), which shows the marker for the emulated virtio function (QEMU’s modern virtio-pci front-end exposes a per-function MSI-X capability). No GCE resources are created and nomake cloudboot-testrun is required. The qemu build keeps the existingmake run-interrupt-grantandmake run-netsmokes as the end-to-end DDF and virtio-net MSI-X proofs.
-
Production cloud-boot evidence marker (
interrupt-route-delivered): the production boot path then extends the route-allocation proof to live MSI-X delivery: it programs the table entry, attaches the route to device manager, arms the deferred-LAPIC-EOI gate, injects one grant- source dispatch, retires the deferred EOI, masks and re-injects to prove no stale wake, reassigns to bump the route generation, asserts the stale handle + stale pending token both fail closed, then releases. Emits onecloudboot-evidence: interrupt-route-delivered <token>marker.- Spec basis: PCI Local Bus Specification 3.0 §6.8.2 / PCI Express
Base Specification 4.0 §7.7.2.2 MSI-X table entry layout (16-byte
entries: 64-bit message address, 32-bit message data, 32-bit vector
control with bit 0 = entry mask) plus the per-spec mask-first write
ordering (the entry mask must be asserted before message address/data
are torn). Intel SDM Vol. 3A §10.8 LAPIC EOI semantics for the
deferred-EOI write retired by
acknowledge_deferred_lapic_eoi_for_routeagainst the LAPIC EOI register (arch::x86_64::lapic::eoi). - Implemented wire-format subset:
cap::interrupt_delivery_proof::report(kernel/src/cap/interrupt_delivery_proof.rs) reusespci::map_msix_tableto map the MSI-X table BAR window kernel-side (UC + NX + GLOBAL + WRITABLE throughmem::paging::map_kernel_mmio_range) andpci::write_msix_table_entryto program entry 0 with the route’smessage_address(fromarch::x86_64::lapic::current_device_msi_delivery) andmessage_data(the allocated kernel-pool vector) under per-spec mask-first ordering. It then attaches the route throughdevice_interrupt::attach_claimed_route_to_device_manager, enables the deferred-LAPIC-EOI gate viadevice_interrupt::enable_deferred_lapic_eoi_for_route, unmasks the route throughdevice_interrupt::unmask_device_manager_attached_routeand the table entry throughpci::set_msix_table_entry_mask, drives one injected dispatch throughdevice_interrupt::handle_lapic_delivery(the same dispatch slot the qemumake run-interrupt-grantproof andnvme-admin-interrupt-deliveryexercise), retires the deferred EOI viadevice_interrupt::acknowledge_deferred_lapic_eoi_for_route, masks both surfaces and re-injects throughdevice_interrupt::record_lapic_delivery, reassigns viadevice_interrupt::reassign_claimed_routeto bump the route generation, and asserts stale-handle / stale-pending-token rejection throughdevice_interrupt::validate_claimed_route/device_interrupt::check_pending_lapic_token. - Fail-closed assertions: five inline assertions gate the marker.
(1) Live delivery:
handle_lapic_deliveryreturns aDelivered { .. }outcome bound to the live route’s(source_id, source_generation, route_generation, owner),delivery_countadvances by 1,eoi_deferred=true, andpending_deferred_eoi_count >= 1. (2) Ordered acknowledge:acknowledge_deferred_lapic_eoi_for_routereportseoi_written=true,ack_delta=1, andpending_after=0– each pending unit retires exactly one LAPIC EOI through the counter- based exclusiondevice_interrupt.rsdocuments atacknowledge_deferred_lapic_eoi_for_route/close_deferred_eoi_gate_and_drain. (3) Masked no-wake: after mask,record_lapic_deliveryreturnsMasked { state: ClaimedMasked, .. }anddelivery_countdoes not advance. (4) Reassign generation bump + stale handle: the prior handle’svalidate_claimed_routereturnsStaleRoute; the stale pending token’scheck_pending_lapic_tokenreportswake_blocked=truewith eitherUnregistered(the live evidence case: reassign’sfirst_available_vectorruns beforeclear_dispatch_slotretires the old vector, so the next pool slot is chosen and the stale token names an unregistered vector) orSourceRouteGenerationMismatch(the single-slot-pool degenerate case where reassign reused the same vector); and a fresh injected dispatch under the reassigned route + vector lands on the new generation while leaving the stale token blocked. (5) Release:release_claimed_routeclears the slot andvalidate_claimed_routeon the reassigned handle now fails closed withStaleRoute. Each outcome is logged oninterrupt-delivery: live_delivery .../ordered_acknowledge .../masked_no_wake .../reassign_stale .../release ...lines so a regression trips the boot log alongside the missing marker. - capOS mapping: route/source allocation + live delivery + ordered
acknowledge + mask/unmask + reset/reassignment + stale-route-generation
rejection, all on the production cloud kernel. The MSI-X table entry
is programmed but the PCI function-level
MSIX_CONTROL_ENABLEbit is intentionally NOT toggled (the proof never enables MSI-X on the function, so no real device-autonomous interrupt can fire on the programmed entry); the proof exits with the table entry re-masked. There is no userspaceInterruptwaiter on the production cloud kernel yet, so the proof’s “waiter wake” boundary is the kernel-side dispatch slot a real provider waiter would consume — the marker reportswaiter_wake=kernel-side-proxyrather than overclaiming a provider-cap-side wake. Noprovider-nic-boundorstorage-boundmarker. Thecap::interrupt_delivery_proofcaller is gated to the production (non-qemu) build so it does not collide with the qemu DDF surface’s ownInterruptclaim path; the qemu build keepsmake run-interrupt-grantas the broader end-to-end exercise of the interrupt grant surface with the full DDF backend. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved
locally by a QEMU boot of
target/disk.raw(themake capos-cloudboot-imageproduction image; README “Local boot test”), which shows the marker for the emulated virtio function. No GCE resources are created and nomake cloudboot-testrun is required. PBA handling is recorded by including thepba_birandpba_offsetfromMsixCapabilityInfoin the proof’sokline; the kernel does not read or clear PBA bits (devices set them, and this proof never enables the function so no PBA bit can be set in practice).
- Spec basis: PCI Local Bus Specification 3.0 §6.8.2 / PCI Express
Base Specification 4.0 §7.7.2.2 MSI-X table entry layout (16-byte
entries: 64-bit message address, 32-bit message data, 32-bit vector
control with bit 0 = entry mask) plus the per-spec mask-first write
ordering (the entry mask must be asserted before message address/data
are torn). Intel SDM Vol. 3A §10.8 LAPIC EOI semantics for the
deferred-EOI write retired by
-
Production cloud-boot evidence marker (
provider-nic-bound): the gate the billable GCE run consumes throughtools/cloudboot/NIC_PROOF_MARKER/--require-provider-nic-proof. It is sourced from real userspace driver progress: the marker fires only after the always-built polled virtio-net providercap::virtio_net_polled_providerhas completed a TX+RX over the live function and observed the RX completion by polling the latched used ring (zero kernel-injected interrupts). The predecessor staged its ownDeviceMmioDMAPool/DMABuffer+ MSI-XInterruptgrant surfaces at boot and proved the “queue-completion handoff” by callingdevice_interrupt::handle_lapic_delivery— a kernel-side dispatch-slot proxy (theinject_real_lapic_int_for_proofprecedent). That proxy is removed as the source of the gate:cap::provider_nic_bind_proof::reportnow runs once at boot and emits no marker (it records the deferral to the real provider’s completion); the marker is emitted later fromcap::provider_nic_bind_proof::report_real_completion, called from the provider’s release-time completion path.
- Spec basis: virtio 1.2 §2.7 split-ring used-ring semantics (the device
writes a used element; the driver observes
used.idxadvance) — the completion the provider polls; virtio 1.2 §5.1.6 virtio-net receiveq frame layout (12-byte modern header + ethernet frame) for the EtherType read-back; inherited MSI-X table layout / mask-first ordering (PCI 3.0 §6.8.2) only for the release-time route assertion chain, which never delivers an interrupt on the completion path. - Implemented wire-format subset:
cap::virtio_net_polled_provider(staged when the booted manifest declares thecloud-provider-nic-bound-real-polled-driver-smokebinary) drives the modern virtio status sequence toDRIVER_OK, materializes the RX virtqueue (queue 0) + TX stimulus virtqueue (queue 1), holds the PCI function-level MSI-X enable mask-first, maps the notify region, and programs the RX MSI-X route over table entry 0 (used only by the release-time assertion chain). Itsattempt_rx_submit(admitted from the userspaceDMABuffer.submitDescriptor(queue=0)) publishes the RX descriptor (VIRTQ_DESC_F_WRITE), drives the ARP TX stimulus, polls the latched used ring for the one real device->host RX DMA, and resets the device; itsinvoke_waitreads the latchedPublishedRxwithdelivery_countunchanged.report_real_completionthen sources theprovider-nic-boundtoken from thatPublishedRx(used.idx,used[0].id,used[0].len, observed EtherType) plus the picked function identity. - Fail-closed assertions:
report_real_completionre-asserts the real RX completion facts independently of the provider’s own gate before the marker is emitted. (1) Real device->host RX DMA: the latched used ring advanced exactly once (used.idx == 1), the completion is the posted descriptor (used[0].id == 0), the device wrote a non-empty frame (used[0].len > 0), and the provider read back a non-zero EtherType. (2) Polled, not injected: the provider’sInterrupt.waitadvanced no kernel dispatch (provider_observed_dispatch == 0) and retired no deferred LAPIC EOI (provider_observed_ack == 0). On any regression aprovider-nic-bind: real-completion regression (no marker): ...line trips the boot log and no marker is emitted, soprovider.json’sprovider_nic_proofstaysnulland--require-provider-nic-prooffails closed. - capOS mapping: the marker is now backed by real userspace driver progress
on the production cloud kernel. It carries the real-provider labels
waiter_wake=polled-used-ring,rx_completion=polled-used-ring,int_injected=0,userspace_driver_authority=present-real-polled-provider,virtio_common_config_write=performed,provider_tx_rx=completed,device_autonomous_raise=not-claimed,host_physical_user_visible=0,direct_dma=blocked,iova_export=disabled-future-only, andlive_cloud=not-attempted— never the predecessor’swaiter_wake=kernel-side-proxy/userspace_driver_authority=absent-on-non-qemu. The RXqueue_msix_vectorstaysVIRTIO_MSI_NO_VECTORand the PCI function mask stays held, so the device cannot autonomously raise the MSI either; the completion stays polled. The literalsystem.cuefold (so a plain default cloudboot also emitsprovider-nic-boundfrom real progress without a focused manifest) is not yet implemented, to avoid perturbing themake runinteractive shell/login boot; device-autonomous MSI-X is the parallel future workcloud-prod-virtio-net-rx-device-autonomous-msix-raise-local-proof. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-provider-nic-bound-real-polled-driveron the default non-qemukernel with nocloud_*_prooffeature, on themake run-netdevice shape. No GCE resources are created;live_cloud=not-attempted.
-
Production cloud-boot evidence marker (
virtio-net-device-bringup): the production boot path, under the focused-proof Cargo featurecloud_virtio_net_device_bringup_proof, drives a bounded virtio-net device bringup sequence kernel-side over the same virtio function theprovider-nic-boundproof maps – but writes the virtio common-configuration status register (whichprovider-nic-boundnever does). It is the first device-activation step toward the still-blockedcloud-gcp-virtio-net-nic-drivertrack. Emits onecloudboot-evidence: virtio-net-device-bringup <token>marker on thetools/cloudboot/harness’s serial-port-1 path throughmake run-cloud-provider-virtio-net-bringup.- Spec basis: virtio 1.2 §3.1.1 device initialization (reset, ACKNOWLEDGE, DRIVER, feature discovery + driver-feature select, FEATURES_OK re-read, DRIVER_OK), §4.1 (modern virtio over PCI: common / notify / ISR / device / PCI-cfg capabilities, common-config register layout from Table 4.1).
- Implemented wire-format subset:
cap::virtio_net_device_bringup_proof::report(kernel/src/cap/virtio_net_device_bringup_proof.rs) picks the virtio-net PCI function (vendorVIRTIO_VENDOR_ID = 0x1af4, deviceVIRTIO_NET_TRANSITIONAL_DEVICE_ID = 0x1000/VIRTIO_NET_MODERN_DEVICE_ID = 0x1041) frompci::enumerate, walks the modern virtio PCI vendor-capability chain throughvirtio_transport::parse_modern_pci_transport_capabilities, maps the resolved common-configuration region throughpci::map_bar_region(UC + NX + GLOBAL + WRITABLE – same flags as the BAR-readback path), and drives the bringup using the sharedMmioRegionaccessors plusvirtio_transport::{read_device_features, write_driver_features, STATUS_ACKNOWLEDGE, STATUS_DRIVER, STATUS_FEATURES_OK, STATUS_DRIVER_OK, STATUS_FAILED, VIRTIO_F_VERSION_1, COMMON_NUM_QUEUES}. The selected driver feature word is exactlyVIRTIO_F_VERSION_1; no other device- or net-specific bit is accepted, so the proof never crosses into the queue-setup or descriptor surface the userspace virtio-net provider will own. - Fail-closed assertions: four inline assertions gate the marker.
(1) Negotiated feature set: the device’s offered 64-bit feature word
advertises
VIRTIO_F_VERSION_1, the written driver feature word equals exactlydevice_features & VIRTIO_F_VERSION_1. (2) Queue count visibility: the liveCOMMON_NUM_QUEUESread returns>= 2(virtio-net always exposes RX + TX virtqueues, which this proof does not publish). (3) DRIVER_OK observation: the post-DRIVER_OKstatus read carriesSTATUS_ACKNOWLEDGE | STATUS_DRIVER | STATUS_FEATURES_OK | STATUS_DRIVER_OKset withSTATUS_FAILEDclear. (4) Final reset: a write of0todevice_statusreads back as0. The proof wraps the status sequence so every exit path (success or any intermediate failure) writes0todevice_statusbefore returning, leaving the device in its post-reset state regardless of outcome. Per-stage outcomes log onvirtio-net-device-bringup: ok .../virtio-net-device-bringup: ... failed closed: ...lines so a regression trips the boot log alongside the missing marker. - capOS mapping: focused-proof child of
provider-nic-boundthat extends the proven bind composition with virtio’s status sequence, kernel-side, over the same mapped BAR. The PCI function-levelMSIX_CONTROL_ENABLEbit stays untoggled, no queue is published, no descriptor is written, no doorbell is rung, and no userspace virtio-net provider cap is issued. The marker’s trailing labels (queue_setup=not-attempted,tx_descriptor=not-published,userspace_cap=not-issued,msix_function_enable=not-toggled,device_autonomous_raise=not-attempted,live_cloud=not-attempted) re-anchor those bounds. Queue setup, descriptor publication, doorbell writes, and a userspace virtio-net provider on the production cloud boot manifest stay deferred to the still-blockedcloud-gcp-virtio-net-nic-drivertrack. Thecap::virtio_net_device_bringup_proofcaller is gated tocfg(all(not(feature = "qemu"), not(feature = "cloud_provider_cap_waiter_proof"), feature = "cloud_virtio_net_device_bringup_proof")); the qemu build keepsmake run-net/make run-ddf-provider-consumeras the end-to-end exercise of the same surface with the full driver. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved
locally by
make run-cloud-provider-virtio-net-bringup, which boots the focused-proof cloudboot kernel + manifest under QEMU and asserts the marker on serial. No GCE resources are created.
-
Production cloud-boot evidence marker (
nic-driver-userspace-features-ok): the userspace virtio handshake step of the Phase C NIC-driver relocation track. Under the focused-proof Cargo featurecloud_virtio_net_userspace_features_ok_proof, thecap::devicemmio_grant_source_prodsource stages the picked virtio-net function’s modern virtio common-configuration window (resolved throughvirtio_transport::parse_modern_pci_transport_capabilities, mapped at the region’s first byte) as a writable selected-writeDeviceMmiogrant (stage_virtio_net_common_config). The userspacecloud-prod-nic-driver-userspace-features-ok-smokeservice then drives the virtio device handshake from userspace – the authority delta from the kernel-sidevirtio-net-device-bringupproof, which drives the same sequence in the kernel.- Authority delta: the handshake registers move from kernel-internal MMIO
to a userspace driver over the existing
DeviceMmio.read32/write32path. The write admission (device_manager::stub::write_devicemmio_u32under the feature) admits exactly four common-config registers –device_feature_select(0x00),driver_feature_select(0x08),driver_feature(0x0C), anddevice_status(0x14, written as a single byte) – each range-checked against the decoded BAR and kernel read-back-asserted (feature-register read-backs must echo the written value;device_statusis left to the driver’s own re-read since the device may legitimately diverge). This is the same selected-write + range-check + read-back discipline the notify doorbell (notifyDoorbell @5) and the NVMeCCreset write (cloud_nvme_controller_reset_proof) already enforce – not a new write primitive. - Fail-closed assertions: the shim drives reset -> ACKNOWLEDGE -> DRIVER ->
read device features -> write the negotiated driver features
(VIRTIO_F_VERSION_1 only) -> FEATURES_OK, re-reading
device_statusto confirm FEATURES_OK stuck, then proves aqueue_desc(0x20) write fails closed (result=write-blocked register_write=blocked). The released cap fails closed on the next call. The headlinecloudboot-evidence: nic-driver-userspace-features-ok <token>marker lands only after every assertion passes. - capOS mapping: the handshake step of the
Phase C userspace NIC driver relocation.
Queue/vring and IRQ ownership stay kernel-owned: queue-address registers
fail closed, so no buffer address is ever programmed (the userspace-ownable
vring over the DMA-isolation track is the next capability below). The marker’s
trailing labels (
handshake=features-ok,queue_setup=not-attempted,queue_address_write=blocked,vring=not-owned,irq=not-owned,driver_ok=not-attempted,live_cloud=not-attempted) re-anchor those bounds. The feature is mutually exclusive withqemu,cloud_provider_cap_waiter_proof, andcloud_virtio_net_device_bringup_proof. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-prod-nic-driver-userspace-features-ok. No GCE resources are created.
- Authority delta: the handshake registers move from kernel-internal MMIO
to a userspace driver over the existing
-
Production cloud-boot evidence marker (
nic-driver-userspace-ownable-vring): the userspace-owned vring step of the Phase C NIC-driver relocation track. Under the focused-proof Cargo featurecloud_virtio_net_userspace_ownable_vring_proof(which implies the handshake featurecloud_virtio_net_userspace_features_ok_proof), thecap::devicemmio_grant_source_prodsource stages the writable common-config window andcap::dmapool_grant_source_prodstages a bounce-bufferDMAPoolon the same virtio-net function. The userspacecloud-prod-nic-driver-userspace-ownable-vring-smokeservice drives the handshake to FEATURES_OK, then allocates and owns its own virtqueue rings.- Authority delta: the queue-address-class registers move from
kernel-internal MMIO (the
virtio-net-tx-queue-materializationproof programs them in the kernel) to a userspace driver over the sameDeviceMmio.write32path. The write admission (device_manager::stub::write_devicemmio_u32under the feature,admit_virtio_queue_address_write) admitsqueue_select(0x16) andqueue_size(0x18) as range-checked pass-through selected writes, and the 64-bitqueue_desc(0x20) /queue_driver(0x28) /queue_device(0x30) base registers via a token-resolve selected write: the driver writes the opaque per-buffer device-usable handle it learned fromDMABuffer.info(deviceIova, scopebounce-handle), and the kernel resolves it against the liveDMAPoolgrant ledger (resolve_virtio_vring_device_address) to the real bounce host-physical address, programs that address (never the handle, never an address the driver authored), and read-back-asserts. Reads of the queue-address base registers (0x20..0x38) are refused inread_devicemmio_u32, so the resolved host-physical address is never exposed to userspace (host_physical_user_visiblestays 0).queue_enablestays fail-closed (it is armed by the queue-enable/DRIVER_OK capability below). - Reuses landed DMA isolation: the ring pages are manager-owned
DMAPoolbounce buffers under the landed scrub-before-free / owner+slot generation / quiesce-before-release discipline (kernel/src/device_dma.rs); the no-host-physical-exposure posture (host_physical_user_visible=0,iova_export=disabled-future-only) is unchanged. This capability is wiring, not a new isolation backend. The opaque device-usable handle is a deterministic, non-address encoding of the buffer’s manager-owned identity under a fixed tag, so it can never collide with a page-aligned host-physical address and carries no host-physical information. - Fail-closed assertions: the shim allocates its descriptor / available /
used ring pages, programs each handle, then proves a queue-address read is
refused and that an out-of-grant handle, a raw host-physical-looking value
(
0x40000000), and a stale (freed-buffer) handle each fail closed (result=write-blocked register_write=blocked) before any MMIO write. The releasedDeviceMmiocap fails closed on the next call. The headlinecloudboot-evidence: nic-driver-userspace-ownable-vring <token>marker lands only after every assertion passes. - capOS mapping: the userspace-owned vring step of the
Phase C userspace NIC driver relocation.
The marker’s trailing labels (
vring=userspace-owned,queue_address_programming=token-resolved,host_physical_user_visible=0,provider_visible_queue_address=hidden,iova_export=disabled-future-only,out_of_grant=blocked,host_physical=blocked,stale_generation=blocked,queue_enable=not-attempted,driver_ok=not-attempted,irq=not-owned,live_cloud=not-attempted) re-anchor those bounds. - QEMU-emulable vs hardware-only: fully QEMU-emulable (the bounce backend is
the probe-selected default without a guest IOMMU). Proved locally by
make run-cloud-prod-nic-driver-userspace-ownable-vring. No GCE resources are created.
- Authority delta: the queue-address-class registers move from
kernel-internal MMIO (the
-
Production cloud-boot evidence marker (
nic-driver-userspace-queue-enable-driver-ok): the userspace queue-enable / DRIVER_OK step of the Phase C NIC-driver relocation track. Under the focused-proof Cargo featurecloud_virtio_net_userspace_queue_enable_driver_ok_proof(which implies the ownable-vring featurecloud_virtio_net_userspace_ownable_vring_proof), the userspacecloud-prod-nic-driver-userspace-queue-enable-driver-ok-smokeservice drives the handshake to FEATURES_OK and programs its owned vring exactly as the ownable-vring capability does, then completes device bring-up from userspace: it arms its programmed TX queue and writesDRIVER_OK.- Authority delta: two more writes join the handshake/ownable-vring
selected-write admission, both under the same range-check + read-back
discipline. (1)
queue_enable(0x1c, u16): a range-checked pass-through selected write, admitted bydevice_manager::stub::write_devicemmio_u32(admit_virtio_queue_address_write) only when the active queue’s vring memory is live and page-fitting (selected_queue_ready_to_enable): the kernel reads the activequeue_desc/queue_driver/queue_deviceback kernel-side and requires each to currently hold the host-physical address of a live grantedDMABufferon this device (a freed buffer’s stale address no longer matches a live buffer, so it cannot arm a use-after-free DMA target), and requires the activequeue_sizeto fit every split-ring structure (16*sizedesc table,6+8*sizeused ring,6+2*sizeavail ring) inside one granted bounce page. An enable of an unprogrammed, freed, or oversized queue fails closed before any MMIO side effect; the enable is read-back-asserted. Once a queue is enabled its vring base registers are immutable – a queue-address repoint (even with an otherwise-valid live token) is refused (devicemmio-queue-address-immutable-after-enable) so the driver cannot mutate the vring under a running device. (2)DRIVER_OK(a bit in device-status 0x14): the device-status register is already writable (from the handshake capability), but settingDRIVER_OKis kernel-asserted – the kernel re-reads device-status and fails closed (devicemmio-driver-ok-not-observed) unless the device latched the fullACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OKbyte exactly (rejectingFAILED0x80,DEVICE_NEEDS_RESET0x40, and any reserved bit), so a userspace driver cannot claim a brought-up device the hardware did not accept. - Reuses landed DMA isolation: this capability adds no new register write
primitive, no new isolation backend, and no host-physical exposure. It reuses
the ownable-vring bounce /
DMAPool/DeviceMmiogrants and writable window unchanged; queue-address reads (0x20..0x38) stay refused (host_physical_user_visible=0). The enable binds to live, page-fitting, post-enable-immutable queue memory, so the device is never armed at a freed, oversized, or mutated vring. - Bounded residual (handled by the RX bring-up capability below): the
enable’s live + page-fit check
is point-in-time and matches by host-physical address, not buffer identity;
it does not pin the ring buffers against
freeBuffer/ process-teardown release while the queue is enabled. Both are use-after-free-DMA hazards only once a descriptor is posted and the doorbell rung – which this capability never does (frame_tx=not-attempted; the RX queue is never enabled; the TX queue is kick-driven), so no device DMA is reachable here and DMA stays confined to the granted bounce pool. Buffer-identity binding and pinning are the data path’s responsibility (vring_buffer_pinning=deferred-slice-4); tracked by the userspace RX/DMA task records. - Fail-closed assertions: the shim proves the ownable-vring out-of-grant /
host-physical / stale (freed-throwaway-buffer) queue-address writes fail
closed and an enable of the unprogrammed RX queue (index 0) fails closed,
then arms the programmed TX queue (
queue_enable=1,register_write=performed), setsDRIVER_OKand re-reads device-status to confirm the full brought-up byte, and proves a post-enable queue-address repoint (with an otherwise-valid live token) fails closed. The releasedDeviceMmiocap fails closed on the next call. The headlinecloudboot-evidence: nic-driver-userspace-queue-enable-driver-ok <token>marker lands only after every assertion passes. - capOS mapping: the userspace queue-enable / DRIVER_OK step of the
Phase C userspace NIC driver relocation.
The marker’s trailing labels (
vring=userspace-owned,queue_enable=performed,unprogrammed_queue_enable=blocked,device_brought_up=driver-ok,status_full=0f,driver_ok=observed,vring_live_bound=enforced,queue_size_fits_grant=enforced,post_enable_immutable=blocked,host_physical_user_visible=0,provider_visible_queue_address=hidden,frame_tx=not-attempted,nic_cap=not-implemented,irq=not-owned,live_cloud=not-attempted) re-anchor those bounds. TheNic-cap TX/RX round-trip (no frame crosses the wire here) and userspace IRQ ownership are later capabilities below. - QEMU-emulable vs hardware-only: fully QEMU-emulable (the bounce backend is
the probe-selected default without a guest IOMMU). Proved locally by
make run-cloud-prod-nic-driver-userspace-queue-enable-driver-ok. No GCE resources are created.
- Authority delta: two more writes join the handshake/ownable-vring
selected-write admission, both under the same range-check + read-back
discipline. (1)
-
Production cloud-boot evidence marker (
nic-driver-userspace-rx-bringup): the userspace RX-queue bring-up step of the Phase C NIC-driver relocation track. Undercloud_virtio_net_userspace_rx_bringup_proof(implies the queue-enable feature) thecloud-prod-nic-driver-userspace-rx-bringup-smokeservice brings up the RX virtqueue (index 0) over its own vring – the handshake/vring/enable capabilities above brought up only the TX queue (index 1); thequeue_enableadmission is queue-agnostic, so RX bring-up reuses it.- capOS mapping: the kernel
(
device_manager::stub) retains each programmed queue’s vring physes + originatingDMABufferhandle identity onProductionDeviceRecord(admit_virtio_queue_address_write), bindsqueue_enableto that identity (selected_queue_identity_bound; a freed/realloc’d handle fails closed withdevicemmio-queue-enable-identity-mismatch), and pins the ring buffers againstfreeBuffer/ process-teardown release while the queue is enabled (blocked_pinned_enabled_vring→dmabuffer-pinned-enabled-vring), released only on queue disable/reset with device quiesce. This closes the queue-enable capability’s pre-migration buffer-lifetime/identity residual at the bring-up boundary. Marker labels:rx_queue_brought_up=driver-ok,buffer_identity_bound=enforced,vring_buffer_pinning=enforced,pinning_free_while_enabled=blocked,int_injected=0,nic_cap=not-implemented,irq=not-owned,live_cloud=not-attempted. - First real RX DMA: the same feature also
drives the first real RX DMA from the shim-owned vring. The shim also
brings up TX queue 1 over its own vring, posts one device-writable RX
receive buffer on queue 0 (
DMABuffer.submitDescriptor), and rings the productionDeviceMmio.notifyDoorbell @5. capOS mapping: the kernel maps the notify region kernel-side and captures the per-queue notify slot offsets (cap::devicemmio_grant_source_prod), andprovider_notify_doorbell_write_for_cap(wasErr(stale_handle)) is now a live drive; the RX-DMA flow (cap::virtio_net_userspace_rx_dma_proof, byte-level vring helpers duplicated fromcap::virtio_net_polled_providerto protectrun-net) writes the RX descriptor + avail over the shim’s retained RX vring physes, rings the RX doorbell, submits a kernel-half ARP request over the shim’s retained TX vring physes, polls one real device->host completion (used_len > 0, observed EtherType0x0806), resets the device (quiescing both queues and releasing the ring-buffer pins viamark_retained_vring_queue_disabled), and latches the used-ring index. Completion stays kernel-latched used-ring polled (int_injected=0, noInterruptcap). Marker labels addtx_queue_brought_up=driver-ok,frame_rx=performed,rx_used_ring=kernel-latched. The kernel emits onevirtio-net-userspace-rx-dma: rx_dma=performed ... used_len=<n> ethertype=0x0806 device_reset=ok queues_cleared=ok int_injected=0evidence line. - Not yet implemented: the deterministic freed-then-reallocated-frame identity
negative (
identity_realloc_negative=deferred-needs-allocator-reuse-seam). Thecapos-libFrameBitmapis next-fit andfree_framedoes not rewindnext_hint, so the allocation after a free never returns the just-freed frame; a deterministic same-phys realloc (needed to reach the buffer-identity gate rather than the host-physical gate) requires an allocator reuse seam. Tracked bycloud-prod-nic-driver-userspace-rx-dma-identity-realloc-negative-local-proof. - Future work: the
Nic-cap round-trip (the next capability below, unblocked by this data path) and userspace IRQ ownership. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-prod-nic-driver-userspace-rx-bringup. No GCE resources are created.
- capOS mapping: the kernel
(
-
Production cloud-boot evidence marker (
nic-driver-userspace-nic-cap-roundtrip): theNic-cap round-trip step of the Phase C NIC-driver relocation track. It implements the handshake-stepNicinterface stub as a liveCapObject. Under thecloud_virtio_net_userspace_nic_cap_roundtrip_prooffeature (implies the RX-bring-up feature) thecloud-prod-nic-driver-userspace-nic-cap-roundtrip-smokeservice brings the device fully up from userspace (RX queue 0 + TX queue 1 enabled, DRIVER_OK), then holds a typedNiccap and round-trips two sequential frames. capOS mapping:- The new
nicKernelCapSource(registered incapos-configmanifest.rslib.rs::NIC_INTERFACE_ID+ thenic @49schema/capos.capnpKernelCapSourceenum value; clientNicClientincapos-rt) is granted fromcap::nic_grant_source_prod, which maps the picked virtio-net function’s device-config window kernel-side formacAddress/linkStatusand binds theNiccap to that BDF.
transmit/receivedrive the shim’s retained vring physes throughcap::virtio_net_userspace_rx_dma_proof::{nic_transmit, nic_receive}(reusing the RX-bring-up byte-level vring helpers) with manager-owned kernel bounce payloads – not a shim-submittedDMABuffer– so a frame crosses the cap boundary as inlineDatawithhost_physical_user_visible = 0and no device-usable handle exported.receivedrives the coupled ARP-TX-stimulus + RX-poll and returns the frame inline + observed EtherType;transmitstages a frame into a manager-owned TX page and ringsnotify_doorbell @5.- The device is left live for the cap’s lifetime (a monotonic per-queue avail
cursor lets
transmitandreceivecompose without re-enabling) and quiesced once on cap release (nic_quiesce: device reset + queues-cleared assertion +mark_retained_vring_queue_disabledto release the enabled-vring pins). Completion stays kernel-latched used-ring polled (int_injected = 0, noInterruptcap); no new selected-write register beyond the landed handshake / ownable-vring / queue-enable set. The kernel emits twovirtio-net-userspace-nic-cap: receive ... used_len=<n> ethertype=0x0806 int_injected=0 host_physical_user_visible=0evidence lines and avirtio-net-userspace-nic-cap: quiesce ... device_reset=ok queues_cleared=okline on release. The proof also covers lifecycle ordering: aDMAPoolcap release while ring buffers are still live recordspending-buffer-release, an early release of one pinned ringDMABufferrecordsdmabuffer-pinned-enabled-vring,Nicquiesce replays that buffer detach after the queues are reset, and the pending parent pool release completes only after the remaining ring buffers are freed. - Future work: the clean independent TX/RX split and userspace IRQ ownership (both later capabilities below).
- QEMU-emulable vs hardware-only: fully QEMU-emulable (the RX reply is QEMU
SLIRP’s ARP answer to the kernel-half stimulus). Proved locally by
make run-cloud-prod-nic-driver-userspace-nic-cap-roundtrip. No GCE resources are created.
- The new
-
Production cloud-boot evidence marker (
nic-driver-userspace-irq-ownership): the userspace RX-interrupt-lifecycle ownership step of the Phase C NIC-driver relocation track. It gives the userspace NIC driver real RX-interrupt-lifecycle ownership. TheNic-cap round-trip capability above hasint_injected = 0and noInterruptcap on the data path; this capability adds a realInterruptcap whosewait/acknowledge/mask/unmaskdrive the route’s MSI-X vector-control + deferred LAPIC EOI (the frame bytes still arrive viaNic.receive’s used-ring read). Under thecloud_virtio_net_userspace_irq_ownership_prooffeature (implies thenic-cap-roundtripfeature) thecloud-prod-nic-driver-userspace-irq-ownership-smokeservice holds aDeviceMmio+DMAPool+Nic+Interruptcap on the same virtio-net function. capOS mapping:- A new
Interruptgrant source (cap::virtio_net_userspace_irq_ownership_proof) replaces the admission-onlyinterrupt_grant_source_prodsource via theKernelCapSource::Interruptarm under this feature. At boot it programs the staged virtio-net function’s RX MSI-X route (table entry 0) mask-first through the landed always-builtcap::interrupt_programmed::program_attach_arm_unmask(route register / claim / MSI-X table map+write / device-manager attach / deferred-LAPIC-EOI arm / unmask) and tears it down (teardown) on cap release. - The
Interruptcap’s methods are real for this device RX route:waitblocks on a real interrupt dispatch through the route’s MSI-X / LAPIC dispatch slot (device_interrupt::wait_kernel_injected_dispatch;delivery_countadvances, soint_injectedflips from 0 – theNic-cap round-trip capability had noInterruptcap on the data path at all). The wake is a bounded kernel-injected dispatch (not yet a device-autonomous raise causally tied to a frame), andNic.receivestill reads the frame bytes from the used ring, so the delta is IRQ-lifecycle ownership (realwait/acknowledge/mask/unmask), not interrupt-coalesced RX completion.acknowledgeretires exactly one deferred LAPIC EOI throughdevice_interrupt::acknowledge_deferred_lapic_eoi_for_route(hardwareDispatchAckDelta = 1, the one-ack-per-delivery /hardware_eoi_deltainvariant);mask/unmasktoggle the route’s own MSI-X vector-control bit (mask-first per PCI 3.0 §6.8.2: table-mask then route-state on mask; route-state then table-unmask on unmask) throughpci::set_msix_table_entry_mask+device_interrupt::{mask,unmask}_device_manager_attached_route(driver-unmasked<->claimed-masked). - The driver brings the device up from userspace (the
nic-cap-roundtripbring-up verbatim), drives the owned RX-interrupt lifecycle (info/wait/acknowledge/mask/unmask/release), and reads the completed frame back throughNic.receive(inlineData,host_physical_user_visible = 0). The PCI function-level MSI-X enable bit is not toggled and no device-autonomous raise is attempted (device_autonomous_raise=not-attempted,waiter_wake=kernel-injected-dispatch); the landed DMA isolation, the owned-vring grants, and buffer-identity / ring-buffer pinning are reused unchanged (queue-address reads still refused). No newInterruptinterface or method. - Future work: the clean independent TX/RX split (the next capability below);
the device-autonomous MSI-X raise (program the device RX
queue_msix_vector+ clear the PCI function mask) and the smoltcp network-stack relocation. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-prod-nic-driver-userspace-irq-ownership(onecloudboot-evidence: nic-driver-userspace-irq-ownership <token>marker). No GCE resources are created.
- A new
-
Production cloud-boot evidence marker (
nic-driver-userspace-clean-tx-rx-split): the independent-TX/RX step of the Phase C NIC-driver relocation track. It decouples the last data-path coupling – the userspace NIC driver’sNic.transmitandNic.receivebecome truly independent. In thenic-cap-roundtrip/ IRQ-ownership capabilities,Nic.receive(virtio_net_userspace_rx_dma_proof::nic_receive) self-stimulated by submitting a kernel-half ARP TX over the retained TX vring inside the same call. Under thecloud_virtio_net_userspace_clean_tx_rx_split_prooffeature (implies theirq-ownershipfeature) theNiccap’sreceive @1dispatches instead tonic_receive_independent. capOS mapping:nic_receive_independentposts a manager-owned device-writable RX buffer on the retained RX vring, rings the RX doorbell, waits on the driver’s OWNED RX interrupt route (the IRQ-ownershipdevice_interrupt::wait_kernel_injected_dispatchdispatch slot, resolved throughvirtio_net_userspace_irq_ownership_proof::owned_rx_route;int_injectedflips from 0), retires the deferred LAPIC EOI (acknowledge_deferred_lapic_eoi_for_route), then polls the RX used ring and reads the completed frame – with no internal ARP-TX self-stimulus (it never submits to the TX vring; the kernel diagnostic reportstx_submissions=0 self_stimulus=removed).- The RX frame is driven by an external stimulus: the consumer’s preceding
independent
Nic.transmitof a real broadcast ARP request (who-has the QEMU SLIRP gateway 10.0.2.2). SLIRP answers; the inbound reply is held in the host net queue untilreceiveposts the RX buffer + kicks the RX queue. Nic.transmitstays independent: it submits the caller’s frame to the TX vring and rings the TX doorbell with no RX involvement (the kernel diagnostic reportsrx_polls=0 rx_submissions=0). Neither call performs the other’s submission.- The wake stays the bounded kernel-injected dispatch the IRQ-ownership
capability owns
(
waiter_wake=kernel-injected-dispatch,device_autonomous_raise=not-attempted). The landed owned-vring / owned-IRQ / DMA-isolation, the writable common-config window, and the buffer-identity / ring-buffer pinning are reused unchanged: no new selected-write register, no new MSI-X surface, no newNic/Interruptmethod, no host-physical / handle exposure (host_physical_user_visible = 0, queue-address reads refused). - Follow-up work: DHCP/IPv4 configuration, legacy kernel socket-path
retirement, kernel
smoltcp/ virtio-net hot-path removal, and the device-autonomous MSI-X raise. The 7c-ii(b) serve-from-userspace local proof is now landed. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-prod-nic-driver-userspace-clean-tx-rx-split(onecloudboot-evidence: nic-driver-userspace-clean-tx-rx-split <token>marker). No GCE resources are created.
-
Production cloud-boot evidence marker (
nic-driver-userspace-sustained-receive-pool): Phase C slice 7d (DONE 2026-06-04) adds the sustained-receiveNicABI the multi-frame TCP path (7c-iii) needs. The landedreceive @1is single-frame + reset-on-empty-poll; this adds a non-resetting poll over a kernel-owned bounce RX pool. Under thecloud_virtio_net_userspace_sustained_receive_pool_prooffeature (implies the clean-split feature) theNiccap servesreceivePoll @4(cap::nic_grant_source_prod->virtio_net_userspace_rx_dma_proof::nic_receive_poll). capOS mapping:- Arm. On first
receivePollthe kernel allocatesNIC_RX_POOL_SIZEmanager-owned bounce RX frames (frame::alloc_frame_zeroed), posts one device-writable descriptor + avail entry per frame on the retained RX vring, publishesavail.idx, and rings the RX doorbell. The device masters only into these kernel-private pages; no host-physical or device-usable address is exported (host_physical_user_visible = 0). - Drain one per poll. Each
receivePollre-kicks the RX doorbell (so QEMU flushes a queued inbound frame into an armed buffer during the MMIO VM exit) and reads the RX used ring. If it advanced, the kernel copies the frame out into the inlineDatareply (bounded by the posted buffer length) and recycles that bounce slot. - The per-buffer invariant replacing reset-before-reclaim. A bounce slot is
re-exposed to the device only after its copy-out completes and its slot
generation is bumped, with the slot scrubbed before the re-post – the
production handle-epoch slot identity (
docs/dma-isolation-design.md) applied at recycle granularity instead of device-reset granularity. The device is not reset per frame (device_reset=none); teardown (on_releasevianic_quiesce, or an unprovable in-flight-DMA error) still quiesces (reset + queues cleared) and scrubs + frees the pool. - No frame yet. If the used ring did not advance,
receivePollreturnsframePresent = falsewith no reset and the device stays armed (device_armed=true) – the cheap speculative poll asmoltcpphy::DeviceRX token needs.receive @1semantics are unchanged. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-prod-nic-driver-userspace-sustained-receive-pool(onecloudboot-evidence: nic-driver-userspace-sustained-receive-pool <token>marker after draining more than one frame with at least one non-resetting empty poll). No GCE resources are created. - Follow-up work: DHCP/IPv4 configuration consumes the served socket path;
later cleanup removes or fixture-gates the legacy kernel socket path and
kernel
smoltcp/ virtio-net hot path. The 7c-ii(b) production manifest proof now consumes the userspace-servedTcpListenAuthority,TcpListener, andTcpSocketsubstrate locally.
- Arm. On first
-
Production cloud-boot evidence marker (
network-stack-process-smoltcp-skeleton): Phase C slice 7a (first increment, DONE 2026-06-03) is the first time a real TCP/IP stack runs outside the kernel over the relocated NIC authority. A userspace network-stack process builds ansmoltcpInterface(Ethernet medium, MAC fromNic.macAddress, static IPv410.0.2.15/24) over aphy::Deviceadapter whose RX/TX is the slice-6 independentNic.receive/Nic.transmit, clocked by theTimercap monotonic source (monotonic_ns). capOS mapping:- The
phy::Deviceadapter is buffered: outbound framessmoltcpproduces queue in a process-localVecthat the poll loop drains and submits viaNic.transmit; one inbound frame fetched viaNic.receiveis handed back forsmoltcpto consume. The adapter holds no vring, DMA handle, or host-physical address – every frame is a process-local byte buffer crossing the cap boundary as inlineDatathrough the manager-owned bounce page (host_physical_user_visible = 0). - The proof is that
smoltcp– not hand-rolled frame code – drives the exchange: a UDP datagram queued to the on-link gateway makessmoltcpemit an ARP request (out throughNic.transmit), the SLIRP ARP reply is consumed (in throughNic.receive, EtherType 0x0806), and – with the neighbour now resolved –smoltcpemits the queued IPv4/UDP datagram, so the neighbour cache observably advances (smoltcp_tx_arp>=1,smoltcp_rx_consumed>=1,smoltcp_tx_ipv4>=1). The internalsmoltcpUDP socket is only an egress stimulus; no socket capability is exposed. - Implementation note: the landed
Niccap rides on the userspace driver shim’s retained vring (the kernel does not own the vring), so the skeleton process performs the slice-1-6 bring-up itself before runningsmoltcp. Relocating the bring-up to a separate long-lived NIC-driver service is folded into the slice-7c contract relocation. - Out of scope: the socket caps (slice 7b), the
cap/network.rscontract relocation (slice 7c,virtio_stub.rsstays fail-closed), and the kernelsmoltcp/ virtio-net removal (slice 8). - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-prod-network-stack-process-smoltcp-skeleton(onecloudboot-evidence: network-stack-process-smoltcp-skeleton <token>marker). No GCE resources are created.
- The
-
Production cloud-boot evidence marker (
network-stack-smoltcp-socket-caps): Phase C slice 7b (DONE 2026-06-03) adds a userspaceUdpSocketcap layer on top of the slice-7a substrate: the userspace network-stack process now implements theUdpSocketschema’ssendTo/recvFromsemantics (UdpSocketCapLayer) over the samesmoltcpInterfaceand proves one bounded UDP request/response through it. capOS mapping:- The socket layer drives the slice-7a
phy::Device/Nicpump:sendToresolves the destination’s on-link ARP (oneNic.receivefor the guaranteed ARP reply, EtherType 0x0806) and transmits the datagram throughNic.transmit;recvFromfetches the single solicited reply datagram throughNic.receive(EtherType 0x0800) and returns it. Frames stay process-local byte buffers (host_physical_user_visible = 0); a queue-address read stays refused. - The request/response is a DNS A query for
example.comto SLIRP’s built-in resolver at10.0.2.3:53(the same resolver the Cposix-dns-resolversmoke uses); the decoded response is returned throughrecvFromand the proof asserts source10.0.2.3:53, the transaction-id/QR/RCODEcorrelation, and a decoded A record. The landedNic.receiveresets the device on an empty poll, so the proof only receives when a reply is guaranteed pending and spaces aTimerpre-delay before the datagram receive. - Honest boundary: the socket layer is in-process – it implements the
socket interface semantics over the userspace stack but does not yet serve
them as inter-process transferable capabilities, and it does not touch
the production
kernel/src/cap/network.rscontract (virtio_stub.rsstays fail-closed). Preserving that contract behind a userspace network-stack service is slice 7c. - QEMU-emulable vs hardware-only: fully QEMU-emulable (relies on SLIRP’s DNS
forwarder). Proved locally by
make run-cloud-prod-network-stack-smoltcp-socket-caps(onecloudboot-evidence: network-stack-smoltcp-socket-caps <token>marker). No GCE resources are created.
- The socket layer drives the slice-7a
-
Production cloud-boot evidence marker (
userspace-network-stack-smoltcp): Phase C slice 7c, first increment (DONE 2026-06-03) serves the slice-7bUdpSocketCapLayeras a real inter-process transferable capability. capOS mapping:- A network-stack server process holds the bring-up caps plus an exported
Endpoint; after bring-up it serves theUdpSocketschema (sendTo/recvFrom/close) over that endpoint, driving the sameUdpSocketCapLayeron its own ring (decoding/encoding the capnp params and results). A separate client process holds onlyConsoleand the served cap; it re-interprets the importedEndpointas aUdpSocketand drives one bounded DNS A query/response through the productionUdpSocketClient. smoltcpstill moves every frame through theNiccap (ARP reply EtherType 0x0806 + DNS reply 0x0800 throughNic.receive);host_physical_user_visible = 0is preserved and a queue-address read stays refused. Onclosethe server releases its owned RXInterrupt(route_torn_down=ok).- Honest boundary: the
UdpSocketcontract lives behind a userspace network-stack service. Later Phase C increments added theTcpListener/TcpSocketsubstrate, inter-process serving, and the local 7c-ii(b) serve-from-userspace manifest proof forTcpListenAuthority. DHCP/IPv4, Web UI L4, private GCE reachability, public ingress, and legacy kernel-socket cleanup remain separate work. - QEMU-emulable vs hardware-only: fully QEMU-emulable (relies on SLIRP’s DNS
forwarder). Proved locally by
make run-cloud-prod-network-stack-smoltcp-udp-socket-cap-ipc(onecloudboot-evidence: userspace-network-stack-smoltcp <token>marker). No GCE resources are created.
- A network-stack server process holds the bring-up caps plus an exported
-
Production cloud-boot evidence marker (
virtio-net-tx-authority-bundle): under the focused-proof Cargo featurecloud_virtio_net_tx_authority_bundle_proof, the cloudboot kernel layers a bundle observer (cap::virtio_net_tx_authority_bundle_proof) on top of the three existing production grant sources (devicemmio_grant_source_prod,dmapool_grant_source_prod,interrupt_grant_source_prod). Under the feature, the DeviceMmio source filters its PCI candidate to the same virtio/NVMe-class function the DMAPool and Interrupt sources already match, so all three grants land on the same virtio-net function. Exposed throughmake run-cloud-provider-virtio-net-tx-authority-bundle.- Implemented wire-format subset: no new MMIO/DMA/IRQ writes. The
bundle reuses the existing prod sources’ grant + per-cap on-release
surfaces and asserts the bundle identity over their issue and release
notifications via the
record_devicemmio_grant/record_dmapool_grant/record_interrupt_grant/record_devicemmio_release/record_dmapool_release/record_interrupt_releasehooks called from the existing build_cap_for_grant / on_release / release_cap paths. - Fail-closed assertions: the userspace
cloud-provider-virtio-net-tx-authority-bundle-smokeservice callsinfoon each of the three caps and asserts they all report the same BDF. The kernel-side bundle observer records each grant’s(bdf, generation)identity at issue and at release; the headlinecloudboot-evidence: virtio-net-tx-authority-bundle <token>marker is emitted only after all three caps have been issued and released andsame_dm/same_dp/same_ir/same_bdfall evaluate true. A BDF mismatch logsvirtio-net-tx-authority-bundle: assertion regression: ...and leaves the marker unprinted. Per-cap stale-handle fail-closed is inherited from the existing prod sources’validate_*_recordpaths; the smoke re-tests it explicitly after each release. - capOS mapping: bundle authority composition over the
DeviceMmio+DMAPool+Interruptgrant arms; first child of the blockedcloud-prod-virtio-net-userspace-provider-tx-local-proofparent. No virtio queue setup, no descriptor publication, no notify doorbell, no PCI function-level MSI-X enable, noInterrupt.wait, no TX completion claim, no live cloud traffic. The marker’s trailing labels (same_bdf=true,queue_setup=not-attempted,tx_descriptor=not-published,notify=not-rung,msix_function_enable=not-toggled,tx_completion=not-claimed,live_cloud=not-attempted) re-anchor those bounds. The bundle feature is mutually exclusive withqemu,cloud_provider_cap_waiter_proof, andcloud_virtio_net_device_bringup_proofat thecap::mod.rsactivation site. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally
by
make run-cloud-provider-virtio-net-tx-authority-bundle. No GCE resources are created.
- Implemented wire-format subset: no new MMIO/DMA/IRQ writes. The
bundle reuses the existing prod sources’ grant + per-cap on-release
surfaces and asserts the bundle identity over their issue and release
notifications via the
-
Production cloud-boot evidence marker (
virtio-net-tx-queue-materialization): under the focused-proof Cargo featurecloud_virtio_net_tx_queue_materialization_proof, the cloudboot kernel runscap::virtio_net_tx_queue_materialization_proof(kernel/src/cap/virtio_net_tx_queue_materialization_proof.rs) over the same virtio-net function the authority bundle picks. The proof materializes one manager-owned TX virtqueue: it allocates three zeroed physical frames from the kernel frame allocator, programs the TX queue’s common-configurationQUEUE_DESC/QUEUE_DRIVER/QUEUE_DEVICE+QUEUE_ENABLE = 1, asserts the device read-backs match the manager-authored host-physical addresses, then writes 0 todevice_statusand asserts every TX queue-state register has cleared to 0. Exposed throughmake run-cloud-provider-virtio-net-tx-queue-materialization.- Spec basis: virtio 1.2 §2.1.2 (reset clears all virtqueue state), §2.7 (split-ring queue layout), §4.1.4.3 (common configuration queue registers), §5.1.2 (virtio-net advertises receiveq1=0, transmitq1=1).
- Implemented wire-format subset: the proof drives the modern virtio
status sequence through reset / ACK / DRIVER / feature select
(
VIRTIO_F_VERSION_1only) / FEATURES_OK, assertsCOMMON_NUM_QUEUES >= 2, writesCOMMON_QUEUE_SELECT = 1(TX), readsCOMMON_QUEUE_SIZE, clamps to a power-of-two bound (MAX_QUEUE_SIZE = 256, so each region fits in one 4 KiB frame), allocates desc/avail/used frames throughmem::frame::alloc_frame_zeroed, programsCOMMON_QUEUE_DESC/COMMON_QUEUE_DRIVER/COMMON_QUEUE_DEVICEwith the resulting host-physical addresses +COMMON_QUEUE_ENABLE = 1, reads every queue register back through theMmioRegionaccessors (the proof grew aread_u64companion to the existingwrite_u64) and asserts the values match, setsDRIVER_OK, then writes 0 todevice_statusand asserts post-resetCOMMON_QUEUE_ENABLE/..._DESC/..._DRIVER/..._DEVICEare all 0 after re-selecting queue 1. Token grammar:<seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.<b>-len.<hex>-q.<index>-size.<u>-desc.<hex>-drv.<hex>-dev.<hex>. - Fail-closed assertions: five inline assertions gate the marker.
(1) Initial reset reads back as 0. (2) Negotiated feature set
matches exactly
VIRTIO_F_VERSION_1. (3) Post-DRIVER_OK status reads back withACK|DRIVER|FEATURES_OK|DRIVER_OKset andFAILEDclear. (4) Programmed queue addresses + enable read back exactly as written. (5) Post-reset re-read of the TX queue state reports every queue-state register cleared to 0. The proof wraps the materialization so every exit path (success or any intermediate failure) writes 0 todevice_statusand frees every allocated frame back to the bitmap before returning. Per- stage outcomes log on thevirtio-net-tx-queue-materialization: ok .../... failed closed: ...lines so a regression trips the boot log alongside the missing marker. - capOS mapping: focused-proof child of the TX authority bundle
that extends the proven bundle composition with one round of
real common-configuration queue setup + reset cleanup. The same
boot still spawns the
cloud-provider-virtio-net-tx-authority-bundle-smokeuserspace service, which receives the bundle of caps over the same BDF and asserts same-BDF + per-cap stale-handle from userspace; the bundle observer compiles in (the picker filteris_bundle_candidate_classfires under either feature) so every grant + release identity still pairs up for the debug trail, but the bundle’s headline marker is intentionally suppressed under this feature because itsqueue_setup=not-attemptedclaim would be inaccurate now. The queue-materialization marker’s trailing labels (tx_descriptor=not-published,notify=not-rung,msix_function_enable=not-toggled,tx_completion=not-claimed,provider_visible_queue_address=hidden,iova_export=disabled-future-only,live_cloud=not-attempted) re-anchor the bounds the descendant slices (descriptor publication, notify doorbell, MSI-X function enable, userspace submit, used-ring polling, live cloud) carry. Thecap::virtio_net_tx_queue_materialization_proofcaller is mutually exclusive withqemu,cloud_provider_cap_waiter_proof,cloud_virtio_net_device_bringup_proof, andcloud_virtio_net_tx_authority_bundle_proofat thecap::mod.rsactivation site. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally
by
make run-cloud-provider-virtio-net-tx-queue-materialization. No GCE resources are created.
-
Production cloud-boot evidence marker (
virtio-net-rx-queue-materialization): under the focused-proof Cargo featurecloud_virtio_net_rx_queue_materialization_proof, the cloudboot kernel runscap::virtio_net_rx_queue_materialization_proof(kernel/src/cap/virtio_net_rx_queue_materialization_proof.rs) over the same virtio-net function the authority bundle picks. It is the structural mirror of the TX queue-materialization proof, one virtqueue index over: it materializes one manager-owned RX virtqueue (queue index 0) instead of the TX virtqueue (queue index 1). The proof allocates three zeroed physical frames from the kernel frame allocator, programs the RX queue’s common-configurationQUEUE_DESC/QUEUE_DRIVER/QUEUE_DEVICE+QUEUE_ENABLE = 1, asserts the device read-backs match the manager-authored host-physical addresses, then writes 0 todevice_statusand asserts every RX queue-state register has cleared to 0. Exposed throughmake run-cloud-provider-virtio-net-rx-queue-materialization.- Spec basis: virtio 1.2 §2.1.2 (reset clears all virtqueue state), §2.7 (split-ring queue layout), §4.1.4.3 (common configuration queue registers), §5.1.2 (virtio-net advertises receiveq1=0, transmitq1=1).
- Implemented wire-format subset: identical to the TX
queue-materialization proof except it writes
COMMON_QUEUE_SELECT = 0(RX) instead of1(TX) and re-selects queue 0 for the post-reset read-back. The proof drives the modern virtio status sequence through reset / ACK / DRIVER / feature select (VIRTIO_F_VERSION_1only) / FEATURES_OK, assertsCOMMON_NUM_QUEUES >= 2, readsCOMMON_QUEUE_SIZE, clamps to a power-of-two bound (MAX_QUEUE_SIZE = 256, so each region fits in one 4 KiB frame), allocates desc/avail/used frames throughmem::frame::alloc_frame_zeroed, programsCOMMON_QUEUE_DESC/COMMON_QUEUE_DRIVER/COMMON_QUEUE_DEVICE+COMMON_QUEUE_ENABLE = 1, reads every queue register back through theMmioRegionaccessors and asserts the values match, setsDRIVER_OK, then writes 0 todevice_statusand asserts post-resetCOMMON_QUEUE_ENABLE/..._DESC/..._DRIVER/..._DEVICEare all 0 after re-selecting queue 0. Token grammar:<seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.<b>-len.<hex>-q.<index>-size.<u>-desc.<hex>-drv.<hex>-dev.<hex>(withq.0for RX). - Fail-closed assertions: the same five inline assertions gate the
marker as in the TX proof — initial reset reads back 0; negotiated
feature set is exactly
VIRTIO_F_VERSION_1; post-DRIVER_OK status hasACK|DRIVER|FEATURES_OK|DRIVER_OKset andFAILEDclear; programmed queue addresses + enable read back exactly as written; post-reset re-read of the RX queue state reports every queue-state register cleared to 0. The proof wraps the materialization so every exit path (success or any intermediate failure) writes 0 todevice_statusand frees every allocated frame back to the bitmap before returning. Per-stage outcomes log on thevirtio-net-rx-queue-materialization: ok .../... failed closed: ...lines. - capOS mapping: focused-proof sibling of the TX
queue-materialization proof that drives the same kernel-side queue
setup + reset cleanup against the receive virtqueue. The same boot
still spawns the
cloud-provider-virtio-net-tx-authority-bundle-smokeuserspace service, which receives the bundle of caps over the same BDF and asserts same-BDF + per-cap stale-handle from userspace; the bundle observer compiles in through its shared cfg gate so every grant + release identity still pairs up for the debug trail, but the bundle’s headline marker is intentionally suppressed under this feature (it is gated oncloud_virtio_net_tx_authority_bundle_proof, which this feature does not enable) because itsqueue_setup=not-attemptedclaim would be inaccurate now. The RX-queue-materialization marker’s trailing labels (rx_buffer=not-posted,avail=not-published,notify=not-rung,rx_completion=not-claimed,msix_function_enable=not-toggled,provider_visible_queue_address=hidden,iova_export=disabled-future-only,device_autonomous_raise=not-claimed,live_cloud=not-attempted) re-anchor the bounds the descendant slices (receive-buffer post, avail publication, notify doorbell, used-ring consumption, MSI-X function enable, live cloud) carry. Thecap::virtio_net_rx_queue_materialization_proofcaller is mutually exclusive withqemu,cloud_provider_cap_waiter_proof,cloud_virtio_net_device_bringup_proof,cloud_virtio_net_tx_authority_bundle_proof, and every TX proof feature at thecap::mod.rsactivation site. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally
by
make run-cloud-provider-virtio-net-rx-queue-materialization. No GCE resources are created. Live-GCE RX stays undercloud-gcp-virtio-net-nic-driver.
-
Production cloud-boot evidence marker (
virtio-net-rx-buffer-post): under the focused-proof Cargo featurecloud_virtio_net_rx_buffer_post_polled_completion_proof(which impliescloud_virtio_net_rx_queue_materialization_proofso the bundle observer- production grant-source pickers + userspace bundle smoke keep their
plumbing), the cloudboot kernel runs
cap::virtio_net_rx_buffer_post_polled_completion_proof(kernel/src/cap/virtio_net_rx_buffer_post_polled_completion_proof.rs) over the same virtio-net function the authority bundle picks. It is the RX analogue of the TX submit-doorbell -> polled-completion progression: it materializes the RX virtqueue (queue index 0) AND the TX virtqueue (queue index 1, the SLIRP stimulus path), setsDRIVER_OK, posts ONE manager-owned device-writable receive buffer to the RX avail ring, rings the RX notify doorbell once, fills and TX-submits ONE broadcast ARP request as the SLIRP stimulus, rings the TX notify doorbell once, then polls the manager-owned RX used ring with a bounded spin budget untilused.idx == 1and assertsused[0].id == 0andused[0].len > 0— ONE real device->host RX DMA landed in the manager-owned bounce page. Exposed throughmake run-cloud-provider-virtio-net-rx-buffer-post.
- Spec basis: virtio 1.2 §2.1.2 (reset clears virtqueue state), §2.7
(split-ring queue layout), §2.7.6 (available ring:
flags @0,idx @2,ring @4), §2.7.7 (VIRTQ_AVAIL_F_NO_INTERRUPT), §2.7.8 (used ring layout), §4.1.4.3 (common configuration queue registers), §4.1.5.2 (notify doorbell), §5.1.2 (virtio-net advertises receiveq1=0, transmitq1=1), §5.1.6 (12-byte modern virtio-net header). - Implemented wire-format subset: stages 1-9 materialize the RX queue
(index 0) and the TX queue (index 1) identically to the queue-
materialization proof (modern status sequence to FEATURES_OK,
VIRTIO_F_VERSION_1only,COMMON_NUM_QUEUES >= 2, clampCOMMON_QUEUE_SIZEto a power of two<= MAX_QUEUE_SIZE = 256, allocate desc/avail/used frames per queue, program the per-queue registers +QUEUE_ENABLE = 1, read-back assert), thenDRIVER_OK. The RX-DMA delta authors RX descriptor slot 0 over the HHDM (addr = rx_payload_phys,len = 2048,flags = VIRTQ_DESC_F_WRITE,next = 0), sets the RX avail ring (flags = VIRTQ_AVAIL_F_NO_INTERRUPT,ring[0] = 0, release fence,idx = 1), maps the modern notify region bounded to the smallest page covering both per-queue notify slots, and rings the RX-queue notify doorbell. The stimulus mirrorsvirtio.rs::write_arp_request_frame: it fills one TX payload frame with a broadcast ARP request for the SLIRP gateway IP (10.0.2.2), authors TX descriptor slot 0 (flags = 0, device-readable), sets the TX avail ring (alsoVIRTQ_AVAIL_F_NO_INTERRUPT), and rings the TX-queue notify doorbell. The proof then polls the RXused.idxwith a bounded spin budget (POLL_USED_RING_BUDGET = 50_000_000, an order of magnitude above the in-kernelARP_RX_POLL_LIMIT = 500_000) and reads the device-authoredused[0].(id, len)plus the observed EtherType. Token grammar:<seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.<b>-len.<hex>-q.<index>-size.<u>-rxnotify.bar.<b>.off.<hex>.mult.<u>.addr.<hex>-rxdesc.<hex>-rxdrv.<hex>-rxdev.<hex>-rxpay.<hex>-rxlen.<u>-availidx.<u>-usedidx.<u>-usedid.<u>-usedlen.<u>-ethertype.<hex>. - Fail-closed assertions: the queue-materialization assertions gate
both queue setups (initial reset reads 0; negotiated features exactly
VIRTIO_F_VERSION_1; post-DRIVER_OKstatus hasACK|DRIVER|FEATURES_OK|DRIVER_OKset andFAILEDclear; programmed queue addresses + enable read back exactly as written). The RX-DMA delta adds: the polledused.idxreaches1within the spin budget (else fail closed, no marker); the post-completion RXavail.idxreads back1;used[0].id == 0(the published descriptor head);used[0].len > 0(a real device->host frame); and the post-reset re-read of both the RX and TX queue-state registers reports every register cleared to 0. The proof resets the device on every exit path (success or any intermediate failure) and frees the eight manager-owned frames (RX desc/avail/used, TX desc/avail/used, RX payload, TX payload) only after a confirmed reset read-back of 0; if reset cannot be confirmed the frames stay retained so the device cannot DMA into a freed page. Per-stage outcomes log on thevirtio-net-rx-buffer-post: ok .../... failed closed: ...lines. - capOS mapping: focused-proof child of the RX queue-materialization
proof that drives the first real RX DMA on the production cloud
kernel. Completion is polled only: MSI-X stays disabled, no MSI-X
table entry is programmed, no
Interruptwaiter is installed, no dispatch slot is claimed, and both avail rings carryVIRTQ_AVAIL_F_NO_INTERRUPTso the device does not raise a queue-completion interrupt. The same boot still spawns thecloud-provider-virtio-net-tx-authority-bundle-smokeuserspace service, which receives the bundle of caps over the same BDF and asserts same-BDF + per-cap stale-handle from userspace; both companion headline markers (virtio-net-tx-authority-bundleandvirtio-net-rx-queue-materialization) are intentionally suppressed under this feature because this proof is the new headline owner. The marker’s trailing labels (rx_buffer=posted,avail=published,notify=rung-once,rx_completion=polled-used-ring,msix_rx_function_enable=not-toggled,msix_table_write=not-performed,device_autonomous_raise=not-claimed,provider_visible_queue_address=hidden,provider_rx_submit=kernel-proxy-bounded,iova_export=disabled-future-only,live_cloud=not-attempted) re-anchor the bounds the descendant slices (RX MSI-X wait/ack, provider-driven RX submit, live cloud) carry. Thecap::virtio_net_rx_buffer_post_polled_completion_proofcaller is mutually exclusive withqemu,cloud_provider_cap_waiter_proof,cloud_virtio_net_device_bringup_proof,cloud_virtio_net_tx_authority_bundle_proof, and every TX proof feature at thecap::mod.rsactivation site (inherited through the implied RX-materialization feature’scompile_error!s). - QEMU-emulable vs hardware-only: fully QEMU-emulable; the SLIRP
-netdev userbackend delivers the ARP reply that drives the RX DMA. Proved locally bymake run-cloud-provider-virtio-net-rx-buffer-post. No GCE resources are created. Live-GCE RX stays undercloud-gcp-virtio-net-nic-driver.
- production grant-source pickers + userspace bundle smoke keep their
plumbing), the cloudboot kernel runs
-
Production cloud-boot evidence marker (
virtio-net-msix-function-enable): under the focused-proof Cargo featurecloud_virtio_net_msix_function_enable_proof(which impliescloud_virtio_net_tx_queue_materialization_proofso the bundle observer- production grant-source pickers + userspace bundle smoke keep their
plumbing), the cloudboot kernel runs
cap::virtio_net_msix_function_enable_proof(kernel/src/cap/virtio_net_msix_function_enable_proof.rs) over the same virtio-net function the authority bundle and queue-materialization proofs pick. The proof re-drives the modern virtio status sequence toDRIVER_OK, materializes one manager-owned TX virtqueue (identical to the queue-materialization proof), then walks the PCI MSI-X capability mask-first: it reads the Message Control register, writesFUNCTION_MASK = 1first, reads back, writesENABLE = 1while keeping the function mask set, reads back, then cleans up by clearing both bits and reads back to assert PCI config-space MSI-X state is restored. Exposed throughmake run-cloud-provider-virtio-net-msix-function-enable.
- Spec basis: PCI SIG MSI-X §6.8.2 Message Control Register (bits 14 = Function Mask, 15 = MSI-X Enable); virtio 1.2 §2.1.2 (reset clears virtqueue state), §4.1.4.3 (common configuration queue registers), §5.1.2 (virtio-net advertises receiveq1=0, transmitq1=1).
- Implemented wire-format subset: stages 1-11 mirror the queue-
materialization proof. Stages 12-15 are this proof’s delta:
pci::interrupt_capabilities+MsixCapabilityInfo.offsetresolve the capability header;pci::try_read_config_u16reads the Message Control register atcapability_offset + 0x02; the proof asserts the pre-state hasMSIX_CONTROL_ENABLEclear, performs the mask-first write throughpci::try_write_config_u16, reads back and assertsFUNCTION_MASK = 1, ENABLE = 0, performs the enable write keeping the mask, reads back and asserts both bits are set, then performs the cleanup write that clears both bits, reads back and asserts both are clear. The proof never programs an MSI-X table entry, never claims an interrupt-dispatch slot, and never raises a device-autonomous interrupt. Token grammar:<seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.<b>-len.<hex>-q.<index>-size.<u>-msix.cap.<hex>-msix.tsize.<u>-pre.<hex>-mask.<hex>-en.<hex>-cleanup.<hex>. - Fail-closed assertions: stages 1-11 inherit the queue-
materialization proof’s five inline assertions. The MSI-X delta adds
four more. (6) Pre-state read-back has
MSIX_CONTROL_ENABLEclear. (7) Post-mask-write read-back hasFUNCTION_MASK = 1andENABLE = 0. (8) Post-enable-write read-back has both bits set. (9) Post-cleanup-write read-back has both bits clear. Every exit path (success or any intermediate failure) runs a best-effortpci::try_write_config_u16that clearsMSIX_CONTROL_ENABLEandMSIX_CONTROL_FUNCTION_MASKregardless of the result chain, then writes 0 todevice_statusand frees every allocated queue frame. Per-stage outcomes log on thevirtio-net-msix-function-enable: ok .../... failed closed: ...lines so a regression trips the boot log alongside the missing marker. - capOS mapping: focused-proof child of the TX queue
materialization proof that extends the same kernel-side
activation surface with one round of canonical mask-first MSI-X
function-level enable + cleanup. The same boot still spawns the
cloud-provider-virtio-net-tx-authority-bundle-smokeuserspace service, which receives the bundle of caps over the same BDF and asserts same-BDF + per-cap stale-handle from userspace; both companion headline markers (virtio-net-tx-authority-bundleandvirtio-net-tx-queue-materialization) are intentionally suppressed under this feature because theirqueue_setup=not-attempted/msix_function_enable=not-toggledclaims would be inaccurate now, so the MSI-X function-enable marker is the sole headline. The marker’s trailing labels (tx_descriptor=not-published,notify=not-rung,msix_function_enable=toggled-mask-first,msix_function_enable_cleanup=cleared,msix_table_write=not-performed,device_autonomous_raise=not-claimed,tx_completion=not-claimed,provider_visible_queue_address=hidden,iova_export=disabled-future-only,live_cloud=not-attempted) re-anchor the bounds the descendant slices (interrupt-dispatch slot, descriptor publication, used-ring polling, live cloud) carry. Thecap::virtio_net_msix_function_enable_proofcaller is mutually exclusive withqemu,cloud_provider_cap_waiter_proof,cloud_virtio_net_device_bringup_proof, andcloud_virtio_net_tx_authority_bundle_proofat thecap::mod.rsactivation site. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved
locally by
make run-cloud-provider-virtio-net-msix-function-enable. No GCE resources are created.
- production grant-source pickers + userspace bundle smoke keep their
plumbing), the cloudboot kernel runs
-
Production cloud-boot evidence marker (
virtio-net-tx-submit-doorbell): under the focused-proof Cargo featurecloud_virtio_net_tx_submit_doorbell_proof(which impliescloud_virtio_net_msix_function_enable_proofand transitivelycloud_virtio_net_tx_queue_materialization_proofso the bundle observer + production grant-source pickers + userspace bundle smoke + mask-first MSI-X plumbing all keep firing), the cloudboot kernel runscap::virtio_net_tx_submit_doorbell_proof(kernel/src/cap/virtio_net_tx_submit_doorbell_proof.rs) over the same virtio-net function the authority bundle, queue-materialization, and MSI-X function-enable proofs pick. The proof re-drives the modern virtio status sequence toDRIVER_OK, materializes one manager-owned TX virtqueue, enables MSI-X function-level control mask-first, then allocates one brokered TX payload frame and fills it kernel-half as a proxy for the userspace provider’s brokered fill, writes one TX descriptor at slot 0 of the descriptor ring, publishes one avail-ring entry and advancesavail.idxto 1, maps the modern virtio notify region, rings the notify doorbell exactly once for the selected TX queue, reads the post-doorbellavail.idxand device-used.idxfor visibility, then cleans up MSI-X mask-first, resets the device, and frees all four manager-owned frames. Exposed throughmake run-cloud-provider-virtio-net-tx-submit-doorbell.- Spec basis: virtio 1.2 §2.7.6 (driver-area / available ring layout
including
idxat +2 and ring slots at +4), §2.7.8 (device-area / used ring layout includingidxat +2), §4.1.4.4 (notify-cfg capability and per-queue notify address resolution asnotify_bar_base + cap.bar_offset + queue_notify_off * notify_off_multiplier), §4.1.5.2 (modern virtio doorbell: u16 write of the queue index to the per-queue notify address), §5.1.6.2 (virtio-net TX descriptor layout). The submit ordering follows virtio 1.2 §2.7.13 (driver writes the descriptor head index toavail.ring[avail.idx % size], then bumpsavail.idxafter a suitable memory barrier). - Implemented wire-format subset: stages 1-14 mirror the MSI-X
function-enable proof (status sequence, queue materialization,
mask-first MSI-X enable). Stages 15-21 are this proof’s submit
/doorbell delta:
frame::alloc_frame_zeroedallocates one payload frame,frame::hhdm_offsettranslates the manager-owned host- physical to a kernel virtual address for the kernel-proxy fill (a minimal 12-byte modern virtio-net header followed by an 8-byteb"CAPOSTX1"body, total payload length 20 bytes); slot 0 of the descriptor ring receivesaddr = payload_phys, len = 20, flags = 0, next = 0over the HHDM write;avail.ring[0] = 0andavail.idx = 1over the HHDM with a compiler fence between them; the notify region is mapped throughpci::map_bar_regionand the kernel writes the queue index1as a u16 tonotify_vaddr + queue_notify_off * notify_off_multiplier;avail.idxis read back and asserted as1, and the device-writtenused.idxis read for visibility only. The proof never polls the used ring beyond the single visibility read, never claims a TX completion, never programs an MSI-X table entry, never raises a device-autonomous interrupt, never registers anInterruptwaiter, never performs direct DMA, never programs the IOMMU, and never exports a host-physical address or IOVA. Token grammar:<seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.<b>-len.<hex>-q.<index>-size.<u>-msix.cap.<hex>-msix.tsize.<u>-pre.<hex>-mask.<hex>-en.<hex>-cleanup.<hex>-notify.bar.<b>.off.<hex>.mult.<u>.addr.<hex>-desc.<hex>-payload.<hex>-paylen.<u>-availidx.<u>-usedidx.<u>. - Fail-closed assertions: stages 1-14 inherit the MSI-X function-
enable proof’s nine inline assertions. The submit/doorbell delta
adds three more. (10) Notify region length must be large enough to
contain
queue_notify_off * notify_off_multiplier + 2. (11) Notify-region map length must cover that minimum. (12) Post- doorbellavail.idxround-trip must read back as1. Every exit path (success or any intermediate failure) runs the best-effort MSI-X cleanup, writes 0 todevice_status, asserts every TX queue- state register cleared to 0, and frees all four manager-owned frames (descriptor, avail, used, payload) regardless of the result chain. The device-used.idxread is deliberately NOT asserted: QEMU may or may not have drained the descriptor by the time the kernel reads it, and the proof’s discipline saystx_completion=not- claimedregardless of the observed value. Per-stage outcomes log on thevirtio-net-tx-submit-doorbell: ok .../... failed closed: ...lines so a regression trips the boot log alongside the missing marker. - capOS mapping: focused-proof child of the MSI-X function-enable
proof that extends the same kernel-side activation surface with one
round of single-slot descriptor publish + single avail-ring entry +
single notify doorbell ring, with no used-ring polling or
completion claim. The same boot still spawns the
cloud-provider-virtio-net-tx-authority-bundle-smokeuserspace service, which receives the bundle of caps over the same BDF and asserts same-BDF + per-cap stale-handle from userspace; all three companion headline markers (virtio-net-tx-authority-bundle,virtio-net-tx-queue-materialization, andvirtio-net-msix-function-enable) are intentionally suppressed under this feature because theirtx_descriptor=not-published/notify=not-rung/queue_setup=not-attempted/msix_function_enable=not-toggledclaims would all be inaccurate now, so the submit/doorbell marker is the sole headline. The marker’s trailing labels (tx_descriptor=published,notify=rung-once,msix_function_enable=toggled-mask-first,msix_function_enable_cleanup=cleared,msix_table_write=not-performed,device_autonomous_raise=not-claimed,tx_completion=not-claimed,provider_visible_queue_address=hidden,provider_fill=kernel-proxy-bounded,iova_export=disabled-future-only,live_cloud=not-attempted) re-anchor the bounds the descendant slices (used-ring polling, provider waiter/ack, interrupt-dispatch slot claim, live cloud) carry. Thecap::virtio_net_tx_submit_doorbell_proofcaller is mutually exclusive withqemu,cloud_provider_cap_waiter_proof,cloud_virtio_net_device_bringup_proof, andcloud_virtio_net_tx_authority_bundle_proofat thecap::mod.rsactivation site. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved
locally by
make run-cloud-provider-virtio-net-tx-submit-doorbell. No GCE resources are created.
- Spec basis: virtio 1.2 §2.7.6 (driver-area / available ring layout
including
-
Kernel-half TX polled-completion proof (predecessor of
virtio-net-userspace-provider): under the focused-proof Cargo featurecloud_virtio_net_tx_polled_completion_proof(which impliescloud_virtio_net_tx_submit_doorbell_proofand transitivelycloud_virtio_net_msix_function_enable_proof/cloud_virtio_net_tx_queue_materialization_proofso every shared plumbing gate keeps firing), the cloudboot kernel runscap::virtio_net_tx_polled_completion_proof(kernel/src/cap/virtio_net_tx_polled_completion_proof.rs) over the same virtio-net function the authority bundle, queue-materialization, MSI-X function-enable, and submit/doorbell proofs pick. The proof re-drives the modern virtio status sequence toDRIVER_OK, materializes one manager-owned TX virtqueue, enables MSI-X function-level control mask-first, allocates one brokered TX payload frame and fills it kernel-half as a proxy for the userspace provider’s brokered fill, publishes one TX descriptor + one avail-ring entry over the manager-owned ring frames, rings the notify doorbell exactly once for the selected TX queue, polls the device-authoredused.idxfrom the manager-owned used-ring frame with a bounded retry budget until it reaches1(one consumed TX descriptor), reads the post-completionavail.idxand the device-authoredused[0].(id, len), then cleans up MSI-X mask-first, resets the device, and frees all four manager-owned frames. The module is the predecessor of the userspace-submit polled-completion proof and is dropped from the compile set when the live-publish feature is on (the live-publish proof is the new headline owner ofvirtio-net-userspace-providerand exercises the same polled completion path through the userspace cap method instead of the kernel-half proxy).- Spec basis: inherits the submit/doorbell proof’s basis (virtio
1.2 §2.7.6 / §2.7.8 / §4.1.4.4 / §4.1.5.2 / §5.1.6.2 / §2.7.13).
The polled-completion delta uses §2.7.8 (used-ring layout:
idxat +2 and 8-byte(id, len)slots at +4) for both the boundedused.idxpoll and the device-authoredused[0]slot read. - Implemented wire-format subset: stages 1-20 mirror the
submit/doorbell proof (status sequence, queue materialization,
mask-first MSI-X enable, payload kernel-proxy fill, descriptor
publish, avail bump, notify doorbell ring). Stages 21-23 are this
proof’s polled-completion delta: the manager-owned used-ring
used.idxHHDM read is wrapped in a bounded retry loop (withcore::hint::spin_loop()between iterations) that converges on the target completion count1, the post-completionavail.idxHHDM round-trip is asserted as1, and the device-authoredused[0].id/used[0].lenare read with anAcquirecompiler fence on the success path so the slot data is observed consistently with theused.idxbump. Token grammar:<seg>.<bus>.<dev>.<fn>-vendor.<v>-dev.<d>-bar.<b>-len.<hex>-q.<index>-size.<u>-msix.cap.<hex>-msix.tsize.<u>-pre.<hex>-mask.<hex>-en.<hex>-cleanup.<hex>-notify.bar.<b>.off.<hex>.mult.<u>.addr.<hex>-desc.<hex>-payload.<hex>-paylen.<u>-availidx.<u>-usedidx.<u>-polled.iter.<u>-usedid.<u>-usedlen.<u>. - Fail-closed assertions: stages 1-20 inherit the submit/doorbell
proof’s twelve inline assertions. The polled-completion delta adds
three more. (13) The bounded
used.idxpoll must converge on the target1within the retry budget; budget exhaustion fails closed and reports the last observed value. (14) The post-completionavail.idxHHDM round-trip must still read back as1. (15) The device-authoredused[0].idmust equal the published descriptor head index0;used[0].lenis recorded for visibility but is deliberately NOT asserted (virtio-net leaveslenat0for TX device-readable chains, but the kernel does not gate the proof on that). Every exit path (success or any intermediate failure) runs the best-effort MSI-X cleanup, writes 0 todevice_status, asserts every TX queue-state register cleared to 0, and frees all four manager-owned frames (descriptor, avail, used, payload) only after the final reset read-back is confirmed; if reset cannot be confirmed the frames stay retained rather than being returned while the device may still DMA them. Per-stage outcomes log on thevirtio-net-userspace-provider: ok .../... failed closed: ...lines so a regression trips the boot log alongside the missing marker. - capOS mapping: focused-proof child of the submit/doorbell proof
that extends the same kernel-side activation surface with one
round of bounded
used.idxpolling + one accounted completion + one device-authoredused[0]slot read, paired with the userspace bundle smoke’sInterruptcap handle-lifecycle discipline on the same MSI-X BDF (Interrupt.inforound-trip identity assertion +release+ post-releaseInterrupt.infofail-closed). That cap-side pairing covers cap-handle identity and post-release stale-handle rejection on the productionInterruptcap; it deliberately does NOT exerciseInterrupt.wait/acknowledge, because the productionInterruptCapProd::waitandInterruptCapProd::acknowledgepaths are unimplemented in the non-qemucloud kernel and fail closed (kernel/src/cap/interrupt_prod.rs). Real waiter/ack pairing on the virtio-net TX MSI-X route is deferred to a future child that either ports thecap::provider_cap_waiter_proofkernel-injected-dispatch + deferred-EOI discipline onto this route or programs an actual MSI-X table entry + dispatch slot. All four companion headline markers (virtio-net-tx-authority-bundle,virtio-net-tx-queue-materialization,virtio-net-msix-function-enable, andvirtio-net-tx-submit-doorbell) are intentionally suppressed under this feature because theirtx_descriptor=not-published/notify=not-rung/queue_setup=not-attempted/msix_function_enable=not-toggled/tx_completion=not-claimedclaims would all be inaccurate now, so the polled-completion marker is the sole headline. The marker’s trailing labels (tx_descriptor=published,notify=rung-once,msix_function_enable=toggled-mask-first,msix_function_enable_cleanup=cleared,msix_table_write=not-performed,device_autonomous_raise=not-claimed,tx_completion=polled-used-ring,provider_visible_queue_address=hidden,provider_fill=kernel-proxy-bounded,iova_export=disabled-future-only,live_cloud=not-attempted) re-anchor the bounds the descendant slices (interrupt-dispatch slot claim, realInterrupt.waitwaiter, live cloud) carry. Thecap::virtio_net_tx_polled_completion_proofcaller is mutually exclusive withqemu,cloud_provider_cap_waiter_proof,cloud_virtio_net_device_bringup_proof, andcloud_virtio_net_tx_authority_bundle_proofat thecap::mod.rsactivation site. The marker keepsdevice_autonomous_raise=not-claimedbecause the proof never enables the per-vector MSI-X table entry, never registers anInterrupt.waitwaiter, and observes the completion strictly through the device-authored used-ring update. - QEMU-emulable vs hardware-only: predecessor-only. The active
make run-cloud-provider-virtio-netheadline target switched to the userspace-submit polled-completion proof below, which exercises the same polled-completion path through the userspace cap method and supersedes this kernel-half proxy.
- Spec basis: inherits the submit/doorbell proof’s basis (virtio
1.2 §2.7.6 / §2.7.8 / §4.1.4.4 / §4.1.5.2 / §5.1.6.2 / §2.7.13).
The polled-completion delta uses §2.7.8 (used-ring layout:
-
Production cloud-boot evidence marker (
virtio-net-userspace-provider): under the focused-proof Cargo featurecloud_virtio_net_tx_dmabuffer_live_publish_proof(which impliescloud_virtio_net_tx_polled_completion_proofand transitively the submit/doorbell, MSI-X function-enable, queue-materialization, and authority-bundle proofs so every shared plumbing gate stays compiled in), the cloudboot kernel runscap::virtio_net_tx_dmabuffer_live_publish_proof(kernel/src/cap/virtio_net_tx_dmabuffer_live_publish_proof.rs) over the same virtio-net function the predecessor picks. Unlike the kernel-half polled-completion predecessor, the kernel-side proof here splits the work into two phases: at-bootinit()stages the modern virtio status sequence + TX queue materialization + MSI-X mask-first enable + notify mapping, leaving the device inDRIVER_OKwith MSI-X enabled-but-globally-masked; the per-callattempt_live_publishruns from the non-qemudevice-manager stub’svalidate_dmabuffer_submit_descriptor_admissionwhen the userspacecloud-provider-virtio-net-tx-dmabuffer-live-publish-smokeservice’sDMABuffer.submitDescriptoris admitted (queue == 1, descriptor_id == 0, length <= PAGE_SIZE, no user mapping live, no in-flight submit, kernel-knownDmaBufferHandle). The cap method resolves the buffer’s host-physical bounce-buffer page, authors one TX descriptor + avail-ring entry over the manager-owned ring frames, rings the notify doorbell exactly once, polls the device-authoredused.idxwith the same bounded budget the polled-completion predecessor uses, readsused[0].(id, len), tears the device down (MSI-X mask-first cleanup + device reset + queue-state register read-back asserted to zero, three manager-owned queue frames freed), and emits onecloudboot-evidence: virtio-net-userspace-provider <token>headline marker withprovider_fill=userspace-brokered-bufferanchoring the userspace-driven submit boundary. Exposed throughmake run-cloud-provider-virtio-net– the terminal local harness for the virtio-net userspace-provider chain.- Spec basis: inherits the polled-completion proof’s basis (virtio
1.2 §2.7.6 / §2.7.8 / §4.1.4.4 / §4.1.5.2 / §5.1.6.2 / §2.7.13).
The live-publish delta drives the same descriptor/avail/notify
write sequence and the same bounded
used.idxpoll, but the descriptor’saddrfield is the userspace-allocatedDMABuffer’s host-physical bounce-buffer page resolved through the kernel DMA ledger, not a manager-allocated payload frame. - Implemented wire-format subset: at-boot
init()covers stages 1-14 of the polled-completion sequence (status sequence + queue materialization + mask-first MSI-X enable + notify mapping) and stashes the staged state. Per-callattempt_live_publishcovers stages 15-24: descriptor publish (withdesc[0].addr = payload_physfrom the userspaceDMABuffer), avail-ring entry +avail.idxbump with a release compiler fence, notify doorbell ring, used-ringused.idxbounded poll,used[0]slot read with an acquire compiler fence, MSI-X cleanup, device reset, queue-state register read-back, and queue-frame release. Token grammar addspool.<u>.gen.<u>-buf.<u>.gen.<u>-payload.<hex>-paylen.<u>so the manager-issued single-slot bounce-buffer pool’s slot/generation pair, the buffer’s slot/generation pair, and the resolved payload host-physical address are observable from the marker; the polled-completion marker’sdesc.<hex>field is intentionally not in the live-publish marker because the per-call descriptor write happens after the marker emission window’s boundaries. - Fail-closed assertions: stages 1-14 inherit the polled-completion
proof’s assertions for status/queue/MSI-X bring-up. Stages 15-24
inherit its assertions for descriptor publish, doorbell, polled
completion, MSI-X cleanup, and reset. The per-call admission gate
adds five more, surfaced through the cap-side
DmaBufferSubmitDescriptorAdmissionshape: (1)queue != 1fails closed withdmabuffer-tx-queue-required/non-tx-queue-rejected(RX is rejected explicitly; queue >= 2 trips the standardqueue-out-of-rangerequest gate). (2)descriptor_id != 0fails closed withdescriptor-id-out-of-range. (3)length > PAGE_SIZEfails closed withlength-exceeds-buffer. (4) A live userspace VMA fails closed withdmabuffer-mapping-live(the cap-sideblock_submit_for_live_mappingshort-circuit handles this before the device-manager runs; the stub defends in depth). (5) A secondsubmitDescriptoron the same buffer without an interveningfreeBufferfails closed withdmabuffer-descriptor-already-inflight; the in-flight slot is dropped only when the parked-buffer record drops onfreeBuffer. A post-freeBuffersubmitDescriptorfails closed with the standard stale-handle error. - capOS mapping: terminal local headline that flips the
descriptor
addrsource from a manager-allocated kernel-proxy payload frame to the userspace-allocatedDMABuffer’s host-physical bounce-buffer page resolved through the kernel DMA ledger. The marker’s trailing labelprovider_fill=userspace-brokered-bufferreplaces the kernel-half polled-completion predecessor’sprovider_fill=kernel-proxy-boundedto reflect the change. All four companion headline markers (virtio-net-tx-authority-bundle,virtio-net-tx-queue-materialization,virtio-net-msix-function-enable, andvirtio-net-tx-submit-doorbell) are suppressed because the userspace-submit polled-completion proof is the new headline owner; the predecessorcap::virtio_net_tx_polled_completion_proofmodule is dropped from the compile set under this feature so its competing emission of the samevirtio-net-userspace-providermarker cannot fire. Thecap::virtio_net_tx_dmabuffer_live_publish_proofcaller is mutually exclusive withqemu,cloud_provider_cap_waiter_proof,cloud_virtio_net_device_bringup_proof, andcloud_virtio_net_tx_authority_bundle_proofat thecap::mod.rsactivation site. The marker keepsdevice_autonomous_raise=not-claimed,msix_table_write=not-performed, andlive_cloud=not-attemptedbecause the proof never enables the per-vector MSI-X table entry, never registers anInterrupt.waitwaiter, and observes the completion strictly through the device-authored used-ring update. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved
locally by
make run-cloud-provider-virtio-net. No GCE resources are created.
- Spec basis: inherits the polled-completion proof’s basis (virtio
1.2 §2.7.6 / §2.7.8 / §4.1.4.4 / §4.1.5.2 / §5.1.6.2 / §2.7.13).
The live-publish delta drives the same descriptor/avail/notify
write sequence and the same bounded
-
MSI-X wait/ack (
cap::virtio_net_tx_msix_wait_ack_proof): Carries the userspace-submit polled-completion delta one authority step further: same brokered TX submit boundary, but the userspace-observed completion event is the provider-cap-side wake from a kernel-injected dispatch on the bound virtio-net TX MSI-X route. Active under the non-qemucloud kernel built with the Cargo featurecloud_virtio_net_tx_msix_wait_ack_proof(which impliescloud_virtio_net_tx_dmabuffer_live_publish_proofand its predecessors). The wait/ack proof’s at-bootinitruns after the live-publish proof’sinit: it registers + claims an MSI-X route on the same virtio-net BDF under theManagerGrantSourceowner, maps the MSI-X table BAR kernel-side, writes table entryPROOF_TABLE_ENTRY = 1mask-first per PCI 3.0 §6.8.2, attaches the route to the device manager, arms the deferred-LAPIC-EOI gate, and unmasks the route + entry. The PCI function-level MSI-X enable bit stays set with the function mask still asserted (held by the live-publish proof’s mask-first toggle), so the virtio-net device cannot autonomously raise an interrupt on the bound route. The cloudboot manifest spawns thecloud-provider-virtio-net-tx-msix-wait-ack-smokeuserspace service, which receives oneConsole+DeviceMmio+DMAPool+Interruptbundle (theInterruptsource resolves through the wait/ack proof’s grant source, replacinginterrupt_grant_source_produnder this feature), asserts theInterrupt.infoidentity + labels (bootstrap_grant=virtio-net-tx-msix-wait-ack-proof,wait=kernel-injected-dispatch-wait,acknowledge=kernel-injected-deferred-eoi-acknowledge), drives the same brokeredDMABuffer.submitDescriptorchain the predecessor exercises, then callsInterrupt.wait(the cap’sinvoke_waitrunsdevice_interrupt::handle_lapic_deliveryand returns one delivery withdelivery_count_after == delivery_count_before + 1plus one armed deferred LAPIC EOI), callsInterrupt.acknowledge(the cap retires the deferred LAPIC EOI throughacknowledge_deferred_lapic_eoi_for_route,ack_delta == 1,pending_after == 0), frees theDMABuffer, and releases theInterruptcap. The kernel-sideon_releasethen runs the masked-no-wake + reassign + stale-handle assertion chain on the bound route (mirroringcap::provider_cap_waiter_proof’s discipline) and emits exactly onecloudboot-evidence: virtio-net-userspace-provider <token>headline marker combining the publish outcome (recorded inPUBLISH_OUTCOMEby the predecessor’sattempt_live_publishwhen the feature is on) with the wait/ack delivery counts, the reassigned route generation, and the stale-handle / stale-token assertion booleans. The marker’s trailing labels differ from the polled-completion predecessor in two places:tx_completion=msix-wait-ack-injectedreplacestx_completion=polled-used-ring(the userspace-observed completion event is the cap-waiter dispatch; the polled used-ring still runs kernel-side as defence-in-depth), andmsix_table_write=performed-masked-firstreplacesmsix_table_write=not-performed(the wait/ack proof’sinitprogrammed one MSI-X table entry). All other discipline labels are preserved:device_autonomous_raise=not-claimed,provider_visible_queue_address=hidden,provider_fill=userspace-brokered-buffer,iova_export=disabled-future-only,live_cloud=not-attempted. The predecessor live-publish proof’s standalone marker emission is suppressed under this feature so the headline marker name cannot fire twice. Thecap::virtio_net_tx_msix_wait_ack_proofactivation site is mutually exclusive withqemu,cloud_provider_cap_waiter_proof,cloud_virtio_net_device_bringup_proof, andcloud_virtio_net_tx_authority_bundle_proof. Device-autonomous MSI-X delivery (programming the virtio queue’squeue_msix_vectorfor a hardware-raised TX completion interrupt and the broader production dispatch-slot proof), RX path, multi-queue operation, full NIC readiness, and any live-GCE evidence stay out of scope.- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved
locally by
make run-cloud-provider-virtio-net-tx-msix-wait-ack. No GCE resources are created.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved
locally by
-
RX MSI-X wait/ack (
cap::virtio_net_rx_msix_wait_ack_proof): the RX analogue of the TX MSI-X wait/ack proof above. Active under the non-qemucloud kernel built with the Cargo featurecloud_virtio_net_rx_msix_wait_ack_proof(which impliescloud_virtio_net_rx_buffer_post_polled_completion_proofand its predecessors). The RX completion is staged entirely kernel-side at boot: the RX buffer-post proof’sreport()stages the device throughDRIVER_OK, posts one manager-owned device-writable RX buffer, drives the ARP TX SLIRP stimulus, polls one real device->host RX DMA (used.idx == 1,used[0].len > 0), and – under this feature – additionally holds the PCI function-level MSI-X enable mask-first (hold_msix_function_enable_mask_first:FUNCTION_MASK = 1thenENABLE = 1, held, not cleaned up) and records the publish outcome into the wait/ack proof’sPUBLISH_OUTCOMEslot instead of emitting its standalonevirtio-net-rx-buffer-postheadline. The wait/ack proof’s at-bootinitthen drives the graduated always-builtcap::interrupt_programmed::program_attach_arm_unmaskover MSI-X table entry 0 (the RX queue’s per-queue config vector, virtio-pci §4.1.5.1.2) on the same virtio-net BDF under theManagerGrantSourceowner: register + claim + write table entry 0 mask-first per PCI 3.0 §6.8.2 + manager attach + deferred-LAPIC-EOI arm + route + entry unmask, tearing the route back down viateardownon any error or lost-init race. The device’s RXqueue_msix_vectorstaysVIRTIO_MSI_NO_VECTORand the function mask stays asserted, so the device cannot autonomously raise an interrupt on the bound route. The cloudboot manifest spawns thecloud-provider-virtio-net-rx-msix-wait-ack-smokeuserspace service, which receives oneConsole+Interruptbundle (the RX completion is staged kernel-side, so – unlike the TX wait/ack provider – it needs noDMAPool/DeviceMmiocap; theInterruptsource resolves through the wait/ack proof’s grant source, replacinginterrupt_grant_source_produnder this feature), asserts theInterrupt.infoidentity + labels (bootstrap_grant=virtio-net-rx-msix-wait-ack-proof,wait=kernel-injected-dispatch-wait,acknowledge=kernel-injected-deferred-eoi-acknowledge), callsInterrupt.wait(the cap’sinvoke_waitruns the graduateddevice_interrupt::wait_kernel_injected_dispatchand returns one delivery withdelivery_count_after == delivery_count_before + 1plus one armed deferred LAPIC EOI), callsInterrupt.acknowledge(the cap retires the deferred LAPIC EOI throughacknowledge_deferred_lapic_eoi_for_route,ack_delta == 1,pending_after == 0), and releases theInterruptcap. The kernel-sideon_releasethen runs the masked-no-wake + reassign + stale-handle assertion chain on the bound RX route and emits exactly onecloudboot-evidence: virtio-net-userspace-provider <token>headline marker combining the RX publish outcome with the wait/ack delivery counts, the reassigned route generation, and the stale-handle / stale-token assertion booleans. The marker’s trailing labels differ from the RX buffer-post predecessor in three places:rx_completion=msix-wait-ack-injectedreplacesrx_completion=polled-used-ring(the userspace-observed completion event is the cap-waiter dispatch; the polled used-ring still runs kernel-side as defence-in-depth),msix_rx_function_enable=toggled-mask-firstreplacesmsix_rx_function_enable=not-toggled(the staging now holds the function-level MSI-X enable mask-first), andmsix_table_write=performed-masked-firstreplacesmsix_table_write=not-performed(the wait/ack proof’sinitprogrammed one MSI-X table entry). All other discipline labels are preserved:device_autonomous_raise=not-claimed,provider_visible_queue_address=hidden,provider_rx_submit=kernel-proxy-bounded,iova_export=disabled-future-only,live_cloud=not-attempted. The predecessor RX buffer-post proof’s standalone marker emission is suppressed under this feature so the headline marker name cannot fire twice. Thecap::virtio_net_rx_msix_wait_ack_proofactivation site is mutually exclusive withqemu,cloud_provider_cap_waiter_proof,cloud_virtio_net_device_bringup_proof,cloud_virtio_net_tx_authority_bundle_proof, and every TX/NVMe Interrupt-source proof feature. Device-autonomous RX MSI-X delivery (programming the virtio queue’s RXqueue_msix_vector), provider-driven RX submit, multi-queue operation, full NIC readiness, and any live-GCE evidence stay out of scope.- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved
locally by
make run-cloud-provider-virtio-net-rx-msix-wait-ack. No GCE resources are created.
- QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved
locally by
-
RX provider-driven buffer submit (
cap::virtio_net_rx_userspace_submit_proof): the RX analogue of the TX DMABuffer live-publish proof, carried one authority step past the RX MSI-X wait/ack proof above. Active under the non-qemucloud kernel built with the Cargo featurecloud_virtio_net_rx_userspace_submit_proof(which impliescloud_virtio_net_rx_buffer_post_polled_completion_proofand its predecessors). Unlike the RX MSI-X wait/ack proof, the RX receive buffer is no longer a manager-owned bounce page filled kernel-side: it is the userspace provider’s brokeredDMABuffer, posted to the RX avail ring throughDMABuffer.submitDescriptor(queue=0). The feature drops the RX buffer-post module’s at-boot kernel-proxyreport(); instead this proof’s self-containedinitstages the device (status sequence + RX queue 0 + TX queue 1 materialization + held mask-first MSI-X function enable + notify map), allocates NO RX payload frame, and programs the kernel-injected RX MSI-X route over table entry 0 through the graduatedcap::interrupt_programmed::program_attach_arm_unmasksurface (same as the wait/ack proof). The cloudboot manifest spawns thecloud-provider-virtio-net-rx-userspace-submit-smokeuserspace service, which receives oneConsole+DeviceMmio+DMAPool+Interruptbundle (the RX provider, unlike the kernel-proxy RX wait/ack provider, needs theDMAPool/DeviceMmiocaps to allocate and submit its brokeredDMABuffer). The provider assertsInterrupt.infoidentity + labels (bootstrap_grant=virtio-net-rx-userspace-submit-proof), allocates one brokered bounce-bufferDMABuffer(NOT mapped or written before submit – the device is the RX writer), and callsDMABuffer.submitDescriptor(queue=0, descriptor_id=0, length=2048). The non-qemudevice-manager admission gate matches the parked bounce-buffer handle, validates the request shape (queue == 0, descriptor_id == 0, length <= PAGE_SIZE, no live user mapping, no in-flight submit), resolves the buffer’s kernel-known host-physical bounce-buffer page, and drivesattempt_rx_submit: it authors the RXdesc[0] = (provider_buffer_phys, length, flags=VIRTQ_DESC_F_WRITE, next=0)+ avail-ring entry over the manager-owned RX ring frames, rings the RX notify doorbell once, drives the ARP TX SLIRP stimulus (kernel-half), polls one real device->host RX DMA (used.idx == 1,used[0].len > 0), reads the observed EtherType, resets the device, and frees the manager-owned RX/TX ring + TX payload frames. The provider then observes the completion throughInterrupt.wait(kernel-injected dispatch,delivery_count_after == delivery_count_before + 1) andInterrupt.acknowledge(deferred LAPIC EOI retired,ack_delta == 1), re-maps itsDMABufferR/O and reads a non-zero received EtherType through its own mapping, unmaps, frees the buffer, and releases theInterruptcap. The kernel-sideon_releaseruns the masked-no-wake + reassign + stale-handle assertion chain on the bound RX route and emits exactly onecloudboot-evidence: virtio-net-userspace-provider <token>headline marker combining the RX publish outcome with the wait/ack delivery counts.- Spec basis: inherits the RX buffer-post / MSI-X wait/ack basis (virtio
1.2 §2.7.6 / §2.7.8 / §4.1.5.2 / §5.1.6, virtio-pci §4.1.5.1.2,
PCI 3.0 §6.8.2). The userspace-submit delta drives the same RX
descriptor/avail/notify write sequence and the same bounded
used.idxpoll, but the descriptor’saddrfield is the userspace-allocatedDMABuffer’s host-physical bounce-buffer page resolved through the kernel DMA ledger, not a manager-allocated payload frame. - Implemented wire-format subset: at-boot
init()covers the status sequence + RX/TX queue materialization + mask-first MSI-X function enable + notify mapping + RX MSI-X route program. Per-callattempt_rx_submitcovers the RX descriptor publish (desc[0].addr = payload_physfrom the userspaceDMABuffer,flags = VIRTQ_DESC_F_WRITE), avail-ring entry +avail.idxbump with a release compiler fence, RX notify doorbell ring, the ARP TX stimulus, the used-ringused.idxbounded poll,used[0]slot read with an acquire compiler fence, the observed EtherType read, device reset, queue-state register read-back, and queue-frame release. Token grammar replaces the wait/ack marker’srxpay.<hex>field withpool.<u>.gen.<u>-buf.<u>.gen.<u>-payload.<hex>-rxlen.<u>so the manager-issued single-slot bounce-buffer pool’s slot/generation pair, the buffer’s slot/generation pair, the resolved payload host-physical address, and the requested receive length are observable from the marker. - Fail-closed assertions: inherits the RX buffer-post proof’s assertions for
status/queue/MSI-X bring-up and the RX descriptor publish + ARP stimulus +
polled completion + reset, and the wait/ack proof’s masked-no-wake +
reassign + stale-handle / stale-token assertion chain. The per-call
admission gate adds the
DMABuffer.submitDescriptorrequest-shape checks surfaced through the cap-sideDmaBufferSubmitDescriptorAdmissionshape:queue != 0fails closed withdmabuffer-rx-queue-required/non-rx-queue-rejected(TX is rejected explicitly; queue >= 2 trips the standardqueue-out-of-rangerequest gate),descriptor_id != 0fails closed withdescriptor-id-out-of-range,length > PAGE_SIZEfails closed withlength-exceeds-buffer, a live userspace VMA fails closed withdmabuffer-mapping-live, and a secondsubmitDescriptorwithout an interveningfreeBufferfails closed withdmabuffer-descriptor-already-inflight. The marker’s trailing labels flipprovider_rx_submitfromkernel-proxy-boundedtouserspace-brokered-bufferand addhost_physical_user_visible=0/direct_dma=blocked; the RX device-write DMA discipline (device_autonomous_raise=not-claimed,provider_visible_queue_address=hidden,iova_export=disabled-future-only,live_cloud=not-attempted) is preserved. Teardown confirms a device reset BEFORE the provider’sfreeBufferscrubs/frees the brokered buffer page. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-provider-virtio-net-rx-userspace-submit. No GCE resources are created. Device-autonomous RX MSI-X delivery (programming the virtio queue’s RXqueue_msix_vector), multi-queue operation, full NIC readiness, and any live-GCE evidence stay out of scope.
- Spec basis: inherits the RX buffer-post / MSI-X wait/ack basis (virtio
1.2 §2.7.6 / §2.7.8 / §4.1.5.2 / §5.1.6, virtio-pci §4.1.5.1.2,
PCI 3.0 §6.8.2). The userspace-submit delta drives the same RX
descriptor/avail/notify write sequence and the same bounded
-
RX production-IDT-dispatch waiter wake (
cap::virtio_net_rx_production_idt_dispatch_proof): carries the RX userspace-submit proof one authority step further. Active under the non-qemucloud kernel built with the Cargo featurecloud_virtio_net_rx_production_idt_dispatch_proof(which impliescloud_virtio_net_rx_userspace_submit_proofand its predecessors). The RX publish half – device staging, provider-submitted brokered receive buffer, SLIRP stimulus, one real device->host RX DMA, polledused.idx– is reused unchanged from the userspace-submit predecessor (its module is dropped and this proof becomes the new headline owner; the device-manager admission routesattempt_rx_submithere). The load-bearing change is the production IDT dispatch wiring: the non-qemuarch::x86_64::lapic::handle_device_interruptarm previously discarded real device-MSI vectors with a bareeoi(), so a real interrupt-gate entry could never reach a deferred-EOI dispatch slot or wake anInterrupt.wait. This proof wires that arm to record an IDT handler entry and route the vector throughdevice_interrupt::handle_lapic_delivery, honoringeoi_deferred(the deferred-EOI path owns the EOI write, retired byacknowledge) and keeping the bareeoi()fallback for unregistered/out-of-pool vectors. TheInterrupt.waitcap method then fires ONE realINT $vectoron the bound RX route’s vector (IF cleared – the syscall context runs IF-cleared by SFMASK design andINT nignores IF; see the Fail-closed assertions bullet below) – graduating the qemu-onlyarch::lapic::inject_real_lapic_int_for_proofmechanic to this proof feature – so the waiter wakes through a real CPU interrupt-gate entry, not the synchronousdevice_interrupt::wait_kernel_injected_dispatchcall every prior RX/cap-waiter proof used.- Spec basis: Intel SDM Vol. 3 interrupt-gate semantics (an interrupt gate
clears
EFLAGS.IFon entry) and Vol. 2INT ndescription (which ignoresEFLAGS.IF); inherits the RX userspace-submit basis (virtio 1.2 §2.7.6 / §2.7.8 / §4.1.5.2 / §5.1.6, virtio-pci §4.1.5.1.2, PCI 3.0 §6.8.2) for the unchanged publish half. - Implemented wire-format subset: identical to the userspace-submit proof
for the publish half. The new surface is kernel interrupt-path wiring (no
new device wire-format): the production
handle_device_interruptnon-qemuarm (kernel/src/arch/x86_64/lapic.rs), a per-vector IDT handler-entry counter (device_interrupt::record_idt_handler_entry/idt_handler_entry_count), and the graduatedinject_real_lapic_int_for_proof. The device’s RXqueue_msix_vectorstaysVIRTIO_MSI_NO_VECTORand the PCI function mask stays held; theINTis fired by this proof, NOT by the device. - Fail-closed assertions: inherits the userspace-submit proof’s publish +
masked-no-wake + reassign + stale-handle / stale-token chain, and adds:
waitassertsdelivery_count_after == delivery_count_before + 1, the per-vector IDT handler-entry count advanced by exactly one (idt_handler_observed), the real-delivery delta equals the IDT-entry delta (direct_dispatch_call_count_unchanged, i.e. no fallback synchronous dispatch was used), and one deferred LAPIC EOI is pending; the masked-route assertion now fires a realINTthrough the masked route and asserts NOdelivery_countadvance and NO deferred-EOI pending/ack change. The cap-dispatch syscall context runs withEFLAGS.IFcleared by SFMASK design (arch::x86_64::syscall) andINT nignores IF, soint_fired_with_ifis recorded as observed evidence only (false in this build) and is NOT a gating condition. The headline marker flipsrx_completiontoreal-idt-interrupt-gate-wakeand addswaiter_wake=real-idt-interrupt-gate,idt_dispatch=production-wired, plus the trailing-idthandler.1-directcall.1-iffired.<0|1>-maskedint.1token booleans;device_autonomous_raise=not-claimedandlive_cloud=not-attemptedare preserved. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-provider-virtio-net-rx-production-idt-dispatch. No GCE resources are created. Flipping the device’s RXqueue_msix_vector+ clearing the function mask so the DEVICE raises the MSI – reusing this proof’s now-proven production dispatch path – is now covered by the device-autonomous MSI-X proof below. Live-GCE RX evidence remains future work.
- Spec basis: Intel SDM Vol. 3 interrupt-gate semantics (an interrupt gate
clears
-
RX device-autonomous MSI-X delivery proof (
cap::virtio_net_rx_device_autonomous_msix_proof): carries the production-IDT-dispatch proof one authority step further and proves a device-raised virtio-net RX MSI-X reaches the production IDT path under local QEMU/KVM. Active under the non-qemucloud kernel built withcloud_virtio_net_rx_device_autonomous_msix_proof(which impliescloud_virtio_net_rx_userspace_submit_proofand reuses the same brokered RX publish path). The module enables PCI memory-space decoding and bus mastering, stages RX queue 0 and TX queue 1, enables MSI-X function-level control mask-first, programs RX queue 0COMMON_QUEUE_MSIX_VECTOR = 0, programs MSI-X table entry 0 throughcap::interrupt_programmed::program_attach_arm_unmask, clears the PCI function mask, and then submits one userspace-owned RX bounce buffer plus the ARP TX stimulus. The RX DMA succeeds (used[0].len > 0, observed EtherType0x0806), proving the data path remains the same brokered provider path.- Spec basis: virtio 1.2 §4.1.5.1.2 (modern per-queue MSI-X vector is the MSI-X table entry index), virtio 1.2 §2.7.6 / §2.7.8 / §5.1.6 for the RX descriptor/avail/used path, and PCI 3.0 §6.8.2 for MSI-X table entry and Message Control masking semantics.
- Implemented wire-format subset: the proof writes only the PCI COMMAND memory-space/bus-master bits, the RX queue config-vector selector, the same split-ring RX/TX descriptors and avail entries as the userspace-submit proof, one RX and one TX notify, one MSI-X table entry, and the PCI MSI-X function mask bit. It does not expose host-physical or IOVA addresses to userspace, does not program an IOMMU, and does not add multi-queue or full-NIC readiness.
- Proof assertions:
make run-cloud-provider-virtio-net-rx-device-autonomous-msixnow assertspci_command=0x0107, one device-raisedInterrupt.waitdelivery on vector0x50withint_injected=0,delivery_count_before=0,delivery_count_after=1,idt_handler_observed=true,eoi_deferred=true, and one deferred-EOIInterrupt.acknowledge(ack_delta=1). The finalcloudboot-evidence: virtio-net-userspace-providermarker includespcicmd.0107,idthandler.1,directcall.1,devraise.1,intinjected.0, andrx_completion=device-autonomous-msix. Closeout validation also keeps the RX production-IDT dispatch, RX userspace-submit, provider cap-waiter,run-net, and default boot-smoke gates green under local QEMU. - QEMU/KVM diagnosis: earlier bpftrace evidence showed QEMU reached
msix_notify(vector=0)with an unmasked MSI-X entry and prepared0xfee00000/0x50, but KVM did not accept vector0x50. The missing precondition was explicit PCI COMMAND bus-master enablement in this proof path; after the proof enables memory-space decoding + bus mastering, local QEMU/KVM delivers the MSI-X to the guest IDT path.
-
RX polled used-ring completion (no injected dispatch) (
cap::virtio_net_rx_polled_completion_proof): the first virtio-net proof whose RX completion signal is real driver progress, not an injected proxy. Active under the non-qemucloud kernel built with the Cargo featurecloud_virtio_net_rx_polled_completion_proof(which impliescloud_virtio_net_rx_userspace_submit_proofand its predecessors). The RX publish half – device staging, provider-submitted brokered receive buffer, SLIRP stimulus, one real device->host RX DMA, polledused.idx– is reused unchanged from the userspace-submit predecessor (its module is dropped and this proof becomes the new headline owner; the device-manager admission routesattempt_rx_submithere). The load-bearing change is on the completion path: every prior virtio-net/cap-waiter proof signalled theInterrupt.waitcompletion throughdevice_interrupt::wait_kernel_injected_dispatch(a kernel-side dispatch-slot proxy) or, in the IDT-dispatch proof, a firedINT $vector– neither produced by real driver progress. Herevirtio_net_rx_polled_completion_proof::invoke_waitinstead reports the completion from the already-latched polled used-ring state captured duringattempt_rx_submit(thePublishedRxused_id == 0/used_len > 0/polled_used_idx >= POLL_TARGET_USED_IDX, latched from the predecessor’s reusedpoll_used_idxunder itsAcquirefence): there is NOwait_kernel_injected_dispatchcall and NOinject_real_lapic_int_for_proofanywhere in the wait/ack path, and zero kernel-injected interrupts.invoke_acknowledgeis a poll-confirmation no-op (no deferred LAPIC EOI to retire, since no interrupt was taken). The bound RX MSI-X route is still programmed at boot but is used ONLY by the release-time masked-no-wake/stale-handle assertion chain.- Spec basis: virtio 1.2 §2.7.8 (used ring is a memory-visible structure the
device advances) and §2.7.10 (the
VIRTQ_AVAIL_F_NO_INTERRUPTdriver flag the predecessor already sets, so the device performs no MSI either); inherits the RX userspace-submit basis (virtio 1.2 §2.7.6 / §4.1.5.2 / §5.1.6, virtio-pci §4.1.5.1.2, PCI 3.0 §6.8.2) for the unchanged publish half. - Implemented wire-format subset: identical to the userspace-submit proof
for the publish half; no new device wire-format and no new kernel
interrupt-path wiring. The completion is a pure memory read of the latched
used-ring state plus a
device_interrupt::snapshot_dispatch_slotbefore/afterdelivery_countcomparison. - Fail-closed assertions: inherits the userspace-submit proof’s publish +
masked-no-wake + reassign + stale-handle / stale-token chain, and replaces
the wait/ack injection assertions with their polled inverse:
waitasserts the latchedused_id == 0/used_len > 0/polled_used_idx >= POLL_TARGET_USED_IDX(completion_observed) ANDdelivery_count_after == delivery_count_before(int_injected=0, no kernel dispatch advanced);acknowledgeasserts no deferred LAPIC EOI was pending and none was retired (hardware_dispatch_ack_delta == 0,eoi_written=false); andon_releaserequiresprovider_observed_dispatch == 0andprovider_observed_ack == 0(the inverse of the injected predecessor’s>= 1). The headline marker flipsrx_completiontopolled-used-ringand addswaiter_wake=polled-used-ring,int_injected=0, with the trailing-deliv.0-ack.0token booleans;device_autonomous_raise=not-claimedandlive_cloud=not-attemptedare preserved. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-provider-virtio-net-rx-polled-completion. No GCE resources are created. Graduating this polled provider off the per-proof feature onto the defaultsystem.cuecloudboot manifest, programming the device’s RXqueue_msix_vectorfor device-autonomous delivery, and any live-GCE RX evidence are future work.
- Spec basis: virtio 1.2 §2.7.8 (used ring is a memory-visible structure the
device advances) and §2.7.10 (the
-
Polled RX+TX provider, always-built off the per-proof feature (
cap::virtio_net_polled_provider): graduates the polled provider above into the production compile set. The module is always-built in the default non-qemucloud kernel (cfg(not(feature = "qemu")), nocloud_*_prooffeature), derived fromcap::virtio_net_rx_polled_completion_proofwith the proof gate removed and the feature-gatedvirtio_net_tx_authority_bundle_proofbundle-observer calls dropped (the per-grant identity is still recorded through the always-builthardware_auditcap-audit). The polled completion behaviour (read the latchedpoll_used_idxused-ring state ininvoke_wait, nowait_kernel_injected_dispatch, noinject_real_lapic_int_for_proof, no-opinvoke_acknowledge) is identical; only the activation switch changes from a Cargo feature to a manifest-observable condition.kernel::run_initcallsvirtio_net_polled_provider::initonly when the booted manifest declares thecloud-provider-virtio-net-polled-provider-default-smokebinary, so on the literalsystem.cue,run-cloud-interrupt-grant, and every other default cloudboot manifest the provider is never staged (is_staged()==false) and is inert. Theinterruptcap is granted through the unchanged productioninterrupt_grant_source_prod(no newKernelCapSourcearm, no proof-only grant-source replacement); that source delegates its cap tovirtio_net_polled_provider::build_cap_for_grantwhile the provider is staged, otherwise it keeps its admission-check-only skeleton. The always-builtdevice_manager::stubsubmit-admission preview and accepted path admit RX queue 0 and routeDMABuffer.submitDescriptortovirtio_net_polled_provider::attempt_rx_submitonly while staged.- Marker: emits a DISTINCT
cloudboot-evidence: virtio-net-polled-provider <token>headline (vs the proof’svirtio-net-userspace-provider) so the two manifests are distinguishable, adding the labelsprovider_build=always-built-default-kernel,provider_feature_gate=none, andgrant_source=production-despecializedto the polled-completion label set. Thecap::provider_nic_bind_proof::reportre-point landed (theprovider-nic-boundmarker above now fires from this provider’s real polled TX+RX completion viareport_real_completionon thecloud-provider-nic-bound-real-polled-driver-smokemanifest); the literalsystem.cuefold is not yet implemented. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-provider-virtio-net-polled-provider-defaulton the default non-qemukernel with nocloud_*_prooffeature. No GCE resources are created;live_cloud=not-attempted.
- Marker: emits a DISTINCT
-
Polled provider teardown / stale-authority (clean cap-op-release) – the always-built polled provider now carries an asserted S.11.2 teardown/stale-authority chain over its DMA + MMIO + IRQ authority on the clean cap-op-release path, not only the IRQ route. When the dedicated teardown manifest is booted (
run_initcallsvirtio_net_polled_provider::arm_teardown_reportbecause thecloud-provider-virtio-net-polled-teardown-smokebinary is declared), the provider’scomplete_after_release(kernel/src/cap/virtio_net_polled_provider.rs,run_teardown_assertions+emit_teardown_evidence) re-validates the brokered DMA + DeviceMmio authority the smoke released before theInterruptcap and emits one combinedcloudboot-evidence: virtio-net-polled-teardown <token>headline.- Mechanisms reused:
device_manager::validate_dmabuffer_record/validate_dmapool_record(stale DMA handle / stale pool-allocate rejected fail-closed),device_manager::last_bounce_page_release_evidence(the scrub-before-free / ledger-removed ordering stamped bydetach_dmabuffer_record_for_cap_release),device_manager::validate_devicemmio_recordoverdevicemmio_grant_source_prod::last_issued_handle_and_owner(the grantedDeviceMmiocap’s record is detached on release, so a stale access fails closed), and the inheriteddevice_interruptmasked-no-wake / reassign / stale-handle chain folded into the same marker. - Marker labels:
stale_dma_buffer_blocked=true,dma_page_scrubbed_before_free=true,dma_ledger_removed_after_scrub=true,stale_dma_pool_alloc_blocked=true,stale_mmio_blocked=true,mmio_handle_invalidated=true,masked_no_wake=true,reassign_generation_bumped=true,stale_token_wake_blocked=true,stale_route_handle_blocked=true,int_injected=0,host_physical_user_visible=0,direct_dma=blocked,iova_export=disabled-future-only. The marker is suppressed fail-closed if any leg regresses. The MMIO leg rides the grantedDeviceMmiocap; the provider’s ownpci::map_bar_regionBAR mapping stays boot-only with no kernel invalidation API by design. - Scope boundary: the clean cap-op-release marker carries
driver_death_teardown=not-attempted-this-slice. The process-exit-under-active-authority teardown trigger is its own proof (see the next entry); it crosses the process-lifecycle authority boundary, and the single-shot provider (oneInterruptcap, single-shotattempt_rx_submit) cannot drive both teardown paths in one boot, so it has its own manifest/boot. - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-provider-virtio-net-polled-teardownon the default non-qemukernel with nocloud_*_prooffeature.live_cloud=not-attempted.
- Mechanisms reused:
-
Polled provider DRIVER-DEATH / process-exit teardown – the always-built polled provider’s release-time teardown chain now also covers the process-exit-under-active-authority trigger. When the dedicated driver-death manifest is booted (
run_initcallsvirtio_net_polled_provider::arm_driver_death_reportbecause thecloud-provider-virtio-net-polled-driver-death-smokebinary is declared), the smoke drives the same real polled RX submit/wait/ack + RX read-back, scrubs+frees itsDMABuffer, and then exits while still holding itsDMAPool,DeviceMmio, andInterruptcaps. The kernel’sCapReleaseReason::ProcessExitcap-teardown reclaims all three in cap-table slot order (device_mmiothendmapoolbefore theinterruptcap), so the provider’scomplete_after_releaseprocess-exit arm (run_teardown_assertions+emit_driver_death_evidence) re-validates the now-stale DMA + DeviceMmio authority and runs the IRQ masked-no-wake / reassign / stale-handle chain over the route, emitting onecloudboot-evidence: virtio-net-polled-driver-death <token>headline.- Mechanisms reused: identical to the clean cap-op-release entry above, but
the DMAPool / DeviceMmio / Interrupt records are detached by the kernel’s
ProcessExitcap-teardown rather than explicitInterrupt.release/DMAPool.release/DeviceMmio.releasecalls. The runtime-allocatedDMABuffercap is torn down AFTER the manifest-grantedInterruptcap in slot order, so its page is scrubbed+freed by the smoke’sfreeBufferbefore exit (the normal DMABuffer lifecycle); the buffer is then re-validated stale with its scrub-before-free ordering intact. - Marker labels: same DMA / MMIO / IRQ / no-export discipline labels as the
clean-release marker, plus
driver_death_teardown=no-live-authorityandrelease_path=process-exit. The marker is suppressed fail-closed if any leg regresses (polled_completion_clean, the DMA/MMIO stale re-validation, or the IRQ chain). The clean-releasevirtio-net-polled-teardownheadline and the cap-op-release-gatedvirtio-net-polled-providerheadline both stay absent on this manifest (theInterruptcap is reclaimed viaProcessExit, not an explicit release). - QEMU-emulable vs hardware-only: fully QEMU-emulable. Proved locally by
make run-cloud-provider-virtio-net-polled-driver-deathon the default non-qemukernel with nocloud_*_prooffeature.live_cloud=not-attempted.
- Mechanisms reused: identical to the clean cap-op-release entry above, but
the DMAPool / DeviceMmio / Interrupt records are detached by the kernel’s
-
Provider-chain closeout: the parent
cloud-prod-virtio-net-userspace-provider-local-proofis closed by the decomposed child chain above and the legacy/transitional bind below. The local non-qemucloudboot/QEMU evidence now includes modern TX/RX userspace-provider proofs, the always-built polled provider, real-polled-driverprovider-nic-bound, clean-release/process-exit stale-authority proofs, and the legacy-polled path that later passed the real-GCEprovider-nic-boundgate. This closeout does not claim L4 socket/smoltcp relocation, literalsystem.cueprovider fold, reusable full-NIC/multiqueue readiness, or device-autonomous MSI-X delivery; those remain separate lanes.
4. Legacy / transitional virtio 0.9 PIO transport (cloud bind)
Everything above is the modern (virtio 1.x) transport: vendor capability
windows in MMIO BARs + MSI-X. Real GCE presents the NIC as a legacy /
transitional virtio 0.9 device instead (run 1780377997-281b, 2026-06-02):
PCI 1af4:1000, no modern vendor capability windows, no usable MMIO
memory BAR, legacy INTx, no MSI-X. The whole legacy virtio config block
lives in a PIO (I/O space) BAR0 register window, which the modern transport
discovery cannot represent, so the modern polled provider selects no candidate.
This section maps the legacy PIO transport subset the kernel implements to
bind that device. It has two parts: selection + brokered PIO config
(kernel/src/cap/virtio_net_legacy_select_proof.rs) and the
legacy single-PFN contiguous-queue polled TX/RX data path + provider-nic-bound
(kernel/src/cap/virtio_net_legacy_datapath_proof.rs); both are implemented and
proved locally.
- Spec basis: Virtual I/O Device (VIRTIO) legacy interface — the
pre-1.0 “Virtio PCI” I/O-BAR register layout (virtio 0.9.5 / the legacy
appendix of the OASIS 1.x spec, §4.1.4.8 “Legacy Interfaces”). Cross-checked
against QEMU
hw/virtio/virtio-pci.c(virtio_pci_config_*legacy I/O ops) and the Linuxvirtio_pci_legacydriver. - Implemented wire-format subset (no-MSI-X legacy I/O register block,
kernel/src/cap/virtio_net_legacy_select_proof.rsLEGACY_*offset constants): device features (0x00, u32 RO), guest/driver features (0x04, u32 RW), queue PFN (0x08, used by the data path), queue size (0x0c, u16 RO), queue select (0x0e, u16 RW), queue notify (0x10, u16 RW), device status (0x12, u8 RW), ISR status (0x13, u8 RO). The device-specific config (MAC, …) follows at0x14in the no-MSI-X layout (offsets shift by 4 only when MSI-X is enabled, which this polled path never does). Feature negotiation is the 32-bit legacy feature word:VIRTIO_F_VERSION_1(a high feature bit) is unrepresentable and absent, andVIRTIO_NET_F_MAC(1 << 5) is required and acknowledged. The legacy ring uses the single-PFN contiguous virtqueue layout (descriptor table + avail ring + padding toVIRTIO_PCI_VRING_ALIGN(4096) + used ring, addressed by one page-frame number written toLEGACY_QUEUE_PFNasphysical_address >> 12) — materialized by the data path (virtio_net_legacy_datapath_proof::materialize_queue). - Legacy single-PFN data path
(
kernel/src/cap/virtio_net_legacy_datapath_proof.rs): after the status + feature handshake, both virtqueues are materialized as single physically contiguous, page-aligned regions (frame::alloc_contiguous) whose descriptor/avail/used sub-addresses are computed from the contiguous base, so the in-ring desc/avail/used manipulation reuses the modern provider’s helpers (write_desc_slot_0,write_avail_*,poll_used_idx,read_used_ring_slot_0) unchanged — only the transport differs. The doorbell is a PIO write of the queue index toLEGACY_QUEUE_NOTIFY(no modern MMIO notify region). The virtio-net header is the 10-byte legacy header (noVIRTIO_NET_F_MRG_RXBUF; the modern path uses 12). The reset is polled with a bounded settle (real legacy hardware acknowledges reset asynchronously; QEMU clears synchronously) rather than a single-shot== 0. A real polled TX (queue 1) + RX (queue 0) completes by reading the used rings (no MSI-X route programmed, no interrupt taken or injected); the device is then reset clean and all DMA frames are freed. - Device-fixed queue size + contiguous-allocation bound
(
materialize_queue,MAX_LEGACY_QUEUE_SIZE,vring_layout): legacy virtio queue size is device-fixed and read-only (LEGACY_QUEUE_NUM,0x0c) — the driver cannot shrink it, so it must materialize whatever single-PFN vring the device advertises.materialize_queuereads that size and rejects a zero, non-power-of-two, or over-bound value cleanly (the vring layout requires a power-of-two size; the bound caps the contiguous allocation), then sizes the contiguous region viavring_layout: 256 → 3 pages, 1024 → 8 pages, and the live GCE Andromeda virtio-net’s 4096-entry queue → ~28 pages per queue.MAX_LEGACY_QUEUE_SIZEis the virtio spec maximum (32768, a power of two), so the bound admits any spec-legal device-fixed size — including GCE’s 4096 — while still failing closed above it; analloc_contiguousthat cannot satisfy the request fails closed (no panic) on the existingalloc … contiguous frames … failedarm. QEMU’s legacy virtio-net advertises 256 by default and caps queue size at 1024 (VIRTQUEUE_MAX_SIZE), and lockstx_queue_sizeat 256 for the non-vhost SLIRP device, so the largest local shape isrx_queue_size=1024(an 8-page RX vring, exercised bymake run-cloud-provider-nic-bound-legacy-large-queue); the exact 4096-entry materialization is only verifiable on real GCE (the billable live-GCE run). - GCE-viable RX stimulus / completion (
fill_dhcp_discover_legacy,read_device_mac,poll_rx_used_wall_clock): the TX stimulus is a broadcast DHCP DISCOVER sourced from the device’s real MAC (read from legacy device-config space at0x14), not the modern path’s ARP “who-has 10.0.2.2” from a hardcoded spoofed source. This is required for a real cloud NIC: GCE’s Andromeda SDN enforces MAC/IP anti-spoofing (egress from a non-assigned MAC/IP is dropped),10.0.2.2does not exist on the VPC, and no responder answers an ARP-for-the-gateway. A legitimately-sourced DHCP DISCOVER is answered by both QEMU SLIRP’s built-in DHCP server and the GCE SDN DHCP responder, giving a real device->host RX frame. The completion model is accept-any inbound frame (any non-empty frame with a readable EtherType satisfies RX, so an ambient gateway ARP/RA on GCE counts too), polled against a wall-clock budget (monotonic_nsdeadline, 5 s) rather than a fixed spin count sized for SLIRP’s instantaneous reply. Interrupts are masked during this boot-time proof, so the wall-clock budget relies on the TSC-calibrated clocksource (the QEMU and GCE case); a tick-derived clock is frozen here and a fixed iteration ceiling is the fail-closed backstop. The egress MAC is re-asserted non-zero / non-broadcast before the marker is emitted, and the marker token carries it (-srcmac.<12hex>). - Persistent legacy
Nic-cap runtime (virtio_net_legacy_datapath_proof::legacy_nic_runtime, kernel featurecloud_gce_legacy_virtio_webui_serving_proof): unlike the one-shot proofs, this runtime brings the legacy device up once at boot and keeps itDRIVER_OKfor the whole boot, backing the same typedNiccap methods the modern shim path serves (transmit @0,macAddress @2,linkStatus @3,receivePoll @4;receive @1fails closed). RX keeps a small posted buffer pool (RX_POOL_SIZEdescriptors, recycled in place after each copy-out);receivePollis non-blocking and compares the device-writtenused.idxagainst a consumed cursor (read_used_idx), so a frame burst that advances the index pastcursor + 1is drained one completion per call instead of being missed by an equality poll. TX publishes one frame at descriptor slot 0 and drains its completion with the same bounded advanced-past-cursor check; an unresolved or divergent completion (and any other ring-integrity violation on either queue) is a fatal error that tears the runtime down through a reset-confirmed fail-stop, after which every cap call fails closed. Frames cross the cap boundary as inlineDatawith the 10-byte legacy header added/stripped kernel-side; PIO, vring, and DMA-frame ownership stay kernel-side, and release quiesces the device (reset-confirmed before frames are freed).VIRTIO_NET_F_STATUSis not negotiated, solinkStatusreports assumed-up while the runtime is staged. This is the serving bridge the Phase C userspace network stack uses on the GCE NIC shape; proofmake run-cloud-gce-legacy-virtio-webui-serving(host HTTP peer fetches the remote-session Web UI bundle through QEMUhostfwdover this datapath). - capOS mapping (brokered PIO config access): capOS device authority
(
DeviceMmio/DDF) is MMIO/memory-BAR based and there is no I/O-port capability. The legacy config window stays kernel-owned: the only sanctioned path to a device’s legacy I/O BAR is the bounds-checkedpci::LegacyIoBaraccessor (kernel/src/pci.rs), reached throughpci::io_bar(device, bar). Every access is range-checked against the device’s claimed I/O BAR window and the 16-bit x86 port space, so a caller cannot reach a port outside the BAR; there is no ambientin/outauthority and no port-I/O surface exposed to userspace. PCI I/O decoding is enabled per device viapci::enable_io_space_and_bus_masterbefore any access. - Candidate gate (MSI-X not required): the legacy candidate selection
(
virtio_net_legacy_select_proof::pick_legacy_candidate) accepts a transitional virtio-net function (1af4:1000, network class) whose modern common-config window does not resolve and which exposes a usable I/O BAR0 — without requiring an MSI-X capability, because the polled data path does not depend on interrupt delivery. This is the deliberate relaxation of the modern gate (virtio_net_polled_provider::candidate_from_device, which requires both the modern transport and MSI-X). - Fail-closed rules: the brokered status handshake fails closed on any
out-of-window access, any device-status regression, a missing
VIRTIO_NET_F_MAC, a guest-feature write-back mismatch, or a zero queue-0 size. The data path additionally fails closed on a device-MAC read failure (out-of-window, all-zero, or broadcast), an out-of-window queue PFN/notify access, a PFN read-back mismatch, an advertised queue size that is zero or exceeds the materialization bound, a TX used-ring poll-budget exhaustion, an RX wall-clock-budget exhaustion, a used-ring id/len regression, a zero EtherType, or a final reset that does not settle to0x00. On any such failure noprovider-nic-boundmarker is emitted (the gate staysnull/fail-closed), and the device is reset and its DMA frames freed regardless of outcome. The completion is observed by polling the legacy used ring with no MSI-X route programmed and no interrupt taken or injected; theprovider-nic-boundmarker (provider_nic_bind_proof::report_real_completion_legacy) carries honesttransport=legacy-pio-virtio-0.9,interrupt_model=polled-no-msix, anduserspace_driver_authority=kernel-brokered-legacy-polledlabels, and nothing is exported to userspace (host_physical_user_visible=0,direct_dma=blocked). - QEMU-emulable vs hardware-only: the legacy shape is QEMU-emulable via
qemu-system-x86_64 -device virtio-net-pci,disable-modern=on,vectors=0(legacy I/O BAR0, INTx, no MSI-X — the faithful GCE shape). Proved locally bymake run-cloud-provider-virtio-net-legacy-selecton the default non-qemukernel: the kernel selects the legacy NIC over its I/O BAR0, runs the brokered device-status handshake + 32-bit feature read (observeddevice_features=0x79bf8064,VIRTIO_NET_F_MACset) + queue-0 size read (256), and emitscloudboot-evidence: virtio-net-legacy-candidate-selected <token>. The data path is proved bymake run-cloud-provider-nic-bound-legacyon the same device shape: the kernel reads the device’s real MAC (52:54:00:12:34:56under QEMU), materializes the legacy single-PFN virtqueues, TX-submits a broadcast DHCP DISCOVER from that MAC, and completes a real polled RX against the wall-clock budget (observedsrc_mac=52:54:00:12:34:56,rx_used_len=600,ethertype=0x0800— the SLIRP DHCP OFFER, an IPv4 frame —tx_used_idx=1,rx_used_idx=1,rx_clock_usable=true,final_status=0x00), then emits exactly onecloudboot-evidence: provider-nic-bound <token>sourced from that completion (token carries-ethertype.0800and-srcmac.525400123456). The DHCP-discover-from-real-MAC stimulus is the GCE-viable path: GCE’s Andromeda SDN drops egress from a spoofed source MAC/IP and has no10.0.2.2ARP responder, so the modern path’s spoofed ARP-to-SLIRP-gateway stimulus would time out on a real NIC; a real-MAC DHCP DISCOVER is answered by both SLIRP and the GCE SDN. The follow-up billable real-GCE runcloud-prod-gce-billable-boot-real-polled-nic-boundpassed on 2026-06-02 15:03 UTC through this legacy path: the live1af4:1000NIC bound at00:04.0, materialized the 4096-entry RX/TX vrings, transmitted DHCP DISCOVER from the device MAC, received a real IPv4 frame (rx_used_len=532,ethertype=0x0800), and emittedprovider-nic-boundfromreport_real_completion_legacy. This remains a bounded raw-frame bind proof, not L4 networking or a reusable userspace provider service.
Related
kernel/src/virtio.rs– PCI transport discovery, split-ring transport, feature negotiation, framing.kernel/src/cap/network.rs– accepted-socket cap state and the network capability surface.docs/proposals/networking-proposal.md– the userspace network-stack move (Phase C) and the transitional-kernel status.docs/dma-isolation-design.md– the DMA backend and isolation model the userspace successor binds into.
virtio-blk (modern PCI block device)
This is a provenance map for the in-tree virtio-blk driver: it cites the spec,
summarizes only the wire-format subset the code actually implements, and points
into the implementation. It is not a re-spec – where the spec is implemented
unchanged it links rather than transcribes. The driver was the first real
BlockDevice CapObject, so the treatment is a concise map rather than
exhaustive register tables. It reuses the modern split-ring transport seam
introduced for virtio-net (virtio-net); this page covers only
the block-specific additions.
Status: QEMU fixture, not the production storage route. The kernel-owned
virtio-blk driver, its BlockDevice cap arm, and its PCI discovery are all
gated behind the qemu cargo feature (diagnose_qemu_virtio_blk in
kernel/src/pci.rs; the BlockDeviceBackend::Virtio arm in
kernel/src/cap/block_device.rs). The default non-qemu production kernel never
enumerates, claims, or binds virtio-blk, and its block_device grant source
resolves to the userspace-brokered NVMe BlockDevice arm
(BlockDeviceBackend::NvmeBrokered) instead, failing closed when no verified
NVMe controller and live device_mmio grant are present. virtio-blk remains as a
named local fixture / regression test only – a fully QEMU-emulable end-to-end
BlockDevice proof and the substrate the storage-layer (read-only / persistent /
writable filesystem) QEMU proofs read through. It is not an ambiguous forward
production driver. The kernel broker responsibilities it exercises (PCI claim
arbitration, MMIO/IRQ/DMA admission, bounce/IOMMU isolation, stale-generation
rejection, and revocation) are the same ones the production userspace storage
driver binds into; see §3 capOS mapping.
The driver lives in the virtio-blk section of kernel/src/virtio.rs
(VirtioBlkDriver) and the cap surface in kernel/src/cap/block_device.rs
(BlockDeviceCap).
1. Spec basis
- Device: virtio block device, modern (virtio 1.x) PCI transport.
PCI vendor
0x1af4; device0x1042(modern) /0x1001(transitional). IDs atkernel/src/pci.rs(VIRTIO_VENDOR_ID,VIRTIO_BLK_MODERN_DEVICE_ID,VIRTIO_BLK_TRANSITIONAL_DEVICE_ID; matched byPciDevice::is_virtio_blk). Up todevice_dma::MAX_VIRTIO_BLK_DEVICESfunctions are bound, each in its own const-generic driver slot (VIRTIO_BLK_DRIVER_0/VIRTIO_BLK_DRIVER_1) so the two devices cannot alias DMA or queue state. The target disk is selected by manifest PCI identity; the ordinary boot/storage disk resolves to the non-target disk when both are present. - Authoritative spec: Virtual I/O Device (VIRTIO) Version 1.2, OASIS Committee Specification 01 (2022-07-01). Source: https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html. Relevant sections: 4.1 (virtio over PCI bus), 2.7 (split virtqueues), 5.2 (block device).
- Reference: cross-checked against the Linux
virtio_blkdriver for the request framing and thevirtio_pci_modernmodern-transport handshake.
2. Wire format (implemented subset)
The modern PCI capability parsing, common-config register map, split-ring
descriptor layout, and feature-negotiation handshake are the shared transport
seam documented in virtio-net §2
(kernel/src/virtio.rs transport module, ModernTransport, Virtqueue,
DescriptorTrackingSlot). Only the block-specific subset is summarized here.
- Feature negotiation: the driver requires and selects only
VIRTIO_F_VERSION_1(read_device_features/write_driver_featuresinVirtioBlkDriver::initialize); a device that does not offer it fails closed withBlkInitError::MissingRequiredFeatures. No block feature bits (read-only, multi-queue, discard, …) are negotiated, so the device is driven as a single read/write request queue. - Device config (capacity): the block device config space carries the
capacity in 512-byte sectors as a little-endian
u64(VIRTIO_BLK_CONFIG_CAPACITY_LEN= 8 bytes, read low/high ininitialize). A config region shorter than that, or a zero capacity, fails closed (BlkInitError::DeviceConfigTooSmall/ZeroCapacity). - Request queue: a single request virtqueue (queue 0). The negotiated size
is clamped to the largest power of two not exceeding both the device-advertised
COMMON_QUEUE_SIZEandVIRTIO_BLK_REQUEST_QUEUE_SIZE(8); a usable size below 4 (one request chain needs 3 descriptors) fails closed. Per-queue notify address is computed fromnotify_off_multiplierlike any modern virtio queue. - Request framing (
VirtioBlkDriver::issue_request): each request is a 3-descriptor chain over one bounce-buffer page (ChainSegment):- header –
VIRTIO_BLK_REQ_HEADER_LEN(16) bytes, device-readable:type(u32–VIRTIO_BLK_T_IN= 0 read /VIRTIO_BLK_T_OUT= 1 write), a reservedu32, and thesector(u64LBA), atVIRTIO_BLK_HEADER_OFFSET(0). - data –
512 * countbytes atVIRTIO_BLK_DATA_OFFSET(512), device-writable for reads, device-readable for writes. - status – 1 byte at
VIRTIO_BLK_STATUS_OFFSET(16), device-writable; pre-seeded withVIRTIO_BLK_STATUS_SENTINEL(0xff) and checked forVIRTIO_BLK_S_OK(0) after completion (BlockDeviceRequestError::DeviceStatusotherwise).
- header –
- Completion: QEMU completes virtio-blk requests synchronously, so the
driver notifies the queue and polls the used ring (
poll_used_within_ns, bounded by the real-timeVIRTIO_BLK_COMPLETION_BUDGET_NSbudget with theVIRTIO_BLK_COMPLETION_FALLBACK_SPIN_LIMITspin-count backstop when the monotonic clocksource is tick-derived) rather than waiting on the request MSI-X interrupt, which is claimed but left masked (see §3). The bound is time-based because the device side includes QEMU’s host file I/O, whose latency a raw spin count does not track.
3. capOS mapping
- Binding (qemu fixture, in-kernel): virtio-blk is driven in the kernel
and only under the
qemufeature. Unlike the userspace storage driver, it does not receiveDeviceMmio/Interrupt/DMAPoolcaps; insteadVirtioBlkDriver::initializebinds authority through the kerneldevice_managertransactions –claim_pci_function(.., DeviceOwner::VirtioBlk)thenattach_dmapool_record_with_remapping/attach_devicemmio_record/attach_interrupt_source. TheBlockDevicecap is the userspace-facing surface; the hardware authority stays kernel-owned. This in-kernel ownership is why the driver is kept as a qemu fixture rather than a production route: the productionBlockDeviceis served by the userspace-brokered NVMe provider chain (BlockDeviceBackend::NvmeBrokered, gated on a verified controller and a livedevice_mmiogrant), where the device-specific protocol logic runs in userspace overDeviceMmio/DMAPool/Interruptcaps and the kernel retains only broker/admission/isolation/revocation. - MMIO: the modern-transport common/notify/ISR/device-config regions are
mapped from the device BARs (
map_blk_regionoverpci::map_bar_region) and recorded withdevice_manager::attach_devicemmio_recordagainst the first decoded memory BAR. Doorbell (queue-notify) writes are scoped to the per-queue notify address computed fromnotify_off_multiplier. The DDFDeviceMmiocap (kernel/src/cap/device_mmio.rs) is the userspace successor surface. - Interrupt: one MSI-X route is registered for the request queue
(
VIRTIO_BLK_REQUEST_MSIX_ENTRY= 0,PciMsixInterruptRole::BlockRequestQueue), claimed (DeviceInterruptDriver::VirtioBlk) and attached to the device handle for authority binding, but left masked: completion is by polled used ring, not interrupt delivery. Route records are tracked by the kernel-owned device-interrupt ledger (kernel/src/device_interrupt.rs). - DMA: each bound device gets its own DMA pool (
device_dma::begin_virtio_blk_pool, keyed by the const-genericDEVindex viaVirtioBlkDma<DEV>). Ring pages and the request bounce buffer are allocated and accounted through the blk-keyed ledger (allocate_virtio_blk_page/register_virtio_blk_queue/record_virtio_blk_submission/..._completion_for_allocationinkernel/src/device_dma.rs). DMA uses the manager-owned bounce-buffer backend; no host physical address or IOVA is exposed to userspace – the request MSI-X route is kept masked specifically so no raw address leaves the kernel boundary. BlockDevicecap surface:BlockDeviceCap(kernel/src/cap/block_device.rs) is scoped to onedevice_indexand routes the schema’sreadBlocks/writeBlocks/info/flushmethods (schema/capos.capnpinterface BlockDevice) to that device only, failing closed when it is not bound. Under theqemufeature theblock_deviceKernelCapSourcereaches the resolved boot/storage virtio-blk disk, and theblock_device_targetsource requiresSystemConfig.blockDeviceTarget.pci(schema/capos.capnp) and resolves that PCI segment:bus:device.function selector to a bound non-boot virtio-blk device; absent, mismatched, or boot-disk selectors fail closed. In the production (non-qemu) kernel the sameblock_devicesource instead mints theNvmeBrokeredarm, andblock_device_targetfails closed (requires the qemu feature). The read-only/ persistent/writable filesystem and store caps (readonly_fs,persistent_store,writable_fs) layer their on-disk formats over whicheverBlockDevicebacks the boot/storage cap – the virtio-blk fixture underqemu, the brokered NVMe arm in production.- Fail-closed / validation rules:
VirtioBlkDriver::validate_rangerejects a zero count, a count overVIRTIO_BLK_MAX_SECTORS_PER_REQUEST(7 – bounded so header + status +512 * countfit one 4 KiB page),start_lba + countarithmetic overflow, and any range past the reportedcapacity_sectors, all before device access. The cap layer additionally enforces thatwriteBlocksdata length equalscount * 512(BlockDeviceRequestError::DataLengthMismatch). A non-OKdevice status, a used-ring poll timeout, or a DMA accounting failure each fail closed (DeviceStatus/Completion/Accounting). Descriptor reuse is generation-tracked through the shared bounded tracking-slot array. - QEMU-emulable vs hardware-only: fully QEMU-emulable, and these are the
fixture gates. QEMU provides virtio-blk-pci;
make run-virtio-blkis the single-device end-to-endBlockDevicefixture,make run-multi-virtio-blkproves the two-device (boot + target) binding with independent per-device DMA pools,make run-blockdevice-target-identityproves manifest identity selection when PCI/BDF order would otherwise bind the intended target first, andmake run-virtio-blk-failoverexercises the multi-device failover path. All are--features qemufixtures over dedicatedsystem-virtio-blk.cue/system-multi-virtio-blk.cue/system-blockdevice-target-identity.cuemanifests, not production-storage evidence. No hardware-only path. The production-storage gate is the userspace-brokered NVMeBlockDevicechain (make run-cloud-provider-nvme-blockdevice-read-graduatedand the otherrun-cloud-provider-nvme-blockdevice-*proofs).
Related
kernel/src/virtio.rs– the virtio-blk driver (VirtioBlkDriver), request framing, queue setup, and the shared modern split-ring transport.kernel/src/cap/block_device.rs– theBlockDevicecap surface (BlockDeviceCap) routing schema methods to a single bound device.kernel/src/device_dma.rs– the per-device virtio-blk DMA pool/queue ledger.kernel/src/device_interrupt.rs– the request-queue MSI-X route record.schema/capos.capnp(interface BlockDevice) – thereadBlocks/writeBlocks/info/flushcontract.docs/dma-isolation-design.md– the DMA backend and isolation model the userspace successor binds into.
FAT32 (read-only filesystem backer)
This is a provenance map for the read-only FAT32 Directory/File backer, part
of the real-filesystem role-split
(docs/proposals/real-filesystem-decision.md). It is a filesystem-format
reader layered over a block device, not a hardware device page; like
atapi-iso9660.md it documents an on-disk format and the capOS cap surface over
it, citing the spec and the vendored parser rather than re-specifying FAT.
The backer lives in kernel/src/cap/fat_fs.rs and reads through the vendored
fatfs no_std crate (vendor/fatfs-no_std/). Its sector reads go through a
BlockSource seam with two mutually-exclusive variants (mirroring
readonly_fs.rs): a Virtio arm (compiled under storage_fat_read,
reading the kernel-owned virtio-blk device) and an Nvme arm (compiled
under cloud_fat_read_over_nvme_proof, reading a cloud-attached NVMe namespace
through the always-built brokered read window op). The module compiles under
either feature.
1. Spec basis
- Format: FAT32, the File Allocation Table filesystem as standardized by Microsoft’s FAT: General Overview of On-Disk Format (the FAT32 File System Specification, v1.03) and the EFI FAT specification. The relevant structures are the BIOS Parameter Block (BPB) in the boot sector, the FAT32 FSInfo sector, the File Allocation Table itself (the cluster chain), and the directory-entry records (8.3 short entries plus VFAT long-file-name entries).
- Parser provenance: full FAT parsing is delegated to the vendored
fatfscrate (vendor/fatfs-no_std/rust-fatfs-0.4.0, upstream rafalh/rust-fatfs, commit pinned invendor/fatfs-no_std/VENDORED_FROM.md, MIT). capOS supplies the block-backed storage adapter, the cap surface above it, and an independent bounded validation subset over the BPB, root chain, root entries, and root-level file FAT chains before exposing a root cap. - Already a boot-path format: FAT32 is the EFI System Partition format
Limine reads, so it is structurally part of the boot path already
(
docs/backlog/hardware-boot-storage.md); this backer is the first capOS reader of a host-authored FAT32 image.
2. Wire format (implemented subset)
Only the read path is exercised; FAT write (cluster allocation, FSInfo/FAT mutation, directory-entry creation) is out of scope and fails closed.
- Mount (
fat_fs::mount_fatfs->fatfs::FileSystem::new): capOS first performs a bounded FAT32 preflight over the boot sector / BPB and primary FAT: 512-byte sectors, FAT32 geometry, root cluster in range, bounded root directory chain, and bounded root-level file chains. It then letsfatfsmount and assertsFileSystem::fat_type() == FatType::Fat32, failing the grant closed for FAT12/FAT16 or malformed images. The mount performs no writes (it reads the boot sector + FSInfo only); a cleanmkfs.fatimage keeps the BPB dirty flag clear, so thefatfsDrop/unmount path performs no writes either. The virtio arm (fat_fs::mount_root) mounts eagerly at grant time; the NVMe arm (fat_fs::mount_root_nvme) defers the mount to the firstDirectory.list/open(theFatMount::Deferred->Readytransition inFatMount::ensure), because the brokered NVMe controller is brought up by the userspace provider after the grant resolves – mirroringreadonly_fs::mount_root_nvme. - Directory listing (
Directory.list @1):fatfs::Dir::iter()walks the root directory entries; capOS copies each entry’sfile_name()(LFN or case-normalized 8.3),len(), andis_dir()into theDirEntryreply (FatFsDirectoryCap::collect_entries). The volume-label entry is skipped by the iterator. capOS bounds the exposed root toMAX_DIRECTORY_ENTRIES(64) visible entries. - Open (
Directory.open @0): resolves a root-level file name to its size and raw FAT directory-entry timestamp metadata (FatFsDirectoryCap::lookup_file_metadata, rejecting directories and missing names) and mints aFilecap recording the name, size, and bounded timestamp metadata. The write-implyingCREATE/TRUNCATEflag bits, nested (/-bearing) paths, and files larger thanMAX_FILE_BYTES(64 KiB) are rejected. - Read (
File.read @0): re-opens the file by name through the shared locked mount,seeks to the requested offset, and reads up tolengthbytes viafatfs::Read, walking the cluster chain. The covering read is clamped to end-of-file.fatfsresolves the FAT cluster chain; a multi-cluster file exercises the chain walk, not just the root entry. capOS preflight bounds the root-level file chain length and rejects cycles/bad/out-of-range cluster values before exposing the root cap. - Stat (
File.stat @2): reports the file size plus FAT directory-entrycreated/modifiedtimestamps when the FAT fields are valid. capOS converts the FAT date/time fields to Unix epoch nanoseconds by interpreting the timezone-free FAT local-time value as UTC for this bounded local proof. Missing, zero, or invalid FAT date/time fields map to0, the schema’s unstamped/unsupported value, rather than inventing trusted time. TheFile.statABI remains schema-stable and carries timestamp values only; proof logs label the source asmetadata_provenance=fat-directory-entry,clock_provenance=none, andtrusted_clock=false. FAT modification time has two-second granularity; FAT creation time has an optional high-resolution byte (0..=199in10msunits), and out-of-range high-resolution values are rejected before timestamp conversion. - Not implemented: every mutation (
Directory.mkdir/remove/sub/create/rename,File.write/truncate/sync) returns a typed error at the cap layer; the FAT/FSInfo/long-name write paths infatfsare never reached.
3. capOS mapping
- Binding (kernel-owned read; behind
read_only_fs_root): the FAT32 rootDirectoryis granted through the existingKernelCapSource::ReadOnlyFsRootsource. Under a plainqemubuild that source resolves to the capOS-authoredCAPOSRO1backer (cap::readonly_fs); understorage_fat_readit resolves to this FAT backer’s virtio arm (cap::fat_fs::mount_root); undercloud_fat_read_over_nvme_proofit resolves to the FAT backer’s NVMe arm (cap::fat_fs::mount_root_nvme, bound to the livedevice_mmiohandle the production grant source staged – thedevice_mmiogrant must precederead_only_fs_rootin the manifest). All three are wired in theKernelCapSource::ReadOnlyFsRootarms ofkernel/src/cap/mod.rs(boot PID 1) andkernel/src/cap/process_spawner.rs(spawned services). Selecting the backer by feature mirrors how the same source already selects itsVirtiovs NVMe backend, so no newKernelCapSourceand noschema/capos.capnpchange is needed – theDirectory/Filecontract already carries every field. Directory/Filecap surface:FatFsDirectoryCap/FatFsFileCapimplementDirectory.list/open+File.read/stat/close; every mutating method fails closed. Read-only is structural (distinctCapObjecttypes that expose no mutation), not a rights flag.openmints theFileresult capCopy/SameSession, so a holder can forward a read view to a same-session spawn child without conferring write authority (the same posturereadonly_fs/installable_imageuse).- MMIO / Interrupt: none authored by the backer. It holds no device registers
and binds no interrupt. The virtio arm reads sectors through the kernel-owned
virtio-blk
BlockDevicefree functions (crate::virtio::block_device_info/block_device_read_blocks/block_device_max_sectors_per_request); the NVMe arm reads through the always-built brokered read window op (device_manager::nvme_brokered_io_sync_read_window_op_for_cap) bound to the granteddevice_mmiohandle/owner – the same read arm the NVMeBlockDevicegraduation andreadonly_fs.rs’s NVMe arm use. The NVMe controller bring-up (reset/enable/IDENTIFY/CREATE I/O queue) is driven by the userspace provider and the shared cap-waiterInterruptroute, not by this backer. - DMA: no new DMA surface. The storage adapter (
fat_fs::BlockStorage) translates thefatfsbyte cursor to whole sectors and reads through the activeBlockSourcearm’s bounce-read path: the virtio bounce path, or the NVMe brokered window op (bounded fail-closed to one 4 KiB PRP1 page per read, with a manager-owned bounce page whose PRP1 never reaches userspace). There is no host-physical/IOVA export on either arm. - Fail-closed / validation rules: capOS does not treat
fatfsas a hostile FAT validator. The wrapper performs its own bounded preflight before granting or lazily completing the NVMe mount: BPB/device geometry must fit the active medium; the root directory FAT chain must end within the root byte budget and be cycle-free; the visible root entry count is capped at 64; each root-level regular file must be at mostMAX_FILE_BYTES(64 KiB), and its FAT chain must end within that bounded file budget without cycles, bad clusters, or out-of-range cluster values. The storage adapter also clamps every read to the device byte capacity (BlockStorage::read); the virtio arm’sblock_device_read_blocksand the NVMe arm’s window op each range-validate the LBA against the device/namespace geometry before issuing the request. The virtio arm queries the live virtio-blk geometry at construction; the NVMe arm’s capacity comes from the IDENTIFY Namespace claim (device_manager::nvme_namespace_geometry_for_cap, NSZE + active-LBA-format size) – the same IDENTIFY-derived geometry thereadonly_fs/persistent_store/writable_fsNVMe arms consult – and the deferred mount fails closed while that claim is unavailable or reports a non-512-byte active LBA format. The adapter’sWriteimpl always errors. - QEMU-emulable vs hardware-only: fully QEMU-emulable on both arms. The host
image is built with real
mkfs.fat+mcopy(tools/mkstorage-fat-read-image.py, a 64 MiB FAT32 image, >= 2 files, one multi-cluster, with deterministic FAT directory-entry timestamps on the known files). The virtio arm attaches it as a virtio-blk disk (make run-storage-fat-read); the combined timestamp/provenance proof runs that virtio arm plus the NVMe arm throughmake run-storage-fat32-timestamp-provenance. The NVMe arm attaches the same image as a pre-populated-device nvmenamespace (make run-cloud-provider-fat-read-over-nvme), reading the multi-cluster file back throughDirectory.open->File.readover the NVMeBlockSourceand asserting the round-tripped bytes,File.stattimestamp values and provenance proof lines, plus the fail-closed mutations. No hardware-only path.
Related
kernel/src/cap/fat_fs.rs– the FAT32Directory/Filebacker, theBlockSourceseam (Virtio/Nvme), theBlockStorageadapter, the deferredFatMount, and themount_root/mount_root_nvmegrant entry points.kernel/src/cap/fat_read_over_nvme_proof.rs– the NVMe-arm cap-waiterInterruptroute + headline marker (provider-fat-read-over-nvme) for the NVMe proof.vendor/fatfs-no_std/– the vendoredfatfsno_std read parser and itsVENDORED_FROM.mdprovenance.kernel/src/cap/readonly_fs.rs– theCAPOSRO1backer theread_only_fs_rootsource resolves to under a plainqemubuild, and theBlockSourcepattern the FAT NVMe arm mirrors.docs/proposals/real-filesystem-decision.md– the role-split decision and the phased plan this read-only FAT32 backer is part of.
virtio-rng (modern PCI entropy device)
This is a provenance map for the in-tree virtio-rng path: it cites the spec, summarizes only the wire-format subset the code actually implements, and points into the implementation. It is not a re-spec – where the spec is implemented unchanged it links rather than transcribes.
Unlike virtio-net and virtio-blk, the
virtio-rng device does not back a userspace-facing capability. It is a
QEMU-only proof fixture, not a production driver, and not forward DDF
production evidence: the entropy the device produces is consumed only by
in-kernel proofs, never handed to a process. The capOS EntropySource
capability is a separate, RDRAND-backed path
(kernel/src/cap/entropy_source.rs, fill_random / rdrand64 / has_rdrand;
per-call bound MAX_ENTROPY_FILL_BYTES) and does not touch this device. This
classification is asserted, not just documented: on every cfg(qemu) boot
diagnose_qemu_virtio_rng emits a deterministic marker
(virtio-rng: classification=qemu-only-proof-fixture userspace_capability=none production_driver=no ...) that make run-iommu-remapping
(tools/qemu-iommu-remapping-smoke.sh) requires, so a regression that promoted
this path into a production-driver claim would fail the smoke. virtio-rng exists
in the tree for two reasons:
- A DDF metadata-diagnostics path that exercises modern-transport
discovery, MSI-X metadata selection, and the device-manager
ownership/teardown/grant-source hooks against a real PCI function on every
cfg(qemu)boot (kernel/src/virtio.rsdiagnose_virtio_rng_metadata, driven fromkernel/src/pci.rsdiagnose_qemu_virtio_rng). - An IOMMU VT-d second-level remapping hardware-DMA proof vehicle (the
Slice A2/B/C proofs in
kernel/src/iommu.rs, driven throughkernel/src/virtio.rsprove_iommu_rng_mapped_dma/prove_iommu_rng_unmapped_dma/prove_iommu_rng_stale_dma). This is the minimal real virtqueue driver QEMU’s entropy device lets us stand up to prove a device DMA actually walks the programmed translation tables.
It reuses the modern split-ring transport seam introduced for virtio-net
(virtio-net); this page covers only the rng-specific usage.
1. Spec basis
- Device: virtio entropy device, modern (virtio 1.x) PCI transport.
PCI vendor
0x1af4; device0x1044(modern) /0x1005(transitional). IDs atkernel/src/pci.rs(VIRTIO_VENDOR_ID,VIRTIO_RNG_MODERN_DEVICE_ID,VIRTIO_RNG_TRANSITIONAL_DEVICE_ID; matched byPciDevice::is_virtio_rng). QEMU exposes it asvirtio-rng-pci-non-transitional(see §3). - Authoritative spec: Virtual I/O Device (VIRTIO) Version 1.2, OASIS Committee Specification 01 (2022-07-01). Source: https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html. Relevant sections: 4.1 (virtio over PCI bus), 2.7 (split virtqueues), 5.4 (entropy device).
- Reference: cross-checked against the Linux
virtio_rngdriver for the single-request-queue model and thevirtio_pci_modernmodern-transport handshake.
2. Wire format (implemented subset)
The modern PCI capability parsing, common-config register map, split-ring
descriptor layout, and feature-negotiation handshake are the shared transport
seam documented in virtio-net §2
(kernel/src/virtio.rs transport module, ModernTransport, the COMMON_*
register offsets, VIRTQ_DESC_F_WRITE). The rng path discovers that transport
through discover_virtio_rng_metadata_transport and maps regions with
map_region. Only the rng-specific subset is summarized here.
- Device shape: the transport discovery reports whether the function is a
modern device id or a transitional id that still exposes modern capabilities
(
DeviceShape::Modern/DeviceShape::TransitionalWithModernCaps); both are driven through the modern path. The entropy device has no device-specific config space and no device-specific feature bits. - Single request queue: the entropy device exposes one virtqueue, the
requestq(VIRTIO_RNG_REQUEST_QUEUE= queue 0). The IOMMU proof drives it at a deliberately smallVIRTIO_RNG_PROOF_QUEUE_SIZE(2) – a single in-flight descriptor is enough to prove a DMA through translation, and a power of two keeps the ring layout legal. The per-queue notify address is computed fromnotify_off_multiplierlike any modern virtio queue. - Request framing: each request is a single device-writable descriptor
(
VIRTQ_DESC_F_WRITE) pointing at a buffer the device fills with entropy. The proof requestsVIRTIO_RNG_PROOF_REQUEST_LEN(64) bytes (rng_publish_descriptor_and_notifywrites the 16-byte descriptor and bumps the available ring; completion is read from the used ring’s{ id:u32, len:u32 }entry). There is no request header or status byte – the entropy device just writes bytes into the supplied buffer. - Feature negotiation: virtio-rng offers no device-specific features. The
metadata path negotiates nothing; the IOMMU hardware-DMA proof requires both
VIRTIO_F_VERSION_1(modern transport) andVIRTIO_F_ACCESS_PLATFORM– the latter is what makes QEMU route the device’s DMA through the platform IOMMU and consume the IOVAs the driver programs into the ring registers, rather than treating them as host-physical addresses. A device that does not offer both fails the proof closed (rng-missing-access-platform-feature). - Completion: the proof polls the used ring (
hhdm_read_u16ofused.idx, bounded byVIRTIO_RNG_USED_POLL_LIMIT) rather than waiting on the request interrupt; the MSI-X path is exercised at the metadata level only (see §3).
3. capOS mapping
- Binding (transitional, in-kernel, no userspace cap): virtio-rng is driven
entirely in the kernel and is not exposed to userspace at all – there is
no
RandomNumberGenerator/EntropySource-style cap routed to this device. The metadata-diagnostics path runs on everycfg(qemu)boot fromkernel/src/pci.rsdiagnose_qemu_virtio_rng; the hardware-DMA proofs run under therun-iommu-remappingtarget only. - Device-manager authority (metadata path):
diagnose_virtio_rng_metadatabinds authority through the kerneldevice_manageragainstDeviceOwner::VirtioRng– it proves QEMU ownership (prove_qemu_ownership), teardown triggers, and theDeviceMmio/DMAPool/DMABuffercap release / driver-crash / reset-disable hooks, then logs thedevicemmio/dmapool/interruptgrant-source status (devicemmio_grant_source::log_statusand thedmapool/interruptequivalents). This is the same DDF ledger the cloud-NIC and block drivers bind through; virtio-rng is the function the bring-up hooks are proved against. - MMIO: the modern-transport common/notify/ISR/device-config regions are
mapped from the device BARs (
map_regionoverpci::map_bar_region) into the device-uncacheable (NO_CACHE) window; the metadata path additionally logs each decoded region (log_device_region). Doorbell (queue-notify) writes are scoped to the per-queue notify address computed fromnotify_off_multiplier. - Interrupt: MSI-X is handled at the metadata level – the request queue
uses
VIRTIO_RNG_MSIX_METADATA_ENTRY(0) and requiresVIRTIO_RNG_MSIX_REQUIRED_ENTRIES(1) usable table entries; the plan is selected byselect_virtio_rng_msix_planand the route programming is proved byprove_virtio_rng_msix_metadata_route. The hardware-DMA proof completes by polling the used ring, so it does not arm a completion-IRQ waiter. - DMA: the IOMMU proof’s descriptor table, available ring, used ring, and
request buffer are placed at programmed IOVAs carried in the
iommu::IommuRngDmaVehicle, never at host-physical addresses; onceGCMD.TEis set every DMA the device issues must walk the second-level table the IOMMU module installed. The ring pages are zeroed through the HHDM before their IOVAs are handed to the device so a stale reading can never be mistaken for a completion. No host physical address or IOVA leaves the kernel boundary. - Fail-closed / validation rules: the proof fails closed at every step –
transport discovery, bus-master enable, MMIO map, reset handshake, the
required-feature check, notify-offset/map-length overflow, queue-size floor,
and queue-enable rejection each return a distinct
failed(...)reason rather than proceeding. A page whose invalidation never completes is not freed (a page freed before invalidation completes would be a stale-DMA hole). The unmapped-IOVA and stale-IOVA re-drives must fault in the IOMMU (FSTS.PPF/FRCD[0].F) instead of reaching memory. - QEMU-emulable vs hardware-only: fully QEMU-emulable. QEMU provides
virtio-rng-pci-non-transitional(the sharedQEMU_SECOND_DEVICEdefault);make run-iommu-remappingoverrides it withiommu_platform=onbehind anintel-iommudevice and is the end-to-end proof of the mapped-IOVA hardware DMA, the unmapped-IOVA fault, and the Slice C two-phase revocation / stale-DMA fault. The DDF metadata diagnostics emit on everycfg(qemu)boot. No hardware-only path.
Related
kernel/src/virtio.rs– the rng metadata diagnostics (diagnose_virtio_rng_metadata), the IOMMU hardware-DMA proof driver (prove_iommu_rng_mapped_dma/prove_iommu_rng_unmapped_dma/prove_iommu_rng_stale_dma), and the shared modern split-ring transport.kernel/src/iommu.rs– the VT-d Slice A2/B/C remapping, fault, and revocation proofs that drive this device.kernel/src/cap/entropy_source.rs– the separate RDRAND-backedEntropySourcecapability (this device backs no capability).docs/dma-isolation-design.md– the DMA backend and isolation model the IOMMU remapping proofs validate.
NVMe (NVM Express controller)
This is a provenance map for the NVMe controller wire subset the kernel Model B on-notify DMA validator scans on the doorbell/queue-arm path. It cites the spec basis, summarizes only the register and descriptor fields the validator actually reads, and points into the implementation by symbol name. It is not a re-spec.
Maturity caveat. This page documents the DMA validator mechanism, a
brokered no-IOMMU bring-up through one bounded I/O read on the local QEMU
make run-pci-nvme gate, and one bounded live-GCE Persistent Disk proof for the
production provider-nvme-io-read path. It is still not a general
production NVMe driver, not broad GCP/AWS/Azure storage readiness, and not a
provider-visible address or direct-DMA claim. It also records the 2026-05-27
correction: on the current no-IOMMU gate, provider-written queue-base or PRP
addresses would be host physical addresses, so the live no-IOMMU path must be
brokered by the kernel/device manager unless a verified IOMMU/vIOMMU or
synthetic address namespace is added. The capabilities implemented against
make run-pci-nvme and the later production cloudboot gates:
-
nvme-doorbell-dma-validatoris the kernel on-notify scan (kernel/src/cap/nvme_doorbell_validator.rs); it proves its invariants with acfg(qemu)self-test (prove_qemu_on_notify_scan_contract) using synthetic owner windows in place of a live grant ledger. -
nvme-bind-claimed-mmio-readadds the read-only userspace bind (§4): the kernel claims the enumerated controller, preseeds its BAR0 controller-register page, and stages theDMAPool/DeviceMmio/Interruptbootstrap grant sources against it, and the userspacenvme-bringup-smokeprovider readsCAP/VS/CC/CSTSthrough the brokered claim, proving the claim reaches a coherent NVMe BAR0 (liveCAP, validVSversion). The controller is firmware-initialized under SeaBIOS NVMe boot-probe (CC.EN=1,CSTS.RDY=1), so the provider reports the observed enable/ready state rather than asserting reset. -
nvme-controller-reset-selected-writeadds the userspace controller reset (§5): theDeviceMmiogrant now carries a reset-only NVMe controller-register selected-write claim scoped toCC, and the provider drives the firmware-enabled controller to a known reset state (CC.EN=0→CSTS.RDY=0). This is the first genuine userspace NVMe controller-register write. -
nvme-no-iommu-brokered-controller-enableadds the brokered no-IOMMU enable (§6): the kernel authorsAQA/ASQ/ACQfrom the live DMA ledger and performs theCC.ENwrite, reachingCSTS.RDY=1without exposing a queue-base address. -
nvme-admin-queue-identify+nvme-admin-interrupt-deliveryadd the brokered admin SQ/CQ doorbell and one interrupt-drivenIDENTIFY(§7). -
nvme-io-queue-and-readadds the brokered I/O queue pair and one boundedREAD(§8) – the last piece of the userspace storage-provider foundation. -
cloud-prod-nvme-userspace-provider-readonly-bind-local-proofports the same-BDF read-only bind shape onto the non-qemucloudboot kernel under thecloud_nvme_readonly_bind_proofCargo feature. The feature constrains the three production grant sources (devicemmio_grant_source_prod,dmapool_grant_source_prod,interrupt_grant_source_prod) to the NVMe function (class0x01subclass0x08); the userspacecloud-nvme-readonly-bind-smokeprovider receives a same-BDFDeviceMmio/DMAPool/Interruptbundle, readsCAP_LO/CAP_HI/VS/CC/CSTSvia brokeredDeviceMmio.read32, releases the three caps, and asserts stale-handle rejection on each. Proof:make run-cloud-provider-nvme-readonly-bind. NoCC.ENwrite, no admin or I/O queue, noIDENTIFY, noInterrupt.wait, no DMA, no live cloud. -
cloud-prod-nvme-controller-reset-selected-write-local-prooflayers the reset-onlyCCselected-write authority on the same-BDF bundle under thecloud_nvme_controller_reset_proofCargo feature (which impliescloud_nvme_readonly_bind_proof, so the picker constraints are inherited). The kernel admits exactly one brokeredDeviceMmio.write32shape throughkernel::device_manager::stub::write_devicemmio_u32: aCCwrite (offset0x14) whoseCC.EN(bit 0) is cleared. ACCwrite that setsCC.ENfails closed withdevicemmio-nvme-cc-enable-deferred, a write to any non-CCoffset fails closed withdevicemmio-write32-register-unclaimed, and an out-of-range or unaligned offset fails closed at the range validator, all before any volatile MMIO write touches the BAR. The userspacecloud-nvme-controller-reset-smokeprovider receives the same-BDF bundle, readsCAP_LO/CAP_HI/VS/CC/CSTS, exercises the two fail-closed write probes (CC.EN=1, non-CCoffset0x18), performs the admittedCC.EN=0reset write, pollsCSTSuntilCSTS.RDY=0, re-readsCCto assertCC.EN=0, releases the three caps, and confirms stale-handle rejection. Proof:make run-cloud-provider-nvme-controller-reset. NoCC.EN=1write, no admin or I/O queue, noIDENTIFY, noInterrupt.wait, no DMA, no live cloud. -
cloud-prod-nvme-admin-queue-materialization-local-proofmaterializes the admin SQ and admin CQ backing buffers on the same-BDF bundle under thecloud_nvme_admin_queue_materialization_proofCargo feature (which impliescloud_nvme_controller_reset_proof, which impliescloud_nvme_readonly_bind_proof, so the picker constraints and the reset-onlyCC.EN=0claim are inherited). No new kernel admission surface is added: the productiondevice_manager::stubalready supports manager-owned bounce-buffer allocation throughstage_bounce_buffer_dmapool_record+issue_manager_attached_dmabuffer_handle_with_request(a fresh zeroed-on-alloc kernel frame per buffer), scrub-before-frame-free ondetach_dmabuffer_record_for_cap_release, and stale-handle rejection on the parked-slot ledger. The userspacecloud-nvme-admin-queue-materialization-smokeprovider receives the same-BDF bundle, sequentially materializes the admin SQ backing buffer and the admin CQ backing buffer through the brokeredDMAPool.allocateBuffer+DMABuffer.{info,map,unmap,freeBuffer}path (assertinguserspace_dma_buffer=manager-issued-bounce-buffer,iova_export=disabled-future-only,host_physical_user_visible=false, anddevice_iova=0on each), writes and reads back a deterministic 256-byte template through the userspace VMA, asserts the freshly-allocated admin CQ frame reads back as zero before the write (scrub-before-reuse, paired with the admin SQ’s scrub-before-frame-free), confirms post-freeDMABuffer.mapfail-closed on the stale handle, emits onecloudboot-evidence: provider-nvme-admin-queue-materialization <token>marker recording both manager-owned pool/buffer slot/generation identities and the discipline labels, releases the three bundle caps, and confirms stale-handle rejection on each. Proof:make run-cloud-provider-nvme-admin-queue-materialization. No NVMe controller register WRITE on this path (the kernel still admits the reset-onlyCC.EN=0claim from the controller-reset sibling, but this smoke never callsDeviceMmio.write32), noAQA/ASQ/ACQpublication, noCC.EN=1, no I/O queue allocation, noIDENTIFY, no PRP/SGL publication, no doorbell write, noInterrupt.wait/Interrupt.acknowledge, no host-physical or IOVA export, no live cloud. -
cloud-prod-nvme-brokered-controller-enable-local-proofenables the controller through manager-authoredAQA/ASQ/ACQplus a provider-suppliedCC.EN=1write under thecloud_nvme_controller_enable_proofCargo feature (which implies the three earlier features). The productiondevice_manager::stubparked-pool slot holds two simultaneously-live bounce-bufferDMABuffers (PARKED_DMAPOOL_LIVE_BUFFER_CAPACITY = 2,PARKED_DMABUFFER_SLOTS = [1, 2]) so the admin SQ and admin CQ can stay parked together; the bounce-buffer grant proof and the virtio-net live-publish proof keep their existing single-buffer behavior (slot 0). The provider-suppliedCC.EN=1write of this path is superseded bycloud-prod-nvme-controller-enable-manager-op-remediationbelow, which makes rawDeviceMmio.write32(CC, value with CC.EN=1)fail closed before any MMIO side effect and exposes controller enable only through the no-parameterDeviceMmio.brokeredNvmeControllerEnableverb (schema@6). The parked-pool slot capacity, the[1, 2]slot ids, theAQAdepth policy, and the four MMIO writes the manager authors all carry over unchanged. -
cloud-prod-nvme-controller-enable-manager-op-remediationcorrects the brokered enable contract. RawDeviceMmio.write32(CC, value with CC.EN=1)now fails closed withauthority_result=devicemmio-nvme-cc-enable-raw-blocked / authority_reason=cc-enable-requires-broker-nvme-controller-enable-opbefore any volatile MMIO side effect. Controller enable is reachable only through the new no-parameterDeviceMmio.brokeredNvmeControllerEnableverb (schema@6), which carries nooffset,value, queue address, queue id, PRP/SGL, or provider-selected controller-bit parameter. The verb routes to the renamed manager-authorednvme_brokered_controller_enable_op_for_capinkernel/src/device_manager/stub.rs. The manager: (1) validates the cap’s BAR matches the parked region and covers theCC/AQA/ASQ/ACQregister span; (2) resolves the two parked admin queue DMABuffers (slot order: SQ then CQ) and requires both to be live, unmapped, and frame-aligned; (3) selects every controller bit internally –CC.EN | IOSQES=6 | IOCQES=4(NVMe Base Spec §3.1.5); (4) authorsAQA = ((depth-1)<<16) | (depth-1)with depth8,ASQlow/high from the admin SQ buffer’s phys address,ACQlow/high from the admin CQ buffer’s phys address through the boot-preseeded BAR0 kernel mapping; and (5) performs the manager-selectedCC.EN=1write. The provider supplies no parameters and never observes a host-physical / device-visible queue-base address. The cap dispatch admission carriesauthority_result=ok,register_write=performed,side_effect=mmio-write-performed,cc_en_write_performed=true,aqa_authored=true,asq_authored=true,acq_authored=true, andqueue_base_source=manager-ledger. The kernel diagnostic line is nownvme: brokered-enable owner=cloud-nvme model=cloud-bounce validator=none trigger=manager-op admin_sq_slot=1 admin_cq_slot=2 aqa=0x00070007 cc=0x... asq_authored=true acq_authored=true cc_en_write=performed cc_bits_selected_by=manager queue_base_source=manager-ledger host_physical_user_visible=false proof_result=ok;trigger=manager-opproves the admission entered through@6, not through a rawCCwrite32. Thecloud-nvme-controller-enable-smokeprovider proves both the new fail-closed rawCC.EN=1probe and the manager-op enable, and the headlinecloudboot-evidence: provider-nvme-controller-enable <token>marker pinsbrokered_enable_trigger=manager-opandcc_raw_enable_write=refused. Proof:make run-cloud-provider-nvme-controller-enable. NoIDENTIFY, admin or I/O queue command, PRP/SGL publication, doorbell write,Interrupt.wait/Interrupt.acknowledge, host-physical or IOVA export, or live cloud is claimed. -
cloud-prod-nvme-admin-identify-manager-op-local-proofextends the corrected controller-enable surface with one explicit manager-owned admin-command operation:DeviceMmio.brokeredNvmeAdminIdentify(schema@7). The verb carries no parameters; the cap holder may not supply queue addresses, opcode, command id, NSID, PRP/SGL entries, data-buffer address, doorbell offset, or doorbell value. The productiondevice_manager::stubparked-pool slot capacity was extended from two to three simultaneously-live bounce-buffer DMABuffers (PARKED_DMAPOOL_LIVE_BUFFER_CAPACITY = 3,PARKED_DMABUFFER_SLOTS = [1, 2, 3]) so the admin SQ (slot 1), admin CQ (slot 2), and IDENTIFY data page (slot 3) can stay parked together; the controller-enable sibling, the bounce-buffer grant proof, and the virtio-net live-publish proof keep their existing single- or dual-buffer behavior unchanged. The production grant source’s kernel-mapped BAR window was correspondingly widened from one to two pages (MAPPED_WINDOW_BYTES = 0x2000undercloud_nvme_admin_identify_proof) so the admin SQ tail (0x1000) and admin CQ head (0x1004) doorbells fall inside the boot-preseeded mapping the manager already uses forCC/AQA/ASQ/ACQ– rawwrite32to either doorbell offset still fails closed at the device-manager boundary asdevicemmio-write32-register-unclaimed(the offset is outside the reset-onlyCCselected-write claim), and the brokered admin IDENTIFY verb is the only path that may ring them. The handlernvme_brokered_admin_identify_op_for_capinkernel/src/device_manager/stub.rs: (1) validates the cap’s BAR matches the parked region and covers both doorbell offsets and theCSTSregister; (2) resolves the three parked admin DMABuffers and requires all three to be live, unmapped, and frame-aligned; (3) re-readsCSTSthrough the boot-preseeded BAR mapping and refuses ifCSTS.RDY=0; (4) authors the full submission queue entry at admin SQ index 0 through the HHDM kernel mapping of the SQ page – opcodeIDENTIFY(0x06, NVMe Base Spec §5.17), command id 1, NSID 0, MPTR 0, PRP1 = data-page physical address (sourced from the manager’s parked-pool ledger), PRP2 0, CDW10 CNS0x01(Controller); (5) issues a SeqCst fence and rings the admin SQ tail doorbell at BAR0 offset0x1000; (6) polls the admin CQ entry at index 0 through the HHDM kernel mapping of the CQ page for the phase-bit flip (NVMe Base Spec §4.6 CQE DW3 bit 16); (7) inspects the CQE status field (bits 30:17 of DW3) and command-id echo, refusing on either mismatch; (8) parses IDENTIFY Controller VID (offset 0, 2 bytes) and SSVID (offset 2, 2 bytes) through the HHDM kernel mapping of the data page; (9) advances the admin CQ head doorbell at BAR0 offset0x1004. The provider sees only bounded-status labels, the manager-selected CNS/opcode/command-id echoes, the three parked-slot identities, the parsed VID/SSVID, and the doorbell side-effect labels. The cap-side dispatch admission carriesauthority_result=ok,result=ok,register_write=performed,side_effect=mmio-write-performed,sq_doorbell_written=true,cq_doorbell_written=true,completion_consumed=true,cq_status=0x0000,prp_source=manager-ledger, andhost_physical_user_visible=false. The kernel diagnostic isnvme: brokered-admin-identify owner=cloud-nvme model=cloud-bounce trigger=manager-op admin_sq_slot=1 admin_cq_slot=2 admin_data_slot=3 cns=0x01 opcode=0x06 command_id=0x0001 ... cqe_status=0x0000 cqe_command_id=0x0001 sq_tail=1 cq_head=1 cq_phase=1 identify_vid=0x1b36 identify_ssvid=0x1af4 sq_doorbell_written=performed cq_doorbell_written=performed completion_consumed=true prp_source=manager-ledger host_physical_user_visible=false proof_result=ok(QEMU’snvmedevice reports PCI VID0x1b36and SSVID0x1af4, which the harness pins). Thecloud-nvme-admin-identify-smokeprovider exercises the inherited fail-closed raw-write claims (six in total: AQA/ASQ/ACQ + rawCC.EN=1+ raw admin SQ tail/CQ head doorbells), invokes the controller-enable verb at@6, invokes the admin IDENTIFY verb at@7, and emits one headlinecloudboot-evidence: provider-nvme-admin-identify <token>marker plus three supplementary[cloud-nvme-admin-identify-smoke] discipline-*lines that re-anchor the contract within the per-callConsole.writeLinebound. Proof:make run-cloud-provider-nvme-admin-identify. Future work (not yet implemented): I/O queue creation,READ/WRITE,Interrupt.wait/Interrupt.acknowledgeadmin-completion handoff, device-autonomous MSI-X delivery, host-physical/IOVA export, provider-authored SQE/PRP/SGL bytes, provider-authored doorbell offsets/values, and live cloud traffic. -
cloud-prod-nvme-admin-completion-wait-ack-local-proofmoves the admin IDENTIFY completion handoff off manager-internal CQ polling onto the productionInterrupt.wait/Interrupt.acknowledgepath. The admission-check-only productionInterruptgrant (interrupt_grant_source_prod,wait/acknowledge=admission-check-only) is replaced by a fully-programmed cap-waiter MSI-X route on the same NVMe BDF (kernel/src/cap/nvme_admin_completion_wait_ack_proof.rs, table entry 0): itsinitregisters + claims the route underManagerGrantSource, programs the MSI-X table entry mask-first with the kernel-authored(message_address, message_data), attaches it to the device manager, arms the deferred-LAPIC-EOI gate, and unmasks. The admin IDENTIFY is split into two manager-owned verbs:DeviceMmio.brokeredNvmeAdminSubmit(schema@8) authors the SQE and rings the admin SQ tail doorbell (no CQ consumed,completion_consumed=false), andDeviceMmio.brokeredNvmeAdminComplete(schema@9) polls/consumes the admin CQ (the manager-owned CQ status/CID check is preserved), parses VID/SSVID, and advances the admin CQ head doorbell. Both reuse the sharedNvmeBrokeredAdminOpResultschema struct. The handoff state machine is ordered and one-shot:brokeredNvmeAdminSubmit(@8) records the exact live admin SQ/CQ/data slots and generations;Interrupt.waitis admitted once for that submitted state, revalidates those live DMA records, and consumes the wait phase;brokeredNvmeAdminComplete(@9) is admitted only after the wait phase; andInterrupt.acknowledgeis admitted once only after the completion phase has been recorded. Hostile complete-before-wait, ack-before-complete, repeat wait, repeat complete, repeat ack, and submit-then-DMABuffer.freeBufferattempts fail closed before injecting extra dispatch, retiring an extra EOI, or freeing/scrubbing the manager-owned admin pages. Between the two verbs the provider callsInterrupt.wait– which injects exactly one bounded, non-autonomousdevice_interrupt::handle_lapic_deliverydispatch on the bound route (result=nvme-admin-completion-wait-ack-dispatch-consumed,real_interrupt_delivery=kernel-injected-dispatch, delivery count+1, one deferred LAPIC EOI armed) – then after the completion verb callsInterrupt.acknowledgeto retire exactly that deferred EOI (hardware_dispatch_ack_delta=1). The chain is: read-only bind -> reset-onlyCC.EN=0-> manager-owned admin buffer materialization ->brokeredNvmeControllerEnable->brokeredNvmeAdminSubmit(@8) -> admin SQ tail doorbell ->Interrupt.waitwake ->brokeredNvmeAdminComplete(@9) -> admin CQ completion consumed -> admin CQ head doorbell advanced ->Interrupt.acknowledgedeferred EOI retired. OnInterruptcap release the kernel requires exactly one observed dispatch, exactly one observed ack, and the terminal acked handoff state, then runs the masked-no-wake + reassign + stale-handle/stale-token assertion chain and emits exactly one headlinecloudboot-evidence: provider-nvme-admin-completion-wait-ack <token>marker labeledadmin_completion_wake=provider-cap-side-injected device_autonomous_raise=not-claimed. The wake is the same bounded kernel-injected cap-waiter model asmake run-cloud-provider-cap-waiter; this proof does not claim a device-autonomously-raised NVMe MSI-X completion interrupt. Proof:make run-cloud-provider-nvme-admin-completion-wait-ack. Future work (not yet implemented): I/O queue creation,READ/WRITE,BlockDevice/filesystem integration, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic. -
cloud-prod-nvme-io-queue-create-local-proofadds the single I/O queue pair (queue id 1) on top of the admin chain. After the combined poll-based admin IDENTIFY (DeviceMmio.brokeredNvmeAdminIdentify @7, VID0x1b36), the manager authors the two queue-establishing admin commands behind parameterless per-command verbs:DeviceMmio.brokeredNvmeCreateIoCqSubmit(schema@10, opcode0x05, CDW10 = queue id 1 | (queue-size-1)<<16, CDW11 PC=1 IEN=0, PRP1 = manager-owned I/O CQ base page) andDeviceMmio.brokeredNvmeCreateIoSqSubmit(schema@11, opcode0x01, CDW10 = queue id 1 | (queue-size-1)<<16, CDW11 = CQ id 1 | PC<<16, PRP1 = manager-owned I/O SQ base page). The opcode/CDWs are manager-selected; the provider supplies nothing (widening@8with a command-selector parameter was rejected because it would let a provider author arbitrary admin opcodes). Each SUBMIT verb authors the SQE at the next admin SQ index, rings the admin SQ tail doorbell, and records the in-flight create; the completion of each is consumed through the sharedDeviceMmio.brokeredNvmeAdminComplete(@9, now command-aware: it reads the admin CQ entry at the recorded index, checks status/CID, and advances the CQ head doorbell) after one provider-cap-sideInterrupt.wait, and the deferred LAPIC EOI is retired by oneInterrupt.acknowledge. The cap-waiter route (kernel/src/cap/nvme_io_queue_create_proof.rs) drives two bounded kernel-injected dispatch + deferred-EOI cycles – one per create – and its ordered handoff enforces CREATE I/O CQ before CREATE I/O SQ, one create at a time, with submit-then-DMABuffer.freeBuffer, repeat-wait, and ack-before-complete attempts failing closed. The I/O CQ/SQ base pages are manager-owned brokered bounce buffers (parked-pool slots 3/4, userspace slots 4/5); their PRP1 is never exported. OnInterruptcap release the kernel requires both creates completed (CQE status 0), exactly two observed dispatches, two observed acks, the idle terminal handoff, and the masked-no-wake + reassign + stale-handle chain, then emits onecloudboot-evidence: provider-nvme-io-queue-create <token>marker labeledio_queue_create_wake=provider-cap-side-injected device_autonomous_raise=not-claimed io_command=create-only io_read=not-attempted io_sq_doorbell=not-attempted. Proof:make run-cloud-provider-nvme-io-queue-create. Future work (not yet implemented): the I/O SQ tail doorbell (0x1008),READ/WRITE, the I/O data page, the I/O-completion route,BlockDevice/filesystem integration, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic. -
cloud-prod-nvme-io-read-local-proofadds one bounded I/OREAD(LBA 0, 1 block, NSID 1) on top of the live I/O queue pair. After the two CREATE I/O queue commands, the manager authors the entire READ SQE behind two parameterless per-command verbs:DeviceMmio.brokeredNvmeIoReadSubmit(schema@12) writes CDW0 (opcode0x02| command-id<<16), NSID 1, MPTR 0, PRP1 = manager-owned I/O read-data page (parked-pool slot 5), PRP2 0, SLBA 0 (CDW10/CDW11), NLB 0 = “1 block” (CDW12 bits 15:0) at I/O SQ index 0 and rings the I/O SQ tail doorbell (0x1008);DeviceMmio.brokeredNvmeIoReadComplete(schema@13) polls the I/O CQ entry at index 0 for the phase flip, checks status/CID, advances the I/O CQ head doorbell (0x100c), reads the first bytes of the read-data page through the kernel mapping, and surfaces a bounded read-data digest (readDataDigestLo/readDataDigestHi= first 8 bytes,readDataLen= transferred length). The provider supplies no opcode/LBA/PRP/ doorbell (a providerwrite32(0x1008, ...)path was rejected because it would break the no-provider-authored-command discipline; reusing the create/admin verbs was rejected because they are hardwired to the admin SQ/CQ doorbells and ledger). The cap-waiter route (kernel/src/cap/nvme_io_read_proof.rs) drives three bounded kernel-injected dispatch + deferred-EOI cycles – two creates plus one read – with submit-then-DMABuffer.freeBuffer, repeat-wait, and ack-before-complete attempts failing closed; the read-data page is a manager-owned brokered bounce buffer (parked-pool slot 5, userspace slot 6) whose PRP1 is never exported, and the manager reads the completed block bytes only through the kernel mapping. OnInterruptcap release the kernel requires both creates and the read completed (CQE status 0), the verified block bytes (readDataLen > 0and a non-zero digest), exactly three observed dispatches, three observed acks, the idle terminal handoffs, and the masked-no-wake + reassign + stale-handle chain, then emits onecloudboot-evidence: provider-nvme-io-read <token>marker labeledio_read_wake=provider-cap-side-injected device_autonomous_raise=not-claimed io_command=read io_read=completed io_sq_doorbell=performed io_cq_completion=polled-io-cqplusio_read_block_bytes=<digest> read_data_len=512. The local QEMU smoke seeds the NVMe backing file’s first sector with a known 16-byte pattern so the digest proves an actual byte transfer, not merely that a CQE arrived. Proof:make run-cloud-provider-nvme-io-read. The same marker shape also passed live on GCE run1780806087-bf69(make cloudboot-gcp-storage-nvme-io-read-test) against a Persistent Disk NVMe controller withvendor.1ae0,device.001f,live_cloud=gce-persistent-disk, and a 512-byte READ digest prefixeb3c904c494d494e4520200002000000. Future work (not yet implemented): a dedicated I/O-completionInterruptroute on live cloud,WRITE, multi-block/second-LBA reads,BlockDevice/filesystem integration, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud coverage beyond this one GCE PD read. -
cloud-prod-nvme-io-write-local-proofadds one bounded I/OWRITE(LBA 0, 1 block, NSID 1) of a fixed manager pattern on top of the live I/O queue pair, proven durable by reading it back. After the two CREATE I/O queue commands, the manager authors the entire WRITE SQE behind two parameterless per-command verbs:DeviceMmio.brokeredNvmeIoWriteSubmit(schema@14) pre-fills the manager-owned I/O write-data page (parked-pool slot 6, userspace slot 7) with the fixed 16-byte signaturefacefeedcafebabe1122334455667788repeated across the block, then writes CDW0 (opcode0x01| command-id<<16), NSID 1, MPTR 0, PRP1 = that page, PRP2 0, SLBA 0, NLB 0 = “1 block” at I/O SQ index 0 and rings the I/O SQ tail doorbell (0x1008);DeviceMmio.brokeredNvmeIoWriteComplete(schema@15) polls the I/O CQ entry at index 0 for the phase flip, checks status/CID, advances the I/O CQ head doorbell (0x100c), reads the first bytes of the write-data page through the kernel mapping, and surfaces a bounded written-pattern digest (carried in the sharedreadDataDigestLo/readDataDigestHi/readDataLenfields). The landed I/OREAD(@12/@13) is then reused unchanged to read LBA 0 back into the read-data page (slot 5) at the next I/O SQ/CQ index (1, since the WRITE consumed index 0); the provider supplies no opcode/LBA/PRP/pattern/doorbell, and a new schema field was deliberately avoided – the durability match is computed kernel-side by comparing the written-pattern digest with the read-back digest. The cap-waiter route (kernel/src/cap/nvme_io_write_proof.rs) drives four bounded kernel-injected dispatch + deferred-EOI cycles – two creates, one write, one read-back – with submit-then-DMABuffer.freeBuffer, repeat-wait, and ack-before-complete attempts failing closed; the write-data and read-data pages are manager-owned brokered bounce buffers whose PRP1 is never exported. OnInterruptcap release the kernel requires both creates, the write, and the read-back completed (CQE status 0), non-zero digests, exactly four observed dispatches and acks, the idle terminal handoffs, the masked-no-wake + reassign + stale-handle chain, and the read-back digest matching the written pattern, then emits onecloudboot-evidence: provider-nvme-io-write <token>marker labeledio_command=write io_write=completed io_sq_doorbell=performed io_cq_completion=polled-io-cq write_pattern=<digest> write_readback_match=true. The local QEMU smoke seeds the backing file’s first sector with a distinct sentinel so the read-back of the manager pattern proves the WRITE transferred the bytes. Proof:make run-cloud-provider-nvme-io-write. Future work (not yet implemented): a dedicated I/O-completionInterruptroute distinct from the admin/create/read route, multi-block/second-LBA/second-NSID I/O, flush/FUA/DSM,BlockDevice/filesystem integration, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic. -
cloud-prod-nvme-io-second-lba-local-proofgeneralizes the manager-authored data path beyond the hardwiredSLBA 0: it proves LBA addressing actually selects the block by driving three sequential I/O commands on the live queue pair through two new parameterless verbs (DeviceMmio.brokeredNvmeIoSecondLbaSubmit @16/brokeredNvmeIoSecondLbaComplete @17), selected by a kernel-owned phase counter: phase 0 reads LBA 0 (the distinctness baseline, read-data slot 5), phase 1 pre-fills the write-data page (slot 6) with a fixed LBA-1-distinct 16-byte pattern0123456789abcdeffedcba9876543210and writes it to LBA 1 (opcode0x01, CDW10 = SLBA low = 1, CDW11 = SLBA high = 0), and phase 2 reads LBA 1 back. Because the landed@12-@15verbs hardwire SLBA 0 (and their I/O index depends on the io-write feature), this proof implies onlycloud_nvme_io_queue_create_proofand authors its own SQEs at I/O SQ indices 0/1/2; the provider supplies no opcode/LBA/PRP/pattern/doorbell. No schema field was added – the LBA-1 read-back match (LBA-1 read digest == LBA-1 write pattern) and the LBA distinctness (LBA-1 read digest != LBA-0 read digest) are computed kernel-side across the three recorded phase digests. The cap-waiter route (kernel/src/cap/nvme_io_second_lba_proof.rs) drives five bounded kernel-injected dispatch + deferred-EOI cycles (two creates + three I/O phases), with submit-then-DMABuffer.freeBuffer, repeat-wait, and ack-before-complete attempts failing closed. OnInterruptcap release the kernel requires both creates and all three phases completed (CQE status 0), non-zero digests, five observed dispatches and acks, the masked-no-wake + reassign + stale-handle chain,second_lba_readback_match=true, andlba_distinct_from_zero=true, then emits onecloudboot-evidence: provider-nvme-io-second-lba <token>marker labeledio_command=second-lba io_second_lba=1 second_lba_readback_match=true lba_distinct_from_zero=true io_sq_doorbell=performed io_cq_completion=polled-io-cq. The local QEMU smoke seeds the backing file’s first sector with the distinct sentineldeadbeefcafebabe0102030405060708so the LBA-0 read returns content distinct from the LBA-1 pattern. Proof:make run-cloud-provider-nvme-io-second-lba. Future work (not yet implemented): a dedicated I/O-completionInterruptroute, multi-block (NLB > 0) I/O, a third LBA or second namespace, flush/FUA/DSM,BlockDevice/filesystem integration (now unblocked by this LBA parameterization), device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic. -
cloud-prod-nvme-io-multiblock-local-proofgeneralizes the manager-authored data path beyond one logical block: it proves the authored SQE drives a transfer larger than a single block by driving two sequential I/O commands on the live queue pair through two new parameterless verbs (DeviceMmio.brokeredNvmeIoMultiblockSubmit @18/brokeredNvmeIoMultiblockComplete @19), selected by a kernel-owned phase counter: phase 0 pre-fills the write-data page (slot 6) with two distinct 16-byte patterns – block 0 =112233445566778899aabbccddeeff00, block 1 =f0e1d2c3b4a5968778695a4b3c2d1e0f– over 1024 B and writes both blocks to LBA 2 (opcode0x01, NLB = block_count - 1 = 1, CDW10 = 2; PRP1 = slot 6, PRP2 = 0, since the 1024 B transfer fits one 4 KiB page), and phase 1 reads LBA 2 back into the read-data page (slot 5). Because the landed@12-@17verbs hardwire a single block (block_count = 1), this proof implies onlycloud_nvme_io_queue_create_proofand authors its own SQEs at I/O SQ indices 0/1; the provider supplies no opcode/LBA/count/PRP/pattern/doorbell. No schema field was added for the second block’s digest – the existingreadDataDigestLo/Hicarry block 0’s first 8 bytes for userspace, while the per-block match (read block 0 == written pattern-0, read block 1 == written pattern-1) and the block-distinctness (block 0 digest != block 1 digest, proving the second 512 B block actually transferred) are computed kernel-side across the two recorded phase digest pairs and attested in the headline marker. The cap-waiter route (kernel/src/cap/nvme_io_multiblock_proof.rs) drives four bounded kernel-injected dispatch + deferred-EOI cycles (two creates + two I/O phases), with submit-then-DMABuffer.freeBuffer, repeat-wait, and ack-before-complete attempts failing closed. OnInterruptcap release the kernel requires both creates and both phases completed (CQE status 0), non-zero digests, four observed dispatches and acks, the masked-no-wake + reassign + stale-handle chain,multiblock_block0_match=true,multiblock_block1_match=true, andmultiblock_blocks_distinct=true, then emits onecloudboot-evidence: provider-nvme-io-multiblock <token>marker labeledio_command=multiblock io_slba=2 io_nlb=1 io_block_count=2 prp2_zeroed=true multiblock_block0_match=true multiblock_block1_match=true io_sq_doorbell=performed io_cq_completion=polled-io-cq. Proof:make run-cloud-provider-nvme-io-multiblock. Future work (not yet implemented): a dedicated I/O-completionInterruptroute, NLB > 1 requiring a PRP list / second mapped page, a third LBA or second namespace, flush/FUA/DSM, wrapping the brokered READ/WRITE behind a userspace-servedBlockDevicecap (now has both LBA selection and >1-block transfer as prerequisites), device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic. -
cloud-prod-nvme-io-synchronous-poll-read-local-proofcollapses the four-call submit/wait/complete/ack NVMe I/O lifecycle into ONE synchronousCapObject::call– the shapeBlockDevice.readBlocks @0requires. It adds two parameterless single-call verbs (DeviceMmio.brokeredNvmeIoSyncWrite @20/brokeredNvmeIoSyncRead @21), each mirroring the combinedbrokeredNvmeAdminIdentify @7: the manager pre-fills the write-data page (slot 6) with112233445566778899aabbccddeeff00(block 0) andf0e1d2c3b4a5968778695a4b3c2d1e0f(block 1), authors the SQE (WRITE opcode0x01at I/O SQ index 0 / READ opcode0x02at index 1; NSID 1, SLBA 2, NLB = 1 / two 512 B blocks, PRP1 = data page, PRP2 = 0), rings the I/O SQ tail doorbell (0x1008), polls the I/O CQ entry phase bit to completion within a bounded budget, advances the I/O CQ head doorbell (0x100c), and reads block 0/block 1 back – all inside one cap call, with noInterrupt.waiton the I/O data path. The two CREATE I/O queue commands still complete through the cap-waiterInterrupt.wait/acknowledgepath, so the route (kernel/src/cap/nvme_io_sync_read_proof.rs) drives only two bounded kernel-injected dispatch + deferred-EOI cycles. The single-call verbs reportsqDoorbellWritten,cqDoorbellWritten, andcompletionConsumedall true in one result; the block-1 match and block-distinctness are computed kernel-side across the recorded WRITE/READ digest pairs. OnInterruptcap release the kernel requires both creates and both single-call I/O commands completed (CQE status 0), non-zero digests, two observed dispatches and acks, the masked-no-wake + reassign + stale-handle chain,sync_block0_match=true,sync_block1_match=true, andsync_blocks_distinct=true, then emits onecloudboot-evidence: provider-nvme-io-sync-read <token>marker labeledio_command=sync-read io_slba=2 io_nlb=1 io_block_count=2 prp2_zeroed=true sync_block0_match=true sync_block1_match=true sync_blocks_distinct=true io_sq_doorbell=performed io_cq_completion=polled-io-cq-single-call interrupt_wait=not-used. Proof:make run-cloud-provider-nvme-io-sync-read. This closes concern (c) of theBlockDevice-shaped read gap (lifecycle collapse) without touchingBlockDeviceCap/crate::virtio(concern a), the manager-op routing into a generic cap (concern b), a dedicated I/O-completionInterruptroute, or a PRP list (NLB > 1). Future work (not yet implemented): introducing an NVMe-backedBlockDeviceCapwhosereadBlocks @0arm calls this single-call op, areadonly_fs-style consumer over it, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic. -
cloud-prod-nvme-io-sync-read-block-bytes-local-proofsurfaces the full read-back bytes. It adds one verb,DeviceMmio.brokeredNvmeIoSyncReadBytes @22 () -> NvmeBrokeredAdminOpReadBytesResult, that reuses the landed single-call poll-read body (nvme_brokered_io_sync_command) unchanged but returns the entire 1024 B read-back (block 0 ‖ block 1), read through the kernel mapping, as the inlinedata :Datafield of a new narrow result struct – the full-bytes shapeBlockDevice.readBlocks @0 -> (data :Data)requires – instead of folding it to an 8-byte digest. The provider issues@20(WRITE) then@22(READ-bytes) as two synchronous cap calls and compares the returneddatabyte-for-byte to the reconstructed manager-authored page, asserting the two 512 B halves differ; the kernel still attests per-block match and distinctness in the release marker. No host-physical/IOVA address crosses the boundary – only the content bytes the caller already authored. The cap-waiter route (kernel/src/cap/nvme_io_sync_read_bytes_proof.rs) is a clone of the sync-read proof that emits onecloudboot-evidence: provider-nvme-io-sync-read-bytes <token>marker labeledio_command=sync-read-bytes io_slba=2 io_nlb=1 io_block_count=2 read_data_len=1024 data_return=inline-bytes data_block0_match=true data_block1_match=true data_blocks_distinct=true io_cq_completion=polled-io-cq-single-call interrupt_wait=not-used. Proof:make run-cloud-provider-nvme-io-sync-read-bytes. This closes concern (b) of theBlockDevice-shaped read gap (full-bytes return) without touchingBlockDeviceCap/crate::virtio(concern a), arbitrary(startLba, count)parameterization, a dedicated I/O-completionInterruptroute, or a PRP list (NLB > 1). Future work (not yet implemented): an NVMe-backedBlockDeviceCapbackend enum whosereadBlocks @0arm calls this op, arbitrary-LBA routing, device-autonomous MSI-X delivery, host-physical/IOVA export, and live cloud traffic. -
cloud-prod-nvme-blockdevice-fixed-lba-read-arm-local-proofmakes the NVMe namespace consumable through the SAMEBlockDevice.readBlocks @0interface a filesystem consumer calls (proof-fixed LBA arm). It replacesBlockDeviceCap’s baredevice_index: usizewith aBlockDeviceBackendenum (kernel/src/cap/block_device.rs): the always-builtVirtio { device_index }arm (behavior-identical to today, verified bymake run-storage-fs) and, undercloud_nvme_blockdevice_read_proof, anNvmeBrokered { handle, owner }arm. TheNvmeBrokeredarm’sreadBlocks @0accepts ONLYstartLba == 2 && count == 2(fails closed withBlockDevice.readBlocks NVMe arm fixed to SLBA 2 NLB 1on any other window) and drives the landednvme_brokered_io_sync_read_bytes_op_for_cap(@22) body into a local 1024 B buffer surfaced as inlineDatathrough the sameread_blocks_resultsbuilder the virtio arm uses;writeBlocks/flushfail closed (read-only namespace) andinforeturns the fixed geometry. A new bootstrap grant arm mints theNvmeBrokeredcap bound to the SAME livedevice_mmiohandle/owner the production grant source staged (devicemmio_grant_source_prod::live_handle_for_nvme_blockdevice), so thedevice_mmiogrant must precede theblock_devicegrant in the manifest cap list. The provider drives the full bring-up (reset → enable → IDENTIFY @7 → CREATE I/O CQ @10 → CREATE I/O SQ @11 →@20WRITE) and issues the TERMINAL read throughBlockDevice.readBlocks(2, 2)instead of a rawDeviceMmio @22. The cap-waiter route (kernel/src/cap/nvme_blockdevice_read_proof.rs, a clone of the sync-read-bytes proof) emits onecloudboot-evidence: provider-nvme-blockdevice-read <token>marker labeledread_path=blockdevice-readblocks read_iface=BlockDevice read_method=0 io_slba=2 io_nlb=1 io_block_count=2 read_data_len=1024 data_return=inline-bytes nvme_arm_fixed_lba=true arbitrary_lba=not-supported. Proof:make run-cloud-provider-nvme-blockdevice-read. This closes concern (a) of theBlockDevice-shaped read gap (the same schema method, not the bespoke@22verb), restricted to the proof-fixed window. Future work (not yet implemented): arbitrary(startLba, count)parameterization, NVMe write/flush durability throughBlockDevice, areadonly_fs-style filesystem mounted over the NVMeBlockDevicecap, a dedicated I/O-completionInterruptroute, NLB > 1 with a PRP list, and graduating the NVMe data plane out of the per-proof feature into always-built production. -
cloud-prod-nvme-blockdevice-arbitrary-lba-read-local-proofwidens theNvmeBrokeredarm off the hardwired SLBA 2 / NLB 1:readBlocks @0now honors an ARBITRARY(startLba, count)window (read-only, bounded to one PRP1 page). The shared single-call bodynvme_brokered_io_sync_command(kernel/src/device_manager/stub.rs) gains explicitslba/block_countfields onSyncIoParamsand authors CDW10/CDW11 (SLBA) and CDW12 (NLB) from them instead of the module constants; the transfer length isblock_count * 512, bounded fail-closed to one 4 KiB PRP1 page (block_count <= 8) so PRP2 stays 0. The existing@20/@21/@22callers keep passing the proof-fixed SLBA 2 / count 2, so their behavior is byte-identical (regression:make run-cloud-provider-nvme-io-sync-read-bytes). A new parameterized opnvme_brokered_io_sync_read_window_op_for_cap(handle, owner, slba, count, out_data)rotates the I/O SQ/CQ index off the kernel-side read sequence (window 0 at index 1 after the WRITE at index 0, window 1 at index 2) so each completion is polled at the CQ slot the controller actually writes. Undercloud_nvme_blockdevice_arbitrary_lba_proof(implies and supersedescloud_nvme_blockdevice_read_proof),BlockDeviceCap::nvme_read_blocksadmits any1 <= count <= 8window withstartLba + count <= namespace blocks(the IDENTIFY-derived NSZE reported throughinfo @2– 16 MiB / 512 = 32768 on the QEMU fixture image; see the READ-arm graduation entry), and fails closed with distinct errors forcount == 0/count > 8(... count out of range (1..=8)) and a window past the namespace end (... window past namespace end). The proof (kernel/src/cap/nvme_blockdevice_arbitrary_lba_proof.rs, a clone of the fixed-LBA module) drives the full bring-up plus the@20WRITE (seeding LBA 2 = pattern-0, LBA 3 = pattern-1), then issues TWO distinctreadBlockswindows –readBlocks(0, 1)(zero-filled LBA 0) andreadBlocks(3, 2)(LBA 3 = pattern-1, LBA 4 = zero-filled) – comparing each returneddatabyte-for-byte to the manager-authored content and asserting the two windows return distinct content. It emits onecloudboot-evidence: provider-nvme-blockdevice-arbitrary-lba-read <token>marker labeledarbitrary_lba=supported window0_slba=0 window0_count=1 window1_slba=3 window1_count=2 windows_distinct=true prp_pages=single nvme_arm_fixed_lba=false. Proof:make run-cloud-provider-nvme-blockdevice-arbitrary-lba-read. With this, the NVMe namespace is readable throughBlockDevice.readBlocks @0at the LBA the consumer names. Future work (not yet implemented): NLB spanning more than one PRP1 page (count > 8) with a PRP list, NVMe write/flush durability throughBlockDevice, areadonly_fs-style filesystem mounted over the NVMeBlockDevicecap, a dedicated I/O-completionInterruptroute, and graduating the NVMe data plane out of the per-proof feature into always-built production. -
cloud-prod-nvme-blockdevice-writeblocks-durability-arm-local-proofarms theNvmeBrokeredarm’swriteBlocks @1(read-only until now): it drives the brokered NVMe sync WRITE with the caller-supplied(startLba, count, data)and proves write-then-read-back durability. A new parameterized opnvme_brokered_io_sync_write_window_op_for_cap(handle, owner, slba, count, in_data)(kernel/src/device_manager/stub.rs) mirrors the arbitrary-window READ entry but rotates the I/O index off a kernel-side write sequence (next_write_io_index()=> index 0, before the read-back at index 1). The shared single-call bodynvme_brokered_io_sync_commandgains a third fill mode – awrite_payload: Option<&[u8]>that copies the caller’scount * 512bytes into block 0..count of the manager-owned write-data page (slot 6) through the HHDM mapping before the WRITE SQE is authored – beside the fixedprefill_patternand the readonly_fsseed_imagemodes (both unchanged: regressionsmake run-cloud-provider-nvme-blockdevice-arbitrary-lba-readandmake run-storage-fsstay green). Undercloud_nvme_blockdevice_writeblocks_proof(implies and supersedescloud_nvme_blockdevice_arbitrary_lba_proof),BlockDeviceCap::nvme_write_blocksadmits any1 <= count <= 8window withstartLba + count <= namespace blocksanddata.len() == count * 512, failing closed with distinct errors for zero count, over-capacity, past-namespace-end, and length mismatch;info @2reportsreadOnly = false;flush @3stays fail-closed (a real NVMe FLUSH, opcode0x00, is a distinct verb – see theflush @3capability below). The proof (kernel/src/cap/nvme_blockdevice_writeblocks_proof.rs, a clone of the arbitrary-LBA module) drives the full bring-up thenwriteBlocks(5, 2, data)with a caller-authored, non-zero, two-distinct-block 1024 B payload, followed byreadBlocks(5, 2), comparing the read-back byte-for-byte to the bytes written. It emits onecloudboot-evidence: provider-nvme-blockdevice-writeblocks-durability <token>marker labeledwrite_path=blockdevice-writeblocks write_method=1 write_slba=5 write_count=2 write_data_len=1024 readback_data_len=1024 write_readback_match=true nvme_arm_read_only=false flush=fail-closed prp_pages=single. Proof:make run-cloud-provider-nvme-blockdevice-writeblocks-durability. No schema/binding change (writeBlocks @1andreadBlocks @0round-trip through existing bindings). Future work (not yet implemented): awritable_fs/persistent_storeconsumer mounted over the NVMeBlockDevicewrite arm, a real NVMe FLUSH onflush @3, a dedicated I/O-completionInterruptroute on the data path, and graduating the NVMe data plane out of the per-proof feature into always-built production. -
ddf-nvme-multiprp-blockdevice-window-local-proofextends the sameBlockDevice.writeBlocks @1/readBlocks @0round-trip to a three-page NVMe PRP window. Undercloud_nvme_blockdevice_multiprp_window_proof,BlockDeviceCap::nvme_write_blocksandBlockDeviceCap::nvme_read_blocksacceptcount <= 24for the local proof geometry while the default and older proof builds keep the one-pagecount <= 8bound. The sharednvme_brokered_io_sync_commandbody resolves primary read/write data pages from parked-pool slots 5/6, a manager-owned PRP-list page from slot 7, read extension pages from slots 8/9, and write extension pages from slots 10/11. For thewriteBlocks(5, 24, data)andreadBlocks(5, 24)proof window it authors PRP1 as the primary data page and PRP2 as a PRP-list page containing two little-endian page pointers, matching the NVMe PRP-list subset in NVMe Base Specification 1.4 §4.3. The provider still supplies only inlineDatathrough theBlockDeviceschema and never sees a host physical address, IOVA, PRP1, PRP2, PRP-list page address, SQE byte, doorbell offset, or doorbell value. Requests with zero count, count 25, namespace overflow, or length mismatch fail closed before any I/O SQ doorbell write. The release marker includes full-transfer FNV-1a hashes for the WRITE and read-back records so the kernel-side proof is not limited to the first two 16-byte block digests; the userspace smoke also compares all 12 KiB byte-for-byte. Proof:make run-cloud-provider-nvme-blockdevice-multiprp-window. -
cloud-prod-nvme-blockdevice-flush-local-proofarms theNvmeBrokeredarm’sflush @3(fail-closed until now): it authors a real NVMe FLUSH (NVM command-set opcode0x00, NSID-scoped, no data transfer) through the brokered sync command machinery and proves awriteBlocksthenflushreturns CQE status0and the written block survives the flush. A new parameter-free opnvme_brokered_io_sync_flush_op_for_cap(handle, owner)(kernel/src/device_manager/stub.rs) drives the shared single-call bodynvme_brokered_io_sync_commandwith a FLUSHSyncIoParams(opcode = 0x00,command_id = 8,slba = 0,block_count = 0), rotating the I/O index off a kernel-side flush sequence (next_flush_io_index()=> index 1, after the WRITE at 0 and before the read-back at 2). The shared body learns the FLUSH shape (gated on the opcode): it skips the one-PRP1-page data bound, authors the SQE with NSID only andPRP1 = 0/PRP2 = 0/CDW10..15 = 0(no data page touched), and the WRITE/READ data-bearing path stays byte-identical for non-FLUSH opcodes. Undercloud_nvme_blockdevice_flush_proof(implies and supersedescloud_nvme_blockdevice_writeblocks_proof, the flush proof’s true sibling, so the write/read arms and the whole brokered I/O chain are reused unchanged),BlockDeviceCap::nvme_flushreturns()when the FLUSH was authored + the SQ doorbell rung + the completion consumed + CQE status0, failing closed otherwise. The proof (kernel/src/cap/nvme_blockdevice_flush_proof.rs, a clone of the writeblocks module) drives the full bring-up thenwriteBlocks(5, 2, data),flush(), andreadBlocks(5, 2), comparing the post-flush read-back byte-for-byte to the bytes written. It emits onecloudboot-evidence: provider-nvme-blockdevice-flush <token>marker labeledflush_path=blockdevice-flush flush_method=3 nvme_flush_opcode=0x00 flush_cqe_status=0 write_then_flush_ok=true flush_data_transfer=none prp1=0 prp2=0 write_readback_after_flush_match=true reboot_persistence=deferred durability_proof=flush-completion-only virtio_flush_regression=green. Proof:make run-cloud-provider-nvme-blockdevice-flush. No schema/binding change (flush @3 () -> ()round-trips through existing bindings with its empty params/result). Future work (not yet implemented): an NVMe reboot-persistence pass and crash-consistency where the FLUSH barrier specifically changes the survival outcome (a flushed write surviving a forced poweroff an unflushed one would not), routingFile.sync/ the writable-fs / persistent-store sync through this FLUSH, a dedicated I/O-completionInterruptroute on the data path, NLB>1 spanning multiple PRP pages with a PRP list, and graduating the NVMe data plane out of the per-proof feature into always-built production. -
cloud-prod-nvme-blockdevice-reboot-persistence-local-proofcloses the reboot-persistence gap the flush proof named first: it proves a normally committed- FLUSHED write survives a CLEAN reboot through the same
BlockDeviceinterface, the two-boot analogue ofrun-storage-persiston the NVMe arm. The Makefile recipe (make run-cloud-provider-nvme-blockdevice-reboot-persistence) creates ONEnvme.rawimage and boots the non-qemucloudboot kernel over it TWICE WITHOUT regenerating it between boots. The provider self-selects its boot phase by probing the LBA 5..6 window throughreadBlocks(5, 2) @0– the data window itself is the guard sentinel: boot 1 reads back all-zero (fresh namespace), takes the writer branch, and issueswriteBlocks(5, 2, data) @1+ a realflush() @3(CQE status0); QEMU restarts against the SAME backing file and boot 2 reads back the known payload, takes NO writer branch, and the single read-back verifies persistence. The proof reuses the landedwriteBlocks @1/flush @3/readBlocks @0arms and the brokered sync command machinery unchanged; the only kernel-internal additions are a flat per-boot single-call I/O op log (reworked from the flush proof’s rigid WRITE -> FLUSH -> read-back state machine so the verifier boot can record a READ with no prior WRITE in the same boot), the data-window phase select, and the cross-bootphase=1|2marker labels. The cap-waiter route + headline marker come fromkernel/src/cap/nvme_blockdevice_reboot_persistence_proof.rs(a clone of the flush module undercloud_nvme_blockdevice_reboot_persistence_proof, which implies and supersedescloud_nvme_blockdevice_flush_proof). Each boot emits onecloudboot-evidence: provider-nvme-blockdevice-reboot-persistence <token>marker carrying itsphase(phase=1 ... write_then_flush_ok=true flush_cqe_status=0 boot_role=writer-flushon boot 1;phase=2 ... reboot_persistence_match=true boot_role=verifier durability_proof=clean-reboot-persistenceon boot 2). The reboot-persistence gate is the cross-boot correlation: boot 1’s persisted block digests equal boot 2’s read-back block digests (and both equal the known payload). No schema/binding change. Future work (not yet implemented): crash-consistency where the FLUSH barrier specifically changes the survival outcome under an induced mid-flush crash (the analogue ofrun-storage-writable-recovery), routingFile.sync/ the writable-fs / persistent-store sync through this FLUSH, a dedicated I/O-completionInterruptroute on the data path, NLB>1 spanning multiple PRP pages, and graduating the NVMe data plane out of the per-proof feature into always-built production.
- FLUSHED write survives a CLEAN reboot through the same
-
cloud-prod-nvme-blockdevice-flush-crash-consistency-local-proofcovers the flushed-write-survives half of crash-consistency: it proves a normally committed + FLUSHED write survives a FORCED poweroff (an abruptkill -9of the QEMU process AFTER the flush barrier completed), the NVMeBlockDeviceanalogue ofrun-storage-writable-recovery. The Makefile recipe (make run-cloud-provider-nvme-blockdevice-flush-crash-consistency) creates ONEnvme.rawimage, boots the non-qemucloudboot kernel over it in the BACKGROUND (boot 1: empty namespace -> writer branch ->writeBlocks(5, 2, data) @1- real
flush() @3, CQE status0), watches the kernel log for the bounded arming marker[nvme-blockdevice-flush-crash-consistency] kernel: flushed write armed; awaiting forced poweroff,kill -9s the QEMU PID (the forced poweroff AFTER the flush barrier), then boots a SECOND time over the SAME file WITHOUT regenerating it (boot 2: verifier -> singlereadBlocks(5, 2) @0read-back). The proof reuses the reboot-persistence predecessor’s two-boot phase select, the landedwriteBlocks @1/flush @3/readBlocks @0arms, and the brokered sync command machinery unchanged; the only kernel-internal additions over the predecessor are the phase-1 arm-and-spin window after the flush (on_releaseemits the arming marker and spins forever so the recipe cankill -9at that point) and the forced-poweroff marker labels. The cap-waiter route + headline marker come fromkernel/src/cap/nvme_blockdevice_flush_crash_consistency_proof.rs(a clone of the reboot-persistence module undercloud_nvme_blockdevice_flush_crash_consistency_proof, which implies and supersedescloud_nvme_blockdevice_reboot_persistence_proof). Boot 1 emits onecloudboot-evidence: provider-nvme-blockdevice-flush-crash-consistency <token>marker carryingphase=1 ... write_then_flush_ok=true flush_cqe_status=0 armed_forced_poweroff=true boot_role=writer-flush-armbefore the spin; boot 2 emits one carryingphase=2 ... flush_survives_forced_poweroff=true boot_role=verifier durability_proof=flush-survives-forced-poweroff. The crash-consistency gate is the cross-boot correlation: boot 1’s persisted block digests equal boot 2’s read-back block digests (and both equal the known payload), AND boot 1 reached the arm-and-spin window (was forcibly killed, did not take the verifier branch). No schema/binding change. Scoped honestly: “an unflushed write rolls back” is NOT provable under QEMU’s-device nvmecache=writebackmodel (the host page cache surviveskill -9), so the differential-rollback half is NOT claimed (unflushed_rollback=not-provable-under-qemu-nvme-model). Future work (not yet implemented): a dedicated I/O-completionInterruptroute on the data path, NLB>1 spanning multiple PRP pages, and graduating the NVMe data plane out of the per-proof feature into always-built production. (Both higher-level consumer FLUSH routings are now closed: the writable-fsFile.synchalf bycloud-prod-nvme-consumer-sync-to-flush-local-proofand the persistent-Storeput-commit half bycloud-prod-nvme-persistent-store-sync-to-flush-local-proof, both below.)
- real
-
cloud-prod-nvme-dedicated-io-completion-interrupt-local-proofmoves the NVMeBlockDevice.writeBlocks @1/readBlocks @0data-completion handoff off the synchronous I/O-CQ poll return path and onto a dedicated dataInterruptroute. The spec basis remains NVMe Base Specification 1.4 submission/completion queue doorbells and completion entries (§3 controller registers, §4 queue management and CQ phase/status handling, §6 NVM WRITE / READ commands). The proof keeps table entry 0 for the CREATE I/O CQ/SQ admin completions and adds table entry 1 for the data I/O CQ completions; both routes are kernel-injected cap-waiter MSI-X routes, not a device-autonomous interrupt claim. The implementation entry points arekernel/src/cap/nvme_io_completion_interrupt_proof.rs(init,invoke_wait,invoke_acknowledge,poll_blockdevice_completions,emit_marker),kernel/src/device_manager/stub.rs(nvme_brokered_io_completion_interrupt_submit_op_for_cap,nvme_brokered_io_completion_interrupt_complete_op_for_cap,nvme_io_completion_interrupt_submit_record_buffers_live), andkernel/src/cap/block_device.rs(nvme_interrupt_write_blocks,nvme_interrupt_read_blocks,call_with_context). The manager still authors queue bases and PRP1 from the liveDMAPoolledger, copies the caller’s write payload into the parked write-data page, consumes the I/O CQ entry atInterrupt.acknowledge, advances the I/O CQ head doorbell (0x100c), and posts the deferredBlockDevicecompletion only after the bounded caller CQ has room. The proof (make run-cloud-provider-nvme-io-completion-interrupt) driveswriteBlocks(5, 2, data), waits/acks the data route, observes the deferred write completion, then drivesreadBlocks(5, 2), waits/acks the data route, and receives the read bytes through the standardblock_device::read_blocks_resultsdata field. The headline markercloudboot-evidence: provider-nvme-io-completion-interrupt <token>pinscreate.entry.0,io.entry.1,data_route_distinct_from_create_route=true, four dispatches/four deferred EOIs, both deferredBlockDevicecompletions posted, and a byte-for-byte write/read-back match. Scoped honestly: queue-base and PRP addresses remain hidden (host_physical_user_visible=0,iova_export=disabled-future-only,prp_source=manager-ledger); multi-PRP windows (count > 8), provider-written PRP/SGL/address lanes, live cloud, a second namespace, FUA/DSM, and device-autonomous MSI-X delivery remain future work. -
cloud-prod-readonly-fs-over-nvme-blockdevice-local-proofprovides areadonly_fs-style consumer over the NVMeBlockDevicearm: the read-only filesystem mount reads its sectors through the NVMeBlockDevicecap instead of the kernel-owned virtio-blk free functions.kernel/src/cap/readonly_fs.rsgains aBlockSourceseam abstracting the two reads a mount needs (device geometry + range read). The always-builtVirtiovariant routes to the samecrate::virtiofree functions, somake run-storage-fsstays byte-identical; theNvmevariant (built only undercloud_readonly_fs_over_nvme_proof) reads through a granted NVMe-backedBlockDevice– geometry from the IDENTIFY Namespace claim (see the READ-arm graduation entry), each chunked range read throughnvme_brokered_io_sync_read_window_op_for_cap, one 4 KiB PRP1 page per call. Because the brokered controller is brought up by the userspace provider, the NVMe rootDirectory(granted viaread_only_fs_root) defers its mount-parse to the firstDirectory.open. The proof (kernel/src/cap/readonly_fs_over_nvme_proof.rs, a clone of the arbitrary-LBA module) drives the full bring-up, then seeds a tinyCAPOSRO1image through the repurposed@20op (one manager-baked sector per call: superblock @ LBA 0, entry table @ LBA 1, file data @ LBA 2), mounts the filesystem over the NVMeBlockSource, opens the one seeded file, reads it, and compares the bytes. It emits onecloudboot-evidence: provider-readonly-fs-over-nvme <token>marker labeledread_path=readonly-fs-over-blockdevice fs_format=CAPOSRO1 block_source=nvme-blockdevice file_match=true superblock_via_nvme=true entry_table_via_nvme=true extent_via_nvme=trueafter the kernel verifies each read-back block-0 digest against the baked image. Proof:make run-cloud-provider-readonly-fs-over-nvme. The malformed-image fail-closed paths (bad superblock magic, out-of-range entry-table or file extent) are the unchanged sharedmount_root_inner/parse_entriesvalidation inkernel/src/cap/readonly_fs.rs– theBlockSourceseam swaps only the block-read backend, so the existingMountErrorchecks covered bymake run-storage-fsapply identically over the NVMeBlockSource; the NVMe arm additionally rejects an over-range range read with the arbitrary-LBA arm’s fail-closed error. Future work (not yet implemented): a multi-file directory walk /Directory.listtraversal over NVMe, files whose extents span many one-PRP1-page chunks, NVMe write/flush durability throughBlockDevice(the image is seeded via the manager-owned@20op, notwriteBlocks), a dedicated I/O-completionInterruptroute, and graduating the NVMe data plane and the readonly_fs NVMe mount out of the per-proof feature into always-built production. -
cloud-prod-readonly-fs-over-nvme-multifile-dirwalk-local-proofextends the read-only filesystem to multi-file directories: it lists a directory with more than one entry and reads two distinct files over the NVMeBlockDevicecap, one of which spans multiple 4 KiB chunks. The baked image (kernel/src/cap/readonly_fs_over_nvme_multifile_proof.rs, a clone of the single-file module) grows to 12 sectors – superblock @ LBA 0, a two-record entry table @ LBA 1, a one-sector small file @ LBA 2, and a 9-sector large file @ LBA 3..11 carrying a deterministic position-dependent byte pattern. The largeFile.readcovers nine sectors, so theread_rangechunk loop issues TWOBlockDevice.readBlocks @0calls (an 8-sector chunk @ LBA 3 + a 1-sector chunk @ LBA 11) – the multi-chunk path the single-file arm never exercised. The proof identifies each recorded read by(slba, count)and verifies it byte-for-byte with a per-read FNV-1a-64 over the full transfer (computed indevice_manageralongside the block digests), so a dropped trailing chunk fails closed. Because per-sector seeding (12) plus the filesystem reads (5) issue 17 single I/O commands and the monotonic I/O SQ/CQ index must stay inside one first CQ pass, the build raisesdevice_manager::stub::NVME_IO_QUEUE_DEPTHfrom 8 to 32 (createcdw10=0x001f0001); the change is inert for every other NVMe proof build. It emits onecloudboot-evidence: provider-readonly-fs-over-nvme-multifile <token>marker labeleddir_entry_count=2 file_count=2 files_distinct=true large_file_full_match=true large_file_read_blocks_calls=2 superblock_via_nvme=true entry_table_via_nvme=true extents_via_nvme=true(plus the single-file arm’s discipline labels). Proof:make run-cloud-provider-readonly-fs-over-nvme-multifile; the virtio mount path stays byte-identical (make run-storage-fs). Future work (not yet implemented): NVMe write/flush durability throughBlockDevice, a dedicated I/O-completionInterruptroute on the data path, NLB > 1 spanning multiple PRP pages in a single call, sub-directory trees, and graduating the NVMe data plane and the readonly_fs NVMe mount out of the per-proof feature into always-built production. -
cloud-prod-persistent-store-over-nvme-blockdevice-local-proofprovides a writable consumer over the NVMeBlockDevicewrite arm: the disk-backed persistentStoremounts over the NVMeBlockDevicewrite arm and proves a put-then-get durability round-trip.kernel/src/cap/persistent_store.rsgains a read+writeBlockSourceseam (mirroringreadonly_fs::BlockSourcebut with awrite_blocksmethod). The always-builtVirtiovariant routes to the samecrate::virtiofree functions byte-identically (including thedata_region_base_lba()installable-disk offset, folded into the variant), somake run-storage-persiststays green; theNvmevariant (built only undercloud_persistent_store_over_nvme_proof) reads throughnvme_brokered_io_sync_read_window_op_for_capand writes throughnvme_brokered_io_sync_write_window_op_for_cap, one 4 KiB PRP1 page per call. Because the brokered controller is brought up by the userspace provider, the NVMe rootStore(granted viapersistent_store) defers its mount-parse to the firstStorecall. The proof (kernel/src/cap/persistent_store_over_nvme_proof.rs, a clone of the writeblocks module) drives the full bring-up, then seeds aCAPOSST1superblock + empty entry table through the repurposed@20op (superblock @ LBA 0, entry table @ LBA 1), and exercises the grantedStore:Store.putwrites the data extent (LBA 2), entry-table sector, and superblock throughBlockDevice.writeBlocks @1, andStore.getreads the extent back throughBlockDevice.readBlocks @0. The kernel attests the put WRITE and get READ block-0 digests both equal the payload digest and differ from the pre-put (zero) extent, and userspace compares the returned bytes byte-for-byte. It emits onecloudboot-evidence: provider-persistent-store-over-nvme <token>marker labeledwrite_path=store-put-over-blockdevice-writeblocks read_path=store-get-over-blockdevice-readblocks consumer=persistent-store store_iface=Store block_iface=BlockDevice store_format=CAPOSST1 write_method=1 read_method=0 put_get_roundtrip_match=true durability_attested=true virtio_regression=green. Because the round-trip issues 8 single I/O commands (2 seed WRITEs + 2 deferred-mount READs + 3Store.putWRITEs + 1Store.getREAD) whose last monotonic CQ head reaches 8 – past the default depth-8 first pass – the build raisesdevice_manager::stub::NVME_IO_QUEUE_DEPTHfrom 8 to 16; the change is inert for every other NVMe proof build. Proof:make run-cloud-provider-persistent-store-over-nvme. No schema/binding change (Store.put/get/has/deleteandBlockDevice.writeBlocks @1/readBlocks @0round-trip through existing bindings). Future work (not yet implemented): routing the writable filesystem (CAPOSWF1) over the NVMe write arm, a real NVMe FLUSH onflush @3(stays fail-closed), an NVMe reboot-persistence pass, a dedicated I/O-completionInterruptroute on the data path, NLB > 1 spanning multiple PRP pages, and graduating the NVMe data plane out of the per-proof feature into always-built production. -
cloud-prod-writable-fs-over-nvme-blockdevice-local-proofmounts the full disk-backed writable filesystem over the NVMe arm (kernel/src/cap/writable_fs.rs, theCAPOSWF1node-table tree withmkdir/rename/removeand a fail-closed single-writer policy).writable_fscarries a read+writeBlockSourceseam mirroringpersistent_store’s: theVirtiovariant (built only in the qemu/ installable storage builds) routes to thecrate::virtiofree functions byte-identically (folding thedata_region_base_lba()offset), somake run-storage-writable/make run-storage-writable-recoverystay green; theNvmevariant (built only undercloud_writable_fs_over_nvme_proof) reads throughnvme_brokered_io_sync_read_window_op_for_capand writes throughnvme_brokered_io_sync_write_window_op_for_cap, one 4 KiB PRP1 page per call. Becausewritable_fsuses a process-wide singleton volume, the NVMewritable_fs_rootgrant stages the livedevice_mmiohandle and defers the singleton mount-parse to the firstDirectory/Filecall. The proof (kernel/src/cap/writable_fs_over_nvme_proof.rs, a clone of the persistent-store module that supersedes and drops it) seeds aCAPOSWF1superblock + root + one seeded file through the@20op, contiguously from LBA 256 (superblock @256, node table @257, seeded file extent @258), then exercises the granted filesystem. READ arm: opening the seeded file triggers the deferred mount, which reads the seeded extent (@258) back throughBlockDevice.readBlocks @0;File.readreturns the RAM copy the mount loaded. WRITE arm:File.writeto a fresh file lands a bump-allocated data extent (@259) + node-record + superblock throughBlockDevice.writeBlocks @1. BecauseFile.readserves the RAM content cache the mount loaded (not a fresh disk read), a same-extent disk re-read of a just-written file – which needs a remount/ reboot – is out of scope; the matching block-0 digests prove the same payload traversed both device arms, each device-acked. The single-writer policy is proven intact: aFile.writethrough a second grantedwritable_fs_rootcap fails closed. It emits onecloudboot-evidence: provider-writable-fs-over-nvme <token>marker labeledwrite_path=file-write-over-blockdevice-writeblocks read_path=file-read-over-blockdevice-readblocks consumer=writable-fs fs_iface=Directory file_iface=File block_iface=BlockDevice fs_format=CAPOSWF1 write_method=1 read_method=0 write_read_roundtrip_match=true durability_attested=true single_writer_policy=enforced second_writer_denied=true recovery_over_nvme=deferred virtio_regression=green. The round-trip issues 11 single I/O commands (3 seed WRITEs + 3 deferred-mount READs + 5File.writeWRITEs);NVME_IO_QUEUE_DEPTHstays 16. Proof:make run-cloud-provider-writable-fs-over-nvme. No schema/binding change. Future work (not yet implemented): the unclean-shutdown / forced- poweroff recovery window (recovery_crash_after_record) over the NVMe arm (the analogue ofrun-storage-writable-recovery, proved on virtio here), a real NVMe FLUSH onflush @3, an NVMe reboot-persistence pass, a dedicated I/O-completionInterruptroute on the data path, NLB > 1 spanning multiple PRP pages, and graduating the NVMe data plane out of the per-proof feature into production. -
cloud-prod-writable-fs-over-nvme-recovery-local-proofproves the unclean-shutdown / forced- poweroff RECOVERY window (recovery_crash_after_record, the record-sector-written- but-superblock-not-yet-committed window inkernel/src/cap/writable_fs.rs) over the NVMeBlockDevicearm. A newcloud_writable_fs_over_nvme_recovery_prooffeature implies (and supersedes the happy-path proof module/route/init of)cloud_writable_fs_over_nvme_proofand widens thestorage_writable_recoverycrash-window cfg gate so the samerecovery-orphan.txtsentinel arms an induced forced poweroff when the writable filesystem is NVMe-backed. The recovery cap-waiter module (kernel/src/cap/writable_fs_over_nvme_recovery_proof.rs) reuses the NVMeBlockSourcearm, deferred mount, thirdWritableFsRootgrant arm, window ops, and I/O queue create unchanged. Unlike the happy-path proof, theCAPOSWF1image is HOST-BUILT (tools/mkstore-image --writable-nvmelays an empty superblock + root-only node table) rather than seeded through@20: a two-boot SAME-image recovery flow cannot re-seed on pass 2 without clobbering the pass-1 committed state, and the read ordering does not depend on a per-boot seed here (mirroring the virtiorun-storage-writable-recoveryproof, which also boots a host-built image twice).make run-cloud-provider-writable-fs-over-nvme-recoveryboots QEMU twice with-device nvmeagainst one shared raw drive file: pass 1 commits aFile.write+ sub-directory throughwriteBlocks @1, allocates the sentinel (its record sector lands on the namespace), and spins; the harnesskill -9s QEMU before the superblock commit. Pass 2 boots the SAME file, mounts by reading the old superblock + node table back throughreadBlocks @0, and asserts the recovered tree omits the orphan slot (exactly the committed entries remain), preserves the committed mutation (file size + content), accepts a usable post-recovery write, and denies a second-grant write (single-writer policy). The userspace smoke emits onecloudboot-evidence: provider-writable-fs-over-nvme-recovery <token>marker labeledcrash_window=record-written-superblock-uncommitted orphan_slot_ignored=true committed_mutation_survived=true post_recovery_write_ok=true recovery_over_nvme=true single_writer_policy=enforced durability_basis=host-page-cache real_flush=deferred reboot_persistence=deferred io_completion=polled interrupt_wait=not-used-on-data-path virtio_recovery_regression=green live_cloud=not-attempted; the kernel proof module’son_releaseindependently attests the cap-waiter route lifecycle + the two CREATE I/O queue dispatch/ack cycles. The two CREATE I/O queue commands keep their productionInterrupt.wait/acknowledgecap-waiter cycles; the data path stays polled. Bounded-proof caveat: one record-vs-commit window, host-page-cache durability (the two passes share one backing file; akill -9preserves the host page cache), NOT media crash-consistency and NOT a real NVMe FLUSH barrier. No schema/binding change. Future work (not yet implemented): a real NVMe FLUSH onflush @3, an NVMe clean- reboot-persistence pass, NLB > 1 spanning multiple PRP pages, a dedicated I/O-completionInterruptroute on the data path, and graduating the NVMe data plane into production. Proof:make run-cloud-provider-writable-fs-over-nvme-recovery. -
cloud-prod-nvme-consumer-sync-to-flush-local-proofroutes a consumer-levelFile.sync @4to a realBlockDevice.flush @3NVMe FLUSH media barrier instead of a write-side no-op.writable_fs::BlockSourcecarries aflush()arm (theVirtiovariant returnsOk(())– the driver negotiates noVIRTIO_BLK_F_FLUSH, so virtioFile.syncstays a byte-identical no-op andmake run-storage-writablestays green; theNvmevariant drivesnvme_brokered_io_sync_flush_op_for_capwith the same success predicate the read/write arms apply), andFile.sync @4(writable_fs.rs) routes through it AFTER theclaim_writergate. The featurecloud_nvme_consumer_sync_to_flush_proofcomposescloud_nvme_blockdevice_flush_crash_consistency_proof(arming the realflush @3op-for-cap) andcloud_writable_fs_over_nvme_proof(the consumer arm), dropping both predecessors’ proof modules. The proof (kernel/src/cap/nvme_consumer_sync_to_flush_proof.rs) seeds theCAPOSWF1image, thenDirectory.open(CREATE)+File.write(throughwriteBlocks @1), thenFile.sync()– which issues the real NVMe FLUSH (opcode0x00, CQE status0, no data transfer: PRP1 = 0, PRP2 = 0) – thenFile.readconfirming the bytes survive. The single-writer policy is shown intact: aFile.syncthrough a second granted cap fails closed BEFORE any FLUSH is issued (denied_sync_issues_no_flush=true), and the kernel asserts exactly one consumer-sync FLUSH (status0) was recorded. It emits onecloudboot-evidence: provider-nvme-consumer-sync-to-flush <token>marker labeledconsumer_sync_path=File.sync-to-nvme-flush sync_method=4 flush_method=3 nvme_flush_issued_by_consumer_sync=true nvme_flush_opcode=0x00 flush_cqe_status=0 write_sync_read_roundtrip_match=true single_writer_policy=enforced durability_proof=consumer-sync-issues-real-flush virtio_sync_noop=byte-identical. The round-trip issues 12 single I/O commands (3 seed WRITEs + 3 mount READs + 5File.writeWRITEs + 1File.syncFLUSH);NVME_IO_QUEUE_DEPTHstays 16. Proof:make run-cloud-provider-nvme-consumer-sync-to-flush. No schema/binding change. The bounded claim is consumer-sync-issues-real-flush, NOT a power-loss survival differential (unflushed_rollback=not-provable-under-qemu-nvme-model; the cross-boot forced-poweroff differential stays as the crash-consistency proof established it). Future work (not yet implemented): routing the persistentStore’s commit path through the FLUSH, graduating the NVMe data plane into production, NLB > 1 spanning multiple PRP pages, and a dedicated I/O-completionInterruptroute on the data path. -
cloud-prod-nvme-persistent-store-sync-to-flush-local-proofroutes the persistentStore’s put-commit path to a realBlockDevice.flush @3NVMe FLUSH media barrier. TheStorehas NOsyncschema method (put @0/get @1/has @2/delete @3only), so the routing point is the existing put-commit path, not a new method:persistent_store::BlockSourcegains aflush()arm (theVirtiovariant returnsOk(())– noVIRTIO_BLK_F_FLUSH, so the virtioStore.putcommit stays a byte-identical no-op andmake run-storage-persiststays green; theNvmevariant drivesnvme_brokered_io_sync_flush_op_for_capwith the same success predicate the read/write arms apply, but only when the flush lineage is composed so the plainmake run-cloud-provider-persistent-store-over-nvmecommit stays a no-op), andput_blob(persistent_store.rs) issues it AFTER theflush_superblockwrite (the ordering commit point) succeeds. A FLUSH that fails closed rolls back the in-RAMentry_count/next_free_sectorso no live index insert occurs. The featurecloud_nvme_persistent_store_sync_to_flush_proofcomposes (and drops/supersedes) thecloud_nvme_consumer_sync_to_flush_prooflineage (transitively the realflush @3op and the persistent-store-over-NVMe read+write seam). The proof (kernel/src/cap/nvme_persistent_store_sync_to_flush_proof.rs) seeds theCAPOSST1image, thenStore.put(data)– which writes the data extent / entry sector / superblock throughwriteBlocks @1, then issues the real NVMe FLUSH (opcode0x00, CQE status0, no data transfer: PRP1 = 0, PRP2 = 0) after the superblock commit – thenStore.get(hash)confirming the bytes survive. The kernel asserts exactly oneStore-commit FLUSH (status0) recorded AFTER the superblock write (superblock_commit_before_flush=true) and emits onecloudboot-evidence: provider-nvme-persistent-store-sync-to-flush <token>marker labeledconsumer_commit_path=store-put-to-nvme-flush put_method=0 flush_method=3 nvme_flush_issued_by_store_commit=true nvme_flush_opcode=0x00 flush_cqe_status=0 superblock_commit_before_flush=true put_get_roundtrip_match=true failed_flush_issues_no_live_entry=true virtio_commit_noop=byte-identical durability_proof=store-commit-issues-real-flush. The round-trip issues 9 single I/O commands (2 seed WRITEs + 2 mount READs + 3Store.putWRITEs + 1 put-commit FLUSH + 1Store.getREAD). Proof:make run-cloud-provider-nvme-persistent-store-sync-to-flush. No schema/binding change. The bounded claim is store-commit-issues-real-flush, NOT a power-loss survival differential (unflushed_rollback=not-provable-under-qemu-nvme-model). Future work (not yet implemented): graduating the NVMe data plane out of the per-proof features into always-built production, with the dedicated I/O-completionInterruptroute on the data path and NLB > 1 spanning multiple PRP pages. -
cloud-prod-nvme-sync-io-state-seam-always-built-local-proofextracts the brokered-NVMe synchronous-I/O state the shared op body depends on into ONE always-built moduledevice_manager::nvme_sync_io_state(kernel/src/device_manager/nvme_sync_io_state.rs), compiled in the default no-proofcargo build(not behind anycloud_nvme_*_prooffeature). The seam owns: the functional I/O SQ/CQ reservation cursor (reserve_io_slot()= a queue-depth-bounded first-pass slot reservation, with one live in-flight reservation, so a single-call command cannot reuse a stale CQE before the created queue wraps); the admissions predicate (sync_{read,write,flush}_admitted= bounded-ledger-not-full plus no active reservation); and the orderedSyncIoRecordop-log ledger (record_sync_{read,write,flush},DIGEST_BYTES, op kind implied byrecord.opcode=0x00FLUSH /0x01WRITE /0x02READ). The shared bodynvme_brokered_io_sync_commandand thenvme_brokered_io_sync_{read_window,write_window,flush}_op_for_cap/nvme_brokered_io_sync_read_bytes_op_for_capentries (kernel/src/device_manager/stub.rs) now record/admit/index through this seam instead of the per-proofnvme_io_proofalias; the alias stays only for the genuinely per-proof create/io-phase/seed ledgers (record_create_*,record_io_*,next_io_phase,next_seed_slba,seed_image_sector). All 15 NVMeBlockDeviceproof modules (kernel/src/cap/*_proof.rs) delegate to the seam: each deletes its privateSyncIoRecord/SyncHandoff/record/admit/index copy and reconstructs its release-marker view from the seam’s ordered op-log snapshot, byte-identical. The create-ordering / write-before-read orderings the per-proof harnesses formerly folded into their admit predicates are proof-harness assertions each proof still re-derives in its release marker; the always-built admit keeps only the production-honest bounded-ledger invariant (a real read needs no prior write). Code-location refactor only – no schema/binding change, no new device behavior; defaultcargo buildandcargo build --features qemustay warning-free (the seam carries a module-level#[allow(dead_code)]for its dormant not-yet-activated entry points). This UNBLOCKS the read-arm graduation: the read body’s sync-I/O symbols now resolve in the default build, so the graduation can make the read body always-built and gate activation behind a fail-closed runtime probe while the proof exercises the same always-built seam. Proof: every existing NVMeBlockDeviceproof stays green (make run-cloud-provider-nvme-blockdevice-arbitrary-lba-readand the rest of the chain) andmake run-netis byte-identical.
These brokered capabilities target the no-IOMMU QEMU/GCP lane, where queue-base
and PRP addresses are materialized by the kernel/device manager from the live
ledger. On a direct-remapping/vIOMMU gate the provider-written validator model
(nvme-userspace-bind-and-controller-bringup) applies instead. The PCI
metadata-only discovery summary (pci: nvme metadata ...) that also runs on
make run-pci-nvme is the separate enumeration-evidence surface in
kernel/src/pci.rs.
1. Spec basis
- Device: NVM Express PCI controller. PCI class
0x01(mass storage), subclass0x08(NVM), programming interface0x02(NVM Express). Detected byPciDevice::is_nvme_controller(kernel/src/pci.rs,NVME_CLASS_MASS_STORAGE/NVME_SUBCLASS_NVM/NVME_PROG_IF_NVM_EXPRESS). QEMU instance:-device nvme,drive=...,serial=...on theq35machine. - Authoritative spec: NVM Express Base Specification (NVMe 1.4 / 2.0). The
fields the validator relies on:
- Controller registers
CAP,CC(withCC.ENcontroller enable),AQA,ASQ,ACQ(NVMe Base Spec §3.1 controller register map).ASQ/ACQbase addresses have bits 11:0 reserved → 4 KiB page-aligned. - Submission/completion queue base addresses and the per-queue doorbell registers in the doorbell stride region (§3.1.x, §7.6 queue setup).
- Physical Region Page (PRP) entries PRP1/PRP2 and the PRP List (§4.3): PRP list pages and list-pointer PRP2 entries are page-aligned; a transfer that needs more than one PRP list page chains a further list, which this bounded subset does not follow.
- Controller registers
- Reference driver (optional cross-check): the Linux
drivers/nvme/host/queue-setup and PRP-build paths (nvme_setup_prps,nvme_pci_configure_admin_queue).
2. Wire format (validator-relevant subset)
The validator reads only the device-visible addresses a single doorbell newly
publishes, plus the byte extent of the region each names. It does not decode
command opcodes, data payloads, or completion entries. Scanned items are modeled
by ScanItem (kernel/src/cap/nvme_doorbell_validator.rs).
- Queue-base registers (
ScanItem::QueueBase): theASQ/ACQadmin queue bases (scanned on theCC.EN/ queue-arm write,ScanKind::QueueArm) and the I/OSQ/CQbases. The named region isentries × entry_sizebytes (e.g. an admin SQ isdepth × 64, an admin CQ isdepth × 16). Required alignment: page (4 KiB). - SQ entry PRP pointers (
ScanItem::Prp): for each NVMe command newly made visible by an SQ tail doorbell (ScanKind::SqTailDoorbell), the PRP1 data pointer, the PRP2 data-or-list pointer, and one level of PRP-list indirection.list_depthcounts indirection already followed (0= a PRP carried in the SQE,1= a pointer inside the single PRP list page);list_depth > 1is the out-of-subset deeper-chain case and fails closed (MAX_PRP_LIST_DEPTH). The named region is the transfer length (PRP data) or one page (a PRP list). Required alignment: page (4 KiB).
The scan is on-notify only: the provider may freely write its own mapped DMA pages between doorbells; nothing device-reachable happens until a doorbell rings, which is the single choke point the validator guards. Cost is O(descriptors published by this doorbell).
3. capOS mapping
The validator is the kernel half of the Model B genuine-userspace-driver model
(docs/proposals/nvme-model-b-doorbell-dma-validator.md): the provider writes
the device-visible queue-base and PRP addresses itself, and the kernel validates
them on the doorbell path rather than minting them (Model A, the unchanged
virtio-net TX path).
- Authority gate: the live doorbell-path hook derives the owning provider’s
identity (
OwnerToken) and live grant generation from theDeviceMmiogrant record in the device-manager ownership ledger, never from provider-supplied bytes. Thecfg(qemu)self-test resolves owner/generation from synthetic windows only. DeviceMmio: the validator is invoked from the pre-write step of the NVMe doorbell/queue-arm selected-writeDeviceMmioclaim (kernel/src/cap/device_mmio.rs; the existingnotifyDoorbellpath is the Model A virtio-net claim and does not trigger the validator). The scan completes — accept or reject — before the doorbell write is allowed to take effect, so the device never sees an unvalidated descriptor batch. BAR0 / doorbell pages stay device-uncacheable, NX, capability-scoped.DMAPool/ window descriptor: for a direct-remapping/vIOMMU lane, aDmaWindowcan name the owner’s domain-scoped IOVA range with a live generation, and provider-written values can be checked against that range. On the current no-IOMMU lane, there is no provider-visible non-host-physical device-address namespace; the manager owns the physical bounce pages and must materialize queue-base and PRP/SGL fields itself. Seedocs/dma-isolation-design.md(Provider-Written Addresses And No-IOMMU Brokered Bounce).Interrupt:completion_wakes_waiterenforces the stale-completion gate — a completion wakes a waiter only if its submission scan was accepted and the generation it was validated under is still live; an unvalidated or retired-generation completion does not wake a waiter.- Fail-closed / validation rules (
ScanReject, all reject with no doorbell write and no waiter wake):out-of-window,host-physical,cross-owner-alias,region-overrun,unaligned,deep-prp-chain,stale-generation, andinvalid-region. A doorbell rung after revoke/reset/regrant against a stale generation fails closed even when the byte value would have been in-window for the prior grant. - QEMU-emulable vs hardware-only: the validator mechanism and its hostile-scan
invariants are end-to-end provable in QEMU (
make run-pci-nvme, thenvme: validator ...proof lines). Live controller bring-up over a real NVMe controller — admin/I/O queue creation, IDENTIFY, and a bounded read with the validator gating the real doorbell — is QEMU-emulable too and is covered by the brokered bring-up capabilities (§4-§9), not the validator mechanism itself.
4. Userspace bind (read-only controller bring-up)
nvme-bind-claimed-mmio-read stands up the userspace storage-provider bind
foundation over the existing DDF driver foundation. It binds the controller
read-only – no register write, no DMA submission, and no doorbell – and so
leaves the controller’s existing (firmware-initialized) state untouched.
- Enumerate → claim → BAR0 preseed:
bind_qemu_nvme_controller(kernel/src/pci.rs) runs for the first enumerated NVMe controller after the metadata summary. It preseeds the first decoded memory BAR (BAR0 controller registers) for brokered reads (devicemmio_grant_source::preseed_read32_for_device), then claims the function and parks it underDeviceOwner::ManagerGrantSource. On any staging failure it falls back to thepci: nvme no-authority/no-driverline (fail-closed): a partially-staged authority surface is never advertised. - Grant-source staging: the same device-agnostic
{devicemmio,dmapool,interrupt}_grant_source::init_for_devicethe virtio path uses stage the bootstrap grants against the claimed NVMe handle. The virtio-net-specific provider-notify/doorbell selected-write claim is not staged here — that is the controller-enable path (§6). run-pci-nvme boots with no virtio devices, so the singletons are free for NVMe; run-net / run-ddf-provider-consumer (no-device nvme) keep the virtio bind untouched. - Brokered register reads: the
nvme-bringup-smokeprovider (demos/nvme-bringup-smoke/) holds the manifest-grantedconsole/dmapool/device_mmio/interruptcaps and readsCAP(0x0,0x4),VS(0x8),CC(0x14), andCSTS(0x1c) throughDeviceMmio.read32(the brokered boot-preseeded mapping indevice_manager::read_devicemmio_u32). It proves the bound claim reaches a coherent NVMe BAR0 by requiring a liveCAP(non-zero, non-floating) and a validVSversion (NVMe Base Spec §3.1.1/§3.1.2; QEMU reports 1.4.0), and reports the observedCC.EN/CSTS.RDY(§3.1.5/§3.1.6, bit 0 of each). - Firmware-initialized controller: under QEMU’s SeaBIOS BIOS boot, the
NVMe boot-probe enables the controller (
CC.EN=1,CSTS.RDY=1) before init runs, so the read-only bind observes a live controller. Bringing it to a known reset state (CC.EN=0, waitCSTS.RDY=0) before re-enabling with provider-owned admin queues is the controller-enable path’s responsibility (§6), not this read-only bind. - Proof line: the userspace
[nvme-bringup-smoke] controller-bind ok ... mmio_read=brokered controller_state=firmware-enabledread proof, asserted bytools/qemu-pci-nvme-smoke.sh. The kernel bind line advanced fromcontroller_init=read-only-bindtocontroller_init=reset-capable-bindwhen §5 added theCCselected-write claim to the same grant staging; the read proof itself is unchanged.
5. Userspace controller reset (selected-write CC claim)
nvme-controller-reset-selected-write is the first genuine userspace NVMe
controller-register write: it brings the firmware-enabled controller to a
known reset state. It does not enable the controller, program admin/IO
queue bases, submit DMA, or ring a doorbell – controller enable publishes
admin queue-base addresses and is the validator-gated path in §6.
- NVMe
CCselected-write claim: the NVMe bind stages theDeviceMmiogrant region with a reset-only selected-write claim (DeviceMmioWrite32ClaimProvider::NvmeControllerRegister,device_manager::nvme_controller_register_grant_region→proofs::nvme_controller_register_region), scoped to theCCregister (0x14, NVMe Base Spec §3.1.5).bind_qemu_nvme_controller(kernel/src/pci.rs) now callsdevicemmio_grant_source::init_nvme_controller_for_deviceinstead of the plaininit_for_device, so the single granted cap carries both the brokered read surface and theCCwrite claim. - Value-flexible, scoped: unlike the virtio variants (which pin an
exact
(offset, value)pair), the NVMe claim is offset-scoped and value-flexible only for reset invalidate_devicemmio_write32_claim(kernel/src/device_manager/qemu_full.rs): it admits anyCCwrite whoseCC.EN(bit 0) is clear – the read-modify-write reset – directly, fails closed on rawCC.EN=1writes withdevicemmio-nvme-cc-enable-raw-blocked, and fails closed on any write to a non-CCoffset (unclaimed-register-write). Refused writes perform no MMIO. - Reset sequence: the
nvme-bringup-smokeprovider readsCC, writes it back withCC.ENcleared throughDeviceMmio.write32(the volatile MMIO write indevice_manager::write_devicemmio_u32, resolved through the boot-preseeded BAR0 mapping), and pollsCSTS(§3.1.6) untilCSTS.RDYclears. QEMU clearsCSTS.RDYsynchronously on theCC.EN=1→0write (nvme_ctrl_reset). - No DMA validator involvement: a reset write (
CC.ENclear) publishes no queue-base or PRP addresses, so the Model B on-notify validator (§2/§3) is not invoked. The validator is invoked on the explicit brokered controller-enable queue-arm path (§6). - Proof lines (asserted by
tools/qemu-pci-nvme-smoke.sh):pci: nvme userspace-bind ... controller_init=reset-capable-bind ... cc_selected_write=staged(kernel),[nvme-bringup-smoke] cc-raw-enable-refused ...,[nvme-bringup-smoke] non-cc-write-refused ..., and[nvme-bringup-smoke] controller-reset ok ... csts_rdy_before=1 csts_rdy_after=0 cc_en_after=0 reset_write=performed ...(userspace).
6. Brokered controller enable (no-IOMMU, manager-authored admin queues)
nvme-no-iommu-brokered-controller-enable enables the controller on the
no-IOMMU make run-pci-nvme gate without exporting a host physical (== the
device-visible address on the bounce shape) or raw IOVA to the provider. The
provider invokes the explicit no-parameter
DeviceMmio.brokeredNvmeControllerEnable verb (schema @6); the manager
authors every address-bearing register and the selected CC value from its
live DMA ledger. Raw DeviceMmio.write32(CC, value with CC.EN=1) fails closed
before any MMIO side effect. The earlier
provider-written Model B enable (nvme-userspace-bind-and-controller-bringup)
stays blocked: it would require the provider to author a device-visible
queue-base, which the reviewed iova_export=disabled-future-only discipline
forbids on this gate.
- Brokered admin queue memory: the provider allocates the admin submission
and completion queue pages through the device’s
DMAPoolauthority (DmaPool.allocateBuffer), maps each read-write to fill it, and unmaps. By convention the manager reads the admin SQ from pool slot 0 (NVME_ADMIN_SQ_POOL_SLOT) and the admin CQ from slot 1 (NVME_ADMIN_CQ_POOL_SLOT). The pages stay live in the manager ledger;DmaBuffer.infocontinues to reportdevice_iova=0,iova_export=disabled-future-only,host_physical_user_visible=false. - Manager-authored queue-base registers: the @6 method dispatches through
nvme_brokered_controller_enable_op_for_cap(kernel/src/device_manager/qemu_full.rs), which validates the cap against the live NVMe controller-register claim and then callsnvme_brokered_admin_queue_enable. It resolves the admin SQ/CQ pages (record.attached_dmapools[..].proof_buffers[slot].page), then authorsAQA(0x24, zero-based admin queue sizes),ASQ(0x28/0x2c), andACQ(0x30/0x34) from the ledger page physical addresses vianvme_authored_register_write(volatile writes resolved only through the boot-preseeded BAR0 mapping), and finally performs the manager-selectedCC.EN | IOSQES=6 | IOCQES=4write. No provider-supplied controller bits or address-bearing value reaches the controller. - Validator on the queue-arm path: before any register write the authored
ASQ/ACQ bases are passed through the Model B on-notify DMA validator
(
crate::cap::nvme_doorbell_validator::validate_doorbell_scan,ScanKind::QueueArm). On this path the windows and the scanned items both derive from the same kernel ledger pages, so it is a self-consistency check on the kernel-authored bases: it proves page alignment and in-window containment of the named queue region (entries * entry_size– the admin SQ 128 B / CQ 32 B fit the 4 KiB page). The owner-identity, cross-owner-alias, host-physical, and stale-generation rejections are structurally unreachable here because both sides of each comparison come from the manager (the real authority gate against a stale/foreign page is the live-ledger membership check below); those hostile rejections are exercised by the boundedcfg(qemu)self-test (§3). A reject still fails closed before theCC.ENwrite. - Fail-closed before enable: raw
write32(CC, CC.EN=1)returnsdevicemmio-nvme-cc-enable-raw-blockedbefore any MMIO. The explicit manager-op’s real authority gate is live-ledger membership – an enable request with the admin queue pages unallocated, freed, or in-flight returnsnvme-admin-queues-not-armed(devicemmio-nvme-cc-enable-not-armed) with no MMIO side effect, covering the out-of-order manager operation and the post-free stale re-enable. A validator reject returnsdevicemmio-nvme-cc-enable-validator-reject. - Teardown under live admin queues: reset (
CC.EN=0) quiesces the controller (CSTS.RDYclears) before the admin queue pages are reused;DmaBuffer.freeBufferthen scrubs each page before the frame is freed (page_scrubbed_before_frame_free=true), and a subsequent enable with the queue memory gone fails closed. The enable path submits no admin commands, so there are no live completions or waiters during teardown; the “stale/unvalidated completion does not wake a waiter” property (completion_wakes_waiter) is proven by the boundedcfg(qemu)self-test (§3), not by the live admin-queue teardown. - Proof lines (asserted by
tools/qemu-pci-nvme-smoke.sh):[nvme-bringup-smoke] admin-queue-allocated ...(userspace),nvme: brokered-enable owner=nvme-storage trigger=manager-op admin_sq_slot=0 admin_cq_slot=1 validator=queue-arm scanned_items=2 aqa=0x00070007 cc=0x00460001 asq_authored=true acq_authored=true cc_en_write=performed cc_bits_selected_by=manager queue_base_source=manager-ledger host_physical_user_visible=false ...(kernel),[nvme-bringup-smoke] controller-enable ok ... cc_en_after=1 csts_rdy_after=1 ... brokered_enable_trigger=manager-op ...,[nvme-bringup-smoke] teardown-reset ok ... quiesced=true,[nvme-bringup-smoke] admin-queue-freed ... page_scrubbed_before_frame_free=true, and[nvme-bringup-smoke] stale-enable-refused ... brokered_enable_trigger=manager-op reason=nvme-admin-queues-not-armed(userspace). - Not in scope (this path): I/O queues, read/write commands, cloud
evidence, and host-physical/IOVA export are out of scope for the enable path.
hostile_hardware_isolation=not-claimed; the brokered no-IOMMU enable is not hostile-hardware isolation. One brokeredIDENTIFYadmin command is in §7.
7. Brokered admin command + IDENTIFY (no-IOMMU)
nvme-admin-queue-identify extends the brokered no-IOMMU lane to one admin
command. After the §6 enable, the provider submits a single IDENTIFY
(controller) admin command and consumes its completion from its own mapped admin
CQ. As on the enable path, the manager authors every address-bearing field; the
provider supplies only the non-addressing command dwords and the doorbell index.
nvme-admin-interrupt-delivery then makes that completion interrupt-driven: the
provider unmasks the admin completion interrupt route and blocks on
Interrupt.wait, the kernel wakes the live waiter through the device-interrupt
dispatch path, and only then is the completion consumed from the mapped CQ.
- Admin command (wire subset): a 64-byte submission queue entry (NVMe Base
Spec §4.2). The provider writes opcode
0x06(IDENTIFY, §5.17) at byte 0, a command id at bytes 2:3,NSID=0, andCNS=0x01(Identify Controller) in CDW10 at bytes 40:43 into the mapped admin SQ page. It leaves the address-bearing MPTR (bytes 16:23), PRP1 (bytes 24:31), and PRP2 (bytes 32:39) zero; the manager overwrites them. The IDENTIFY data structure is 4096 bytes, so a single page-aligned PRP1 covers it and PRP2 stays zero. - Doorbells: the same
NvmeControllerRegisterclaim covers the admin SQ tail doorbell (0x1000) and CQ head doorbell (0x1004) – admin queue 0 with doorbell strideCAP.DSTRD=0(NVMe Base Spec §3.1.24/§3.1.25;nvme_brokered_admin_sq_doorbellre-readsCAP.DSTRDand fails closed on a non-zero stride). The doorbell value (the tail/head index) is not address-bearing; the manager bounds it to<= NVME_ADMIN_QUEUE_DEPTHand performs the write. - Manager-authored PRP on the submit path: a write to the SQ tail doorbell
is routed by
validate_devicemmio_write32_claim(NvmeBrokeredWriteOp::AdminSqTailDoorbell) tonvme_brokered_admin_sq_doorbell(kernel/src/device_manager/mod.rs). It resolves the live admin SQ (slot 0), CQ (slot 1), and IDENTIFY data (slot 2,NVME_ADMIN_DATA_POOL_SLOT) ledger pages, authors the SQE’s MPTR/PRP1/PRP2 from the data page physical address through the SQ page’s HHDM mapping (nvme_author_admin_sqe_prp), fences, then rings the doorbell vianvme_authored_register_write. No provider-supplied address reaches the controller. - Validator on the SQ-tail path: before authoring the SQE the data-buffer
PRP1 is passed through the Model B on-notify DMA validator
(
validate_doorbell_scan,ScanKind::SqTailDoorbell,ScanItem::Prp { list_depth: 0 }): page alignment and in-window containment of the 4 KiB data region against the data page’s own device-visible window. A reject returnsdevicemmio-nvme-admin-submit-validator-rejectwith no doorbell write. As on the queue-arm path the window and scanned item are both manager-derived, so the hostile owner/host-physical/stale rejections are exercised by thecfg(qemu)self-test (§3); the live authority gate is live-ledger membership. - Interrupt-driven completion wake (
nvme-admin-interrupt-delivery): after ringing the SQ tail doorbell, the provider unmasks the admin completion interrupt route and blocks onInterrupt.wait. The route is the NVMe controller’s bootstrapInterruptgrant (DeviceOwner::ManagerGrantSource, MSI-X table entry 0, roleGrantSource); the kernel wakes the live waiter with a real LAPIC dispatch routed through the device-interrupt dispatch slot plus the deferred-EOI and waiter path – the same grant-source delivery model proven bymake run-interrupt-grantand used by the virtio-net provider. The wait returnsresult=interrupt-delivered real_interrupt_delivery=delivered wake_blocked=falsewith the route’s dispatchdelivery_countincremented. This is a kernel-injected dispatch at the route’s programmed LAPIC vector, not a device-autonomous MSI-X raise: the NVMe MSI-X table is not yet hardware- programmed for an external write (msix_table_programming=not-written, as on the DDF interrupt-grant path), so device-raised MSI-X delivery and MSI-X table programming remain a documented next increment. - Completion consume: only after the interrupt-driven wake does the provider
read completion queue entry 0 in its mapped admin CQ page (NVMe Base Spec
§4.6): a 16-byte entry whose DW3 (bytes 12:15) carries the command id (bits
15:0), the phase tag (bit 16), and the status field (bits 31:17). It checks the
status field is success and the command id matches, confirms the controller
DMA-wrote the data structure (non-zero PCI Vendor ID at IDENTIFY byte 0; QEMU’s
nvmereports0x1b36), then advances the CQ head doorbell. The completion is thus consumed after an interrupt-driven wake, with the mapped-CQ read as the consume step. - Stale/post-reset no-wake: after the IDENTIFY completes, a second live
Interrupt.waitwaiter is installed on the (driver-unmasked) route, observed to stay pending, and the route is then masked; the live waiter completesresult=interrupt-waiter-cancelled reason=route-maskedrather than woken (masked_live_waiter_woke=false). At the kernel layer the stale/unvalidated/retired-generation completion no-wake invariant (completion_wakes_waiter) remains proven by thecfg(qemu)self-test (§3,waiter_wake=none). - Fail-closed before submit: a SQ tail doorbell with the admin SQ/CQ or data
page unallocated, freed, or in-flight returns
nvme-admin-command-not-armed(devicemmio-nvme-admin-submit-not-armed) with no MMIO side effect; an out-of-range doorbell index returnsdevicemmio-nvme-admin-doorbell-out-of-range. Teardown frees the data page first and proves the post-free re-submit fails closed. - Proof lines (asserted by
tools/qemu-pci-nvme-smoke.sh):nvme: admin-submit owner=nvme-storage admin_sq_slot=0 admin_data_slot=2 validator=sq-tail-doorbell scanned_items=1 command=identify-controller prp1_authored=true ... doorbell_written=performed host_physical_user_visible=false(kernel),[nvme-bringup-smoke] admin-interrupt-route-unmasked ... route_state_after=driver-unmasked,[nvme-bringup-smoke] admin-interrupt-wake result=interrupt-delivered real_interrupt_delivery=delivered wake_blocked=false ... interrupt_driven_wake=delivered(userspace, the interrupt-driven wake),[nvme-bringup-smoke] identify-complete ok command=identify-controller cid=0x0042 status=0x0000 phase=1 ... identify_vid=0x... completion_consumed= mapped-admin-cq-after-interrupt-wake ...(userspace),nvme: admin-complete-ack ... cq_head=1 ... address_bearing=false,[nvme-bringup-smoke] identify-cq-head-advanced ...,[nvme-bringup-smoke] admin-interrupt-stale-no-wake ... result=interrupt-waiter-cancelled ... masked_live_waiter_woke=false, and[nvme-bringup-smoke] stale-submit-refused ... reason=nvme-admin-command-not-armed(userspace). - Not in scope (this path): I/O queue pairs, read/write commands, and the remaining out-of-scope items below are covered in §8.
8. Brokered I/O queue pair + bounded READ (no-IOMMU)
nvme-io-queue-and-read extends the brokered no-IOMMU lane to one I/O queue
pair and one bounded read – the last piece of the userspace NVMe
storage-provider foundation. After the §7 IDENTIFY, the provider creates one
I/O queue pair (queue id 1) through admin commands, then issues one READ on
it. As on every brokered path the manager authors each command’s
address-bearing PRP1 from a live ledger page; the provider supplies only the
non-addressing dwords and the doorbell index.
- I/O queue entry sizes: the re-enable (§6) must program
CC.IOSQES(bits 19:16, log2 of the 64 B SQ entry = 6) andCC.IOCQES(bits 23:20, log2 of the 16 B CQ entry = 4) before any I/O queue is created (NVMe Base Spec §3.1.5); aCC.EN1->0 reset clears all ofCC, so the provider sets them explicitly (resultingCC = 0x00460001). Creating an I/O queue withIOCQES/IOSQESunset is refused by the controller (QEMU returns command-specific Invalid Queue Size). - Create I/O queue commands (wire subset):
CREATE I/O COMPLETION QUEUE(opcode0x05, NVMe Base Spec §5.3) andCREATE I/O SUBMISSION QUEUE(opcode0x01, §5.4) are admin commands submitted on the admin SQ. CDW10 carries the zero-based queue size (bits 31:16) and queue id (bits 15:0); the create-CQ CDW11 sets PC=1 with IEN=0 (no I/O interrupt; the completion is polled), the create-SQ CDW11 sets the completion queue id (bits 31:16) and PC=1. PRP1 (the queue base) is left zero by the provider and authored by the manager from the I/O CQ (slot 3) / I/O SQ (slot 4) ledger page. - Opcode-directed PRP authoring: a write to the admin SQ tail doorbell is
routed to
nvme_brokered_admin_sq_doorbell(kernel/src/device_manager/mod.rs), which reads the opcode the provider wrote into the just-published SQE (at SQ indextail - 1) and maps it to the live ledger page whose device-visible address is that command’s PRP1:IDENTIFY-> IDENTIFY data (slot 2),CREATE I/O CQ-> I/O CQ (slot 3),CREATE I/O SQ-> I/O SQ (slot 4). An unrecognized opcode fails closed (devicemmio-nvme-admin-submit-unknown-opcode). The opcode is the only provider-supplied input consulted and is non-addressing. - READ command (wire subset): opcode
0x02(NVM Command Set §3.x), NSID 1, starting LBA 0 in CDW10/11, NLB 0 (zero-based, one block) in CDW12. PRP1 (the data buffer) is authored by the manager from the read data page (slot 5). - I/O doorbells: the same
NvmeControllerRegisterclaim covers the I/O SQ tail doorbell (0x1008) and I/O CQ head doorbell (0x100c) – queue id 1 with doorbell strideCAP.DSTRD=0(SQytail at0x1000 + (2y)*4, CQyhead at0x1000 + (2y+1)*4). The I/O SQ tail doorbell is routed tonvme_brokered_io_sq_doorbell, which requires the opcode to beREAD, materializes the READ PRP1 from the live read data page, validates it through the Model B on-notify validator (ScanKind::SqTailDoorbell), authors the SQE PRP, and rings the doorbell. The I/O CQ head doorbell (nvme_brokered_io_cq_head_doorbell) carries only the consumed-entry head index (no address-bearing field). - Completion consume: the create-queue and READ completions are consumed by
polling the mapped CQ phase tags. The single kernel grant-source injected
interrupt delivery is spent on the §7 admin IDENTIFY wake
(
delivery_count_before == 0gates injection to one delivery per route), so an interrupt-driven I/O completion wake awaits the device-autonomous MSI-X table-programming increment that §7 already defers. The provider confirms the controller DMA-transferred real data by checking the harness-seeded LBA 0 signature (0x4f504143= “CAPO”) in its mapped read data page – proving the read moved bytes through the brokered PRP, not a zero page. - Fail-closed before submit: an I/O SQ tail doorbell with the I/O CQ/SQ or
read data page unallocated, freed, or in-flight returns
nvme-io-command-not-armed(devicemmio-nvme-io-submit-not-armed) with no MMIO side effect; an out-of-range index returnsdevicemmio-nvme-io-doorbell-out-of-range; an opcode other thanREADreturnsdevicemmio-nvme-io-submit-unknown-opcode. Teardown frees the read data page first and proves the post-free I/O re-submit fails closed. - Proof lines (asserted by
tools/qemu-pci-nvme-smoke.sh):nvme: admin-submit ... command=create-io-cq ... admin_data_slot=3 sq_tail=2,[nvme-bringup-smoke] create-io-cq-complete ok ...,nvme: admin-submit ... command=create-io-sq ... admin_data_slot=4 sq_tail=3,[nvme-bringup-smoke] create-io-sq-complete ok ...,nvme: io-submit owner=nvme-storage io_queue_id=1 io_sq_slot=4 io_read_data_slot=5 ... command=read ... io_sq_tail=1 doorbell_offset=0x1008 doorbell_written=performed host_physical_user_visible=false(kernel),[nvme-bringup-smoke] io-read-complete ok command=read cid=0x0053 status=0x0000 ... io_read_dword0=0x4f504143 ... completion_consumed=mapped-io-cq-polled(userspace, the read data proof),nvme: io-complete-ack ... io_cq_slot=3 cq_head=1 ... address_bearing=false, and[nvme-bringup-smoke] io-stale-submit-refused ... reason=nvme-io-command-not-armed(teardown). - Not in scope: device-autonomous MSI-X delivery (hardware MSI-X table
programming, a device-raised I/O completion interrupt, and an interrupt-driven
I/O completion wake), multi-block / write / scatter-gather (PRP-list) I/O,
cloud (GCP/AWS/Azure) enumeration or evidence, and host-physical/IOVA export
remain out of scope.
hostile_hardware_isolation=not-claimed.
9. Production-path cloudboot proofs (non-qemu cloud kernel)
This section covers the non-qemu cloudboot kernel proofs. The older
cloud-prod-storage-bound-local-proof storage-bind path predates the later
production-stub NVMe manager operations: it binds
DeviceMmio/DMAPool/Interrupt surfaces to one NVMe function and
exercises an interrupt-dispatch proxy, but does not attempt controller enable,
admin commands, I/O queues, IDENTIFY, READ, or a userspace storage provider.
- Older storage-bind proxy:
cap::storage_bind_proof::report(kernel/src/cap/storage_bind_proof.rs) runs under#[cfg(not(feature = "qemu"))]duringkernel::run_init(kernel/src/main.rs). It selects an NVMe function withPciDevice::is_nvme_controller(kernel/src/pci.rs), stages a readbackDeviceMmiorecord throughdevice_manager::stage_bar_readback_region, a parked bounceDMAPool/DMABufferthroughdevice_manager::stage_bounce_buffer_dmapool_recordanddevice_manager::issue_manager_attached_dmabuffer_handle_with_request, and one MSI-X interrupt route throughdevice_interruptplus mask-first PCI MSI-X table programming. The I/O-completion evidence is a kernel-side proxy:device_interrupt::handle_lapic_deliveryadvances the live dispatch slot, deferred EOI is acknowledged, masked no-wake is checked, and teardown proves stale route/pool/buffer/MMIO handles fail closed. Its marker iscloudboot-evidence: storage-bound <token>, with summary fields such asnvme_admin_identify=not-attempted,nvme_read_command=not-attempted, andwaiter_wake=kernel-side-proxy. - Later production-stub manager ops: the non-
qemucloudboot kernel also implements real production-stub NVMe operations for the same local QEMU/cloudboot lane. The read-only bind, reset-onlyCC.EN=0selected-write claim, parked admin SQ/CQ/dataDMABuffermaterialization, brokered controller-enable manager operation (DeviceMmio.brokeredNvmeControllerEnable @6), and brokered adminIDENTIFY Controllermanager operation (DeviceMmio.brokeredNvmeAdminIdentify @7) live inkernel/src/device_manager/stub.rsand the production grant-source modules. These operations are not the storage-bind proxy: the manager authorsAQA/ASQ/ACQ,CC.EN=1, the fixed IDENTIFY SQE, PRP1, SQ tail doorbell, CQ polling, and CQ head doorbell from its parked ledger. The provider still supplies no host-physical address, IOVA, queue base, PRP/SGL, opcode, command id, doorbell offset, or doorbell value. - Local production provider chain: the moved
cloud-prod-nvme-brokered-userspace-provider-local-proofparent is closed by production-stub child records over the non-qemucloudboot kernel. The local QEMU/cloudboot chain reaches split admin completion (@8/@9plusInterrupt.wait/Interrupt.acknowledge), I/O queue creation (@10/@11), bounded READ/WRITE (@12-@15), second-LBA/multiblock I/O (@16-@19), synchronous read/write and read-bytes (@20-@22),BlockDevice.readBlocks/writeBlocks/ FLUSH, higher-level filesystem and Store consumers, a dedicated data-path completionInterruptroute, and multi-PRPBlockDevicewindows. These are manager-authored brokered operations inkernel/src/device_manager/stub.rsand the production proof modules, not provider-authored Model B doorbell writes. - READ-arm graduation to always-built production (
cloud-prod-nvme-storage-graduate-readarm-local-proof): the NVMeBlockDeviceREAD arm is the first capstone piece graduated OUT of the per-proofcloud_nvme_*_prooffeatures into always-built production code. TheBlockDeviceBackend::NvmeBrokeredarm and its arbitrary-windowreadBlocks @0body (kernel/src/cap/block_device.rs,cfg(not(qemu))), the shared read bodynvme_brokered_io_sync_command/nvme_brokered_io_sync_read_window_op_for_capand the brokered controller bring-up registers/helpers it reaches (kernel/src/device_manager/stub.rs), andlive_handle_for_nvme_blockdevicenow compile in the default no-proofcargo build/make capos-cloudboot-imagekernel – the GCE-validated production composition. ACTIVATION is fronted by a fail-closed runtime capability probekernel/src/nvme_storage_backend.rs(dma_backend.rs-style atomic verdict +select_nvme_blockdevice_handle()resolver): the cap is minted only when a stageddevice_mmiogrant resolves a live brokered-controller handle (recording the verdict), else a typed error is returned – never a panic. The no-NVMe default boot leaves the probe unverified, so theblock_devicegrant fails closed.writeBlocks/flushstay fail-closed on the graduated arm (named follow-up graduations). The graduated data plane is bounded, not a general-purpose driver: every command runs through the synchronous single-call seam (kernel/src/device_manager/nvme_sync_io_state.rs), which admits at most 64 single-call I/O commands per boot (MAX_SYNC_OPS) and permanently rejects further commands at the first I/O CQ wrap (no CQ phase-toggle handling) – both limits fail closed. Namespace geometry is IDENTIFY-derived, not assumed: after the fixed three-command bring-up sequence completes, the first geometry consultation issues one manager-authored IDENTIFY Namespace (CNS0x00, NSID 1, admin SQ index 3 / tail 4, NVMe Base Spec §5.17) throughnvme_namespace_geometry_for_cap(kernel/src/device_manager/stub.rs), parses NSZE plus the active LBA format (FLBAS + LBAFLBADS/MS), caches the verdict for the boot, and emitsnvme: brokered-identify-namespace ... nsze=... flbas=... lbads=... supported=....BlockDevice.info @2and thereadonly_fs/persistent_store/writable_fsNVMeBlockSource::inforeport this IDENTIFY-derived geometry, and every read/write window bound is enforced against it; while the claim is unavailable (bring-up incomplete, a failed claim reset the controller, or an unsupported format – anything other than 512 B data blocks with no interleaved metadata) those paths fail closed instead of falling back to a fixture constant. Proof:make run-cloud-provider-nvme-blockdevice-read-graduatedemitscloudboot-evidence: provider-nvme-blockdevice-read-graduated <token>(read_arm=always-built data_plane_feature_gated=false probe_verdict=verified nvme_read_roundtrip_match=true). This is a local QEMU/cloudboot proof; it does NOT claim a live cloud NVMe run, direct DMA, IOVA export, or a write/durability graduation. - Production boundary: one production-stub NVMe path now has live GCE
Persistent Disk evidence:
provider-nvme-io-readcompleted one brokered 512-byte READ on run1780806087-bf69. Other production-stub NVMe proofs remain local QEMU/cloudboot evidence unless their task record explicitly says otherwise. The current evidence still does not claim direct DMA, cloud/guest IOMMU support, provider-visible device addresses, device-autonomous MSI-X delivery, AWS/Azure storage, a reusable storage provider, or full filesystem integration. The NVMeBlockDeviceREAD data plane is graduated to always-built production (above); other write/FLUSH/filesystem consumers and broader windows have local proof coverage but remain bounded by their recorded production proof gates unless their task record explicitly says the surface was graduated.
AWS Nitro EBS (NVMe storage)
This is a provenance map for the AWS Nitro EBS storage shape: how an AWS Nitro instance presents its EBS volumes to the guest, why that surface is the same standard NVMe device the shared NVMe storage-provider foundation already drives, and the small AWS delta capOS adds on top of it. It is not a re-spec; the NVMe register/queue/PRP wire subset capOS actually touches is documented once in NVMe and not repeated here.
Maturity caveat. This page documents a local QEMU cloud-shape
classification, not a bound driver running on real AWS hardware. The NVMe
bind/identify/read lifecycle is proven locally on make run-pci-nvme against
QEMU’s -device nvme; the AWS delta is the AWS-context classification proof
line and the Nitro DMA-backend policy note on top of that shared NVMe
foundation. End-to-end AWS EBS enumeration, live namespace I/O, and cloud
evidence capture are future work (tracked as cloud-aws-storage-live-proof),
blocked until AWS access is provisioned. The ENA NIC is a distinct
driver-binding claim (cloud-aws-ena-nic-live-proof) and is out of scope here.
1. Spec basis
- Device: AWS Nitro EBS controller. All AWS Nitro-based instance families
(effectively all current generations) expose attached EBS volumes as NVMe
namespaces behind a standard NVMe PCI controller – there is no AWS-specific
storage transport and no virtio-scsi alternative (unlike GCP, whose
first-/second-generation families use virtio-scsi). PCI class
0x01(mass storage), subclass0x08(NVM), programming interface0x02(NVM Express) – the same class triple QEMU emulates with-device nvmeand the kernel detects withPciDevice::is_nvme_controller(kernel/src/pci.rs). - Production PCI identity: the Nitro EBS controller carries Amazon’s PCI
vendor id
0x1d0f(device id0x8061for the EBS NVMe controller), distinct from QEMU’s0x1b36. capOS therefore classifies on the device class surface and the brokered no-IOMMU bounce DMA shape, not on a vendor-id match (see §3); the live vendor-id confirmation belongs to the deferredcloud-aws-storage-live-proof. - Authoritative spec: the NVM Express Base Specification (NVMe 1.4 / 2.0) is
the wire contract; AWS publishes no separate EBS register spec because the
device is a standard NVMe controller. AWS documents the namespace exposure
in the “Amazon EBS and NVMe on Linux instances” guide
(https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html);
the in-guest reference driver is the upstream Linux
drivers/nvme/host/. - Wire-format subset capOS implements: identical to the standard NVMe subset
documented in NVMe §1-§2 (controller registers
CAP/CC/AQA/ASQ/ACQ/CSTS, the admin and one I/O submission/completion queue pair, per-queue doorbells, and PRP1/PRP2 data pointers). Nitro EBS adds no fields beyond that subset, so this page does not re-list them.
2. Wire format (relevant subset)
See NVMe §2 and §6-§8. There is no AWS-specific wire format to
document: the brokered controller enable (manager-authored AQA/ASQ/ACQ),
the admin IDENTIFY, the one I/O queue pair, and the bounded READ all use the
standard NVMe encoding the shared foundation already implements and proves.
3. capOS mapping
The AWS delta is a cloud-shape classification plus a DMA-backend policy consumption detail layered onto the shared NVMe storage-provider foundation; it adds no new driver code.
- Cloud-shape classification proof: after the first enumerated NVMe
controller is bound (
bind_qemu_nvme_controller), the enumeration path emits anvme: cloud shape classification cloud_shape=aws-nitro-ebs ...proof line (kernel/src/pci.rsreport_cloud_nvme_shape) classifying the bound controller against the documented AWS Nitro EBS device surface. It prints the enumeratedpci_vendor/pci_device_idandclass/subclass/prog_if, records the productionaws_nitro_ebs_vendor=0x1d0fidentity as documentation (not as a claimed match), and carries explicit scope flags (local_qemu_precursor=true,real_aws_enumeration=not-claimed,ena=separate-nic-driver-out-of-scope).make run-pci-nvmeasserts this line conjunctively with the bounce-bufferdma: backend selectionline (tools/qemu-pci-nvme-smoke.shassert_nvme_cloud_shape), tying the bound device surface to the DMA backend resolved that boot. - Nitro IOMMU-availability DMA-backend policy: AWS Nitro does not guarantee
guest VT-d remapping the way QEMU’s emulated IOMMU does, so the DMA backend the
live AWS path consumes is selected by
cloud-dma-backend-selection(kernel/src/dma_backend.rsselect_and_report): direct-remapping where a usable+safe IOMMU is positively probe-verified, else the labeled bounce-buffer fallback. The classification line labels the expected backend (aws_labeled_dma_backend=bounce-buffer,dma_backend_policy=direct-remapping-if-verified-else-bounce-buffer); the resolved backend is proven separately by thedma: backend selectionline, which on the no-IOMMUmake run-pci-nvmegate isbounce-buffer. - Brokered DMA / no host-physical exposure: the binding lifecycle reuses the
brokered no-IOMMU lane documented in NVMe §6-§8 – the
manager authors every address-bearing register and PRP from the live DMA
ledger, and
host_physical_user_visible=falseholds throughout. On a verified remapping lane the provider-written Model B path would apply instead; on the no-IOMMU gate the brokered bounce shape is the only consistent path (seedocs/dma-isolation-design.md, “Provider-Written Addresses And No-IOMMU Brokered Bounce”). DeviceMmio/Interrupt/DMAPool: unchanged from the shared foundation – the reset-onlyCCselected-write claim, the brokered admin and I/O doorbells, the interrupt-driven admin completion wake, and theDMAPool-allocated queue/data pages described in NVMe §4-§8.- QEMU-emulable vs hardware-only: the classification and the full
bind/identify/read lifecycle are end-to-end QEMU-emulable (
make run-pci-nvme). Live EBS enumeration over a real Nitro controller – vendor-id0x1d0fconfirmation, real namespace geometry, and live block I/O – is hardware-only and is the deferredcloud-aws-storage-live-proof.
Related
- NVMe – the shared NVMe controller wire subset and brokered no-IOMMU storage-provider foundation this shape binds onto.
- virtio-net – the worked cloud-shape classification example (GCP virtio-net) this page mirrors for AWS storage.
docs/dma-isolation-design.md– the DMA-backend selection model and the no-IOMMU brokered bounce policy.docs/backlog/hardware-boot-storage.md– the cloud device tracks, including the deferred live-AWS storage proof.
Azure managed disk (NVMe storage)
This is a provenance map for the Azure managed-disk storage shape: how an Azure VM presents its managed (and local) disks to the guest, why the modern surface is the same standard NVMe device the shared NVMe storage-provider foundation already drives, why the older-family SCSI path is not a usable alternative here, and the small Azure delta capOS adds on top of the shared foundation. It is not a re-spec; the NVMe register/queue/PRP wire subset capOS actually touches is documented once in NVMe and not repeated here.
Maturity caveat. This page documents a local QEMU cloud-shape
classification, not a bound driver running on real Azure hardware. The NVMe
bind/identify/read lifecycle is proven locally on make run-pci-nvme against
QEMU’s -device nvme; the Azure delta is the Azure-context classification
proof line and the Azure DMA-backend policy note on top of that shared NVMe
foundation. End-to-end Azure managed-disk enumeration, live namespace I/O, and
cloud evidence capture are future work (tracked as
cloud-azure-storage-live-proof), to be done when Azure access is provisioned.
The Azure MANA NIC is a distinct driver-binding claim
(see Azure MANA) and is out of scope here.
1. Spec basis
- Device: Azure managed-disk storage controller. Azure presents storage in
two shapes depending on VM generation:
- Azure Boost and newer NVMe-capable families expose managed disks (and
local SSD) as NVMe namespaces behind a standard NVMe PCI controller –
PCI class
0x01(mass storage), subclass0x08(NVM), programming interface0x02(NVM Express). This is the same class triple QEMU emulates with-device nvmeand the kernel detects withPciDevice::is_nvme_controller(kernel/src/pci.rs). This is the path this page documents. - Older VM families present managed disks over a Hyper-V SCSI
controller (a virtio-scsi-shaped interface). capOS has no userspace
virtio-scsi provider driver, and
make run-virtio-blkproves the kernel-owned virtio-blk driver – a kernel-owned driver leaves the hidden kernel DMA ownership the userspace-provider acceptance forbids. The SCSI path is therefore out of scope for this driver (recorded on the classification line asazure_scsi_path=no-userspace-provider-driver-out-of-scope); supporting it would be a separate userspace virtio-scsi provider-driver foundation, not a re-use of therun-virtio-blkgate.
- Azure Boost and newer NVMe-capable families expose managed disks (and
local SSD) as NVMe namespaces behind a standard NVMe PCI controller –
PCI class
- Production PCI identity: the Azure Boost NVMe controller carries
Microsoft’s PCI vendor id
0x1414, distinct from QEMU’s0x1b36. capOS therefore classifies on the device class surface and the brokered no-IOMMU bounce DMA shape, not on a vendor-id match (see §3); live vendor-id confirmation and real namespace geometry belong to the deferredcloud-azure-storage-live-proof. - Authoritative spec: the NVM Express Base Specification (NVMe 1.4 / 2.0) is
the wire contract; Azure publishes no separate managed-disk register spec
because the modern device is a standard NVMe controller. Azure documents the
Boost NVMe interface and namespace exposure in the “Azure Boost” and
“Enable NVMe” VM documentation
(https://learn.microsoft.com/azure/virtual-machines/enable-nvme-interface);
the in-guest reference driver is the upstream Linux
drivers/nvme/host/. - Wire-format subset capOS implements: identical to the standard NVMe subset
documented in NVMe §1-§2 (controller registers
CAP/CC/AQA/ASQ/ACQ/CSTS, the admin and one I/O submission/completion queue pair, per-queue doorbells, and PRP1/PRP2 data pointers). Azure Boost adds no fields beyond that subset, so this page does not re-list them.
2. Wire format (relevant subset)
See NVMe §2 and §6-§8. There is no Azure-specific wire format to
document: the brokered controller enable (manager-authored AQA/ASQ/ACQ),
the admin IDENTIFY, the one I/O queue pair, and the bounded READ all use the
standard NVMe encoding the shared foundation already implements and proves.
3. capOS mapping
The Azure delta is a cloud-shape classification plus a DMA-backend policy consumption detail layered onto the shared NVMe storage-provider foundation; it adds no new driver code.
- Cloud-shape classification proof: after the first enumerated NVMe
controller is bound (
bind_qemu_nvme_controller), the enumeration path emits anvme: cloud shape classification cloud_shape=azure-managed-disk ...proof line (kernel/src/pci.rsreport_cloud_nvme_shape_azure, alongside the AWSreport_cloud_nvme_shape) classifying the same bound controller against the documented Azure managed-disk device surface. It prints the enumeratedpci_vendor/pci_device_idandclass/subclass/prog_if, records the productionazure_nvme_vendor=0x1414identity as documentation (not as a claimed match), records the out-of-scope SCSI path (azure_scsi_path=no-userspace-provider-driver-out-of-scope), and carries explicit scope flags (local_qemu_precursor=true,real_azure_enumeration=not-claimed,mana=separate-nic-driver-out-of-scope).make run-pci-nvmeasserts this line (tools/qemu-pci-nvme-smoke.shassert_nvme_cloud_shape_azure) in the same boot as the bounce-bufferdma: backend selectionline asserted byassert_nvme_cloud_shape, tying the bound device surface to the DMA backend resolved that boot. - Azure IOMMU-availability DMA-backend policy: Azure does not guarantee a
guest-visible VT-d/IOMMU the way QEMU’s emulated IOMMU does, so the DMA backend
the live Azure path consumes is selected by
cloud-dma-backend-selection(kernel/src/dma_backend.rsselect_and_report): direct-remapping where a usable+safe IOMMU is positively probe-verified, else the labeled bounce-buffer fallback. The classification line labels the expected backend (azure_labeled_dma_backend=bounce-buffer,dma_backend_policy=direct-remapping-if-verified-else-bounce-buffer); the resolved backend is proven separately by thedma: backend selectionline, which on the no-IOMMUmake run-pci-nvmegate isbounce-buffer. - Brokered DMA / no host-physical exposure: the binding lifecycle reuses the
brokered no-IOMMU lane documented in NVMe §6-§8 – the manager
authors every address-bearing register and PRP from the live DMA ledger, and
host_physical_user_visible=falseholds throughout. On a verified remapping lane the provider-written Model B path would apply instead; on the no-IOMMU gate the brokered bounce shape is the only consistent path (seedocs/dma-isolation-design.md, “Provider-Written Addresses And No-IOMMU Brokered Bounce”). DeviceMmio/Interrupt/DMAPool: unchanged from the shared foundation – the reset-onlyCCselected-write claim, the brokered admin and I/O doorbells, the interrupt-driven admin completion wake, and theDMAPool-allocated queue/data pages described in NVMe §4-§8.- QEMU-emulable vs hardware-only: the classification and the full
bind/identify/read lifecycle are end-to-end QEMU-emulable (
make run-pci-nvme). Live managed-disk enumeration over a real Azure Boost controller – vendor-id0x1414confirmation, real namespace geometry, and live block I/O – is hardware-only and is the deferredcloud-azure-storage-live-proof.
Related
- NVMe – the shared NVMe controller wire subset and brokered no-IOMMU storage-provider foundation this shape binds onto.
- AWS Nitro EBS (NVMe storage) – the sibling cloud NVMe storage shape; same shared foundation, different cloud provenance. AWS is NVMe-only with no SCSI alternative, whereas Azure’s older families use SCSI.
- virtio-net – the worked cloud-shape classification example (GCP virtio-net) the storage classifications mirror.
- Azure MANA – the distinct Azure NIC driver-binding claim, out of scope for this storage surface.
docs/dma-isolation-design.md– the DMA-backend selection model and the no-IOMMU brokered bounce policy.docs/backlog/hardware-boot-storage.md– the cloud device tracks, including the deferred live-Azure storage proof.
GCP Persistent Disk (storage)
This is a provenance map for the GCP Persistent Disk (PD) storage shape: how a GCE instance presents its persistent disks to the guest, why most current families expose them as standard NVMe namespaces the shared NVMe foundation already drives, and the small GCP delta capOS adds on top. It is not a re-spec; the NVMe register/queue/PRP wire subset capOS actually touches is documented once in NVMe and not repeated here.
Maturity caveat. This page documents one bounded live-GCE NVMe Persistent
Disk proof on a c3-standard-4 VM, plus the local QEMU/cloudboot proofs that
preceded it. The live proof is a single brokered NVMe READ through provider
authority; it is not a general reusable storage provider, filesystem
integration, virtio-scsi path, Local SSD path, direct-DMA claim, or
device-autonomous MSI-X claim. The older
cloud-prod-storage-bound-local-proof composes production grant surfaces over a
discovered NVMe function and emits
cloudboot-evidence: storage-bound on a local boot of the
make capos-cloudboot-image disk under QEMU. The later
cloud-prod-nvme-brokered-userspace-provider-local-proof child chain drives the
same local QEMU -device nvme surface through brokered controller bring-up,
admin IDENTIFY, I/O queue creation, BlockDevice read/write/flush, a
dedicated data-completion Interrupt route, and multi-PRP windows while
preserving manager-authored queue-base/PRP materialization. The live GCE
closeout is the cloud-gcp-storage-driver run described in §6.
1. Spec basis
- Device: GCE Persistent Disk. GCE exposes attached PD volumes as a block
device on the guest PCI surface. The legacy first-/second-generation
families use
virtio-scsi; current generations (Tau T2A, third-generation-or-later N2/N2D/C3, Confidential VM paths) expose them as NVMe namespaces behind a standard NVMe PCI controller – PCI class0x01(mass storage), subclass0x08(NVM), programming interface0x02(NVM Express) – the same class triple QEMU emulates with-device nvmeand the kernel detects withPciDevice::is_nvme_controller(kernel/src/pci.rs). - Production PCI identity: the GCE NVMe PD controller carries Google’s
PCI vendor id (current generation
0x1ae0, distinct from QEMU’s0x1b36). capOS therefore classifies on the device class surface and the brokered no-IOMMU bounce DMA shape, not on a QEMU vendor-id match (see §3). The livecloud-gcp-storage-driverrun confirmed the GCE NVMe PD identity asvendor.1ae0/dev.001fon BDF0000:00:05.0. - Authoritative spec: the NVM Express Base Specification (NVMe 1.4 / 2.0) is the wire contract; Google publishes no separate PD register spec because the device is a standard NVMe controller on the NVMe-family GCE shapes. Google documents PD device exposure under the “Persistent Disk overview” and “Local SSD” pages (https://cloud.google.com/compute/docs/disks).
- virtio-scsi alternative: older GCE families use
virtio-scsifor PD rather than NVMe. capOS has no userspace virtio-scsi provider driver and the in-treemake run-virtio-blkproves the kernel-owned virtio-blk driver, which would leave the hidden kernel DMA ownership the userspace-provider acceptance forbids. So the older-familyvirtio-scsipath is recorded out of scope here (gcp_scsi_path=no-userspace-provider-driver-out-of-scope), the same shape asdocs/devices/azure-disk.mdrecords for the Hyper-V/virtio-scsi older-family path.
2. Wire format (shared with docs/devices/nvme.md)
GCE NVMe PD is standard NVMe: the controller registers, admin SQ/CQ
descriptors, IDENTIFY data, I/O SQ/CQ descriptors, PRP entries, and the
on-notify validator scan targets are exactly the ones documented in
NVMe §2. No GCP-specific subset is reproduced here. The
shared NVMe storage-provider foundation
(nvme-bind-claimed-mmio-read,
nvme-controller-reset-selected-write,
nvme-no-iommu-brokered-controller-enable,
nvme-admin-queue-identify,
nvme-admin-interrupt-delivery,
nvme-io-queue-and-read) is the same wire model the local production
cloudboot chain ports into kernel/src/device_manager/stub.rs and the
production grant-source modules. The cloud-gcp-storage-driver closeout
validated that provider/storage binding against the live GCE PD controller
identity and evidence surface for one bounded NVMe READ.
3. capOS mapping
- Cloud-shape classification:
kernel/src/pci.rsreport_cloud_nvme_shape(the GCP path) classifies the bound controller against the GCE NVMe surface and emits thenvme: cloud shape classification cloud_shape=gcp-persistent-disk ...proof line onmake run-pci-nvme, conjunctively with the bounce-bufferdma: backend selectionline. - DMA backend: GCE IOMMU-availability is the
direct-remapping-if-verified-else-bounce-buffer policy from
cloud-dma-backend-selectionand the “Cloud DMA Backend” section ofdocs/dma-isolation-design.md. The 2026-05-24 GCE live probes recordedn1-standard-1,e2-small,c3-standard-4, andn2d-standard-2Confidential shapes asIOMMU disabled → SWIOTLB → labeled bounce-bufferin Cloud DMA Provider Evidence Inventory, so the cloud-shape proof line and the production storage-bind proof both run conjunctively with the bounce-buffer DMA backend. - No host-physical / IOVA export:
iova_export=disabled-future-only,host_physical_user_visible=0,direct_dma=blocked,real_dma=not-attempted— the same brokered-bounce shape NVMe records in §6–§8 ofnvme.mdand the production storage-bind proof records in §9.
4. Production storage-bind proof (local QEMU; non-qemu kernel)
cloud-prod-storage-bound-local-proof (the prerequisite of the billable
cloud-gcp-storage-driver slice) lands the production-path NVMe storage-bind
proof on the non-qemu cloud kernel. The implementation, composition, MSI-X
table program, I/O-completion handoff (kernel-side proxy), masked-no-wake,
teardown / stale-handle assertions, headline cloudboot evidence shape, why
the proof is settled with a kernel-side proxy, and asserted proof lines are
documented once in nvme.md §9
and not reproduced here. The marker is parsed by tools/cloudboot/run-test.sh
as STORAGE_BOUND_MARKER into provider.json.storage_bind_proof.
The local QEMU boot of target/disk.raw (make capos-cloudboot-image,
-device nvme) demonstrates the bound on QEMU’s NVMe class triple; it does not
exercise a live GCE PD NVMe vendor id.
5. Local production brokered NVMe provider chain
The moved parent
cloud-prod-nvme-brokered-userspace-provider-local-proof
closes the local production provider prerequisite through its child records.
The implemented path is the same brokered no-IOMMU shape as nvme.md: the
manager authors AQA/ASQ/ACQ, queue-base pages, PRP1 entries, PRP lists,
doorbells, and completion consumption from live DMAPool ledger records. The
provider sees capability results and returned data bytes, not host-physical
addresses, IOVAs, queue-base values, or provider-authored PRP/SGL fields.
The local evidence covers:
- brokered controller enable and admin
IDENTIFY; - I/O queue creation, bounded READ/WRITE, second-LBA and multiblock I/O;
BlockDevice.readBlocks,writeBlocks, and FLUSH-backed higher-level consumers over theNvmeBrokeredbackend;- dedicated data-path
Interrupt.wait/Interrupt.acknowledgecompletion proof; - multi-PRP windows larger than one PRP1 page, with PRP list entries written by the manager.
This remains the local QEMU/cloudboot foundation under the same brokered authority model. The billable real-GCE Persistent Disk bind run is the bounded NVMe evidence in §6.
6. Live GCE NVMe Persistent Disk proof
cloud-gcp-storage-driver closed with live GCE run 1780806087-bf69, launched
by make cloudboot-gcp-storage-nvme-io-read-test at source commit
28518165518c29a48633682f4a6d9b5844c43335. The run used a c3-standard-4
instance in europe-west3-a with storage_interface=nvme. The harness launched
with GVNIC guest feature / NIC type because C3 requires that launch posture;
this storage page does not claim a gVNIC driver or NIC datapath proof.
The evidence identified the GCE PD NVMe controller as class 01.08.02,
vendor.1ae0, device.001f, BDF 0000:00:05.0, with
selected_dma_backend=bounce_buffer and enumeration_source=legacy-io. The
manager drove the shared brokered NVMe chain: admin IDENTIFY, I/O CQ/SQ
creation, and one I/O READ against NSID 1, SLBA 0, NLB 1 / 512 bytes. The
serial marker recorded live_cloud=gce-persistent-disk,
io_read=completed, io_sq_doorbell=performed,
io_cq_completion=polled-io-cq, prp_source=manager-ledger,
host_physical_user_visible=0, and iova_export=disabled-future-only. The
read digest prefix was eb3c904c494d494e4520200002000000.
The capOS authority mapping is the same one recorded in nvme.md: DeviceMmio
gates BAR register and doorbell effects, DMAPool owns queue/data pages and
manager-authored PRP materialization, and Interrupt is present as the bounded
provider authority surface. The live read proof polls the I/O CQ; it does not
claim device-autonomous MSI-X delivery. The cloud harness evidence also recorded
no public IP, no service account, and teardown_status=complete.
7. Not in scope
- The older-family
virtio-scsiPD path (gcp_scsi_path=no-userspace-provider-driver-out-of-scope). - The Local SSD storage path (separate device surface, deferred).
- Multi-namespace, FUA, DSM, reusable
BlockDevice/filesystem integration on live GCE, or live-provider device-autonomous completion delivery (deferred pernvme.md). - Direct DMA, IOVA export, IOMMU/remapping programming (the
direct-remapping-if-verifiedbranch of the DMA-backend policy applies once a GCE shape with a verified vIOMMU is added; no current probed GCE shape satisfies that branch). - AWS EBS, Azure managed disk, and GCP NIC readiness.
ATAPI CD-ROM + ISO 9660 (boot-time reader)
This is a provenance map for the boot-time CD-ROM read path: it cites the
specs, summarizes only the wire-format subset the code actually implements, and
points into the implementation. It is not a re-spec. Unlike the PCI/virtio
device pages, this is a legacy port-I/O hardware transport used only during
boot or install-source proofs to read ELF/package bytes from an ISO; the
capOS-mapping section reflects its boot-only, kernel-owned status. The boot
source itself is planned CD-ROM/ISO support, not a deprecated path. The driver
is concise and feature-gated (boot_iso_read / boot_iso), so the treatment is
a short map rather than exhaustive register tables.
The whole reader lives in kernel/src/iso/mod.rs.
1. Spec basis
- Device: ATAPI CD-ROM on a legacy IDE (Parallel ATA) channel, accessed by
polled PIO over the legacy I/O ports. Not a PCI/virtio device and not
enumerated through PCI; the two legacy channels are probed at fixed port
bases (
PRIMARY_CMD/PRIMARY_CTRL0x1F0/0x3F6,SECONDARY_CMD/SECONDARY_CTRL0x170/0x376). QEMU’s-cdromshorthand attaches the disc on the secondary channel (master), whichAtapiDevice::probescans first. - Authoritative specs:
- ATA Packet Interface — the PACKET command transport and the
ATA/ATAPI register protocol, as standardized in the SFF-8020i /
ATA/ATAPI-4+ family (INCITS T13). The PACKET data-in handshake, the ATAPI
signature in the cylinder-low/high (LBA mid/high) registers, and the
READ(12)/READ CAPACITY(10)command-descriptor blocks come from this basis. The signature the driver matches isATAPI_SIG_MID(0x14) in the LBA-mid register andATAPI_SIG_HIGH(0xEB) in the LBA-high register. - ECMA-119 (equivalently ISO 9660), Volume and File Structure of CDROM for
Information Interchange — the volume-descriptor and directory-record
on-disk layout the
IsoFsparser indexes. The relevant structures are the primary volume descriptor (PVD) and the directory record.
- ATA Packet Interface — the PACKET command transport and the
ATA/ATAPI register protocol, as standardized in the SFF-8020i /
ATA/ATAPI-4+ family (INCITS T13). The PACKET data-in handshake, the ATAPI
signature in the cylinder-low/high (LBA mid/high) registers, and the
- Reference: the legacy IDE/ATAPI PIO sequence and the ISO 9660 fixed-offset field layout are the well-documented OSDev-wiki “ATAPI”/“ISO 9660” baseline; cross-checked against the QEMU IDE/ATAPI device behavior the proofs run against.
2. Wire format (implemented subset)
Only the polled-PIO read subset the driver uses is summarized; ATA features the driver never issues (DMA transfers, write commands, the full SCSI command set) are not implemented and are not transcribed here.
ATAPI PACKET read path
- Channel register map: command-block register offsets relative to the
channel command base —
REG_DATA(0),REG_FEATURES(1),REG_SECCOUNT(2),REG_LBA_LOW(3),REG_LBA_MID(4),REG_LBA_HIGH(5),REG_DRIVE(6),REG_STATUS/REG_COMMAND(7). In the ATAPI PACKET protocolREG_LBA_MID/REG_LBA_HIGHcarry the byte-count low/high for the data phase, not an LBA. - Status / control bits:
STATUS_BSY(0x80),STATUS_DRQ(0x08),STATUS_ERR(0x01); control-blockCTRL_NIEN(0x02, interrupts disabled — this path is polled only) andCTRL_SRST(0x04, soft reset insoft_reset). Drive select isDRIVE_MASTER(0xA0) /DRIVE_SLAVE(0xB0). - Probe / detect:
AtapiDevice::probesoft-resets each channel and callsdetect, which selects a drive and matches theATAPI_SIG_MID/ATAPI_SIG_HIGHsignature; a0xFFstatus is treated as a floating (empty) bus. Every status spin is bounded bySPIN_LIMIT(wait_not_busy/wait_drq), so an absent or wedged device fails closed rather than hanging boot. - PACKET command issue (
AtapiDevice::packet_data_in): writesCMD_PACKET(0xA0) to the command register, programs the per-block byte-count limitBYTE_LIMIT(2048, one CD logical sector) into the LBA mid/high registers, waits for DRQ, then writes the 12-byte command-descriptor block (CDB) as six 16-bit words. The data-in phase reads each DRQ block, taking its byte count from the LBA mid/high registers; it rejects a byte count overBYTE_LIMIT, an odd byte count, or one that would overflow the destination buffer (IsoError::Protocol/BufferTooSmall). - CDBs implemented:
CDB_READ12(0xA8) with the big-endian LBA at bytes 2..6 and transfer length at bytes 6..10 (built inAtapiDevice::read_sectors), andCDB_READ_CAPACITY10(0x25, inAtapiDevice::read_capacity) returning the last addressable LBA and the logical block size. A reported block size is range-checked againstMIN_BLOCK_SIZE(2048) /MAX_BLOCK_SIZE(4096). - Bounded sector read (
AtapiDevice::read_sectors): rejects a zero count, arithmetic overflow (IsoError::InvalidRequest), an LBA range past the reported capacity (OutOfRange), and a destination buffer shorter thancount * block_size(BufferTooSmall), all before any device access. A device that returns fewer bytes than requested is rejected (ShortRead).
ISO 9660 volume structure
- Primary volume descriptor:
IsoFs::mountreadsPVD_LBA(sector 16, after the reserved system area) throughread_sectorsand validates the descriptor type (pvd[0] == 1), theCD001standard identifier (pvd[1..6]), and the version (pvd[6] == 1). ISO 9660 stores integers both-endian (a little-endian half followed by a big-endian half); the driver reads the little-endian half withle_u16/le_u32. It indexes the logical block size (both-endianu16at offset 128, which must equal the device block size), the volume space size (both-endianu32at offset 80), and the embedded root directory record (34-byteMIN_DIR_RECORDat offset 156, whose extent LBA is bytes 2..6 and size bytes 10..14). - Directory records:
IsoFs::lookup/list_boot_binswalk a directory extent record by record. Each record’s length is byte 0 (a zero length skips to the next logical-sector boundary); the file-flags byte 25 carriesFILE_FLAG_DIR(0x02); the file-identifier length is byte 32 and the identifier starts at byte 33; the extent LBA/size are bytes 2..6 / 10..14. The./..self/parent records (identifier0x00/0x01) are skipped, and identifiers are matched case-insensitively after stripping the;versionsuffix and trailing dots (name_matches/normalize_ident). - Path resolution:
IsoFs::lookup_pathdescends from the root through each component;boot_bins_dirresolves/boot/bins/andopen_fileresolves a named file beneath it, each returning a validated(lba, size)extent.
3. capOS mapping
- Binding (boot-only, kernel-owned): this is not a DDF device. It is not
enumerated through PCI and does not bind through the
DeviceMmio/Interrupt/DMAPoolprovider grants the cloud-NIC/storage drivers use; it owns fixed legacy I/O ports directly in kernel mode and runs only during boot. There is no userspace driver and no*_grant_sourcefor the reader itself.- Under
boot_iso_readthe kernel runsiso::boot_read_proof/iso::boot_fs_proof(called fromkernel/src/main.rs) to exercise the device-read primitive and the ISO 9660 walk. - Under
boot_isothe reader is the live boot-binary source:iso::boot_source::init(inkernel/src/main.rsrun_init) builds a registry resolving each declared manifest binary name to its(lba, size)extent viaopen_file(name mapping throughiso_dname, which applies the ISO 9660 d-character substitutionxorrisorecords), andiso::boot_source::read_binaryreads each ELF on demand. The device is owned by the registry and serialized behind aMutexso concurrent spawns on multiple CPUs do not interleave PIO transfers on the shared IDE channel.
- Under
Directory/Filecap fixture: the read path has no caps of its own, but theinstallable_imagecap (kernel/src/cap/installable_image.rs) layers a read-onlyDirectory/FileCapObjectover this reader for the focused QEMU install-source proof. It exposes the packaged/boot/bins/tree to the installer smoke only; it is not a general post-bootstrap ISO filesystem service. It is granted via the qemu-gatedinstallable_image_source(KernelCapSource::InstallableImageSource);Directory.list/Directory.open+File.read/File.statare served and every mutating method fails closed (read-only is structural, not a rights flag). It reuses the driver’s in-bounds checks (IsoFs::validate_extentat mount/open,AtapiDevice::read_sectorsrange validation per read) and is physically scoped to the ATAPI medium, so it cannot reach the writable virtio-blk target disk.- MMIO / Interrupt / DMA: none. Access is legacy port I/O (
in/outvia the module’sinb/outb/inw/outwhelpers), not memory-mapped BARs. Interrupts are disabled (CTRL_NIEN) and the path is polled PIO, so there is no MSI/MSI-X vector binding. Transfers move through the data register word by word, so there is no DMA buffer and no IOMMU/bounce-buffer involvement. - Fail-closed / validation rules: every derived extent is validated against
the volume size (
IsoFs::validate_extent) before it is read, so a malformed or hostile volume cannot drive an out-of-bounds device read; directory extents are capped atMAX_DIR_BYTESand records are length/identifier-bounded before trusting them; capacity and buffer-length checks gateread_sectors; block size is floored/ceiled toMIN_BLOCK_SIZE/MAX_BLOCK_SIZE; and every status wait isSPIN_LIMIT-bounded. All failure modes funnel throughIsoError(NoDevice/Timeout/DeviceError/Protocol/InvalidRequest/OutOfRange/BufferTooSmall/ShortRead/BadVolume/NotFound/NotDirectory). - QEMU-emulable vs hardware-only: fully QEMU-emulable. QEMU’s
-cdromattaches an ATAPI CD-ROM on the secondary legacy IDE channel.make run-boot-iso-readproves the bounded ATAPI PIO read primitive and the ISO 9660 walk;make run-boot-isoand the defaultmake run-smokeprove the live on-demand boot-binary load path; andmake run-installable-image-sourceproves the read-onlyDirectory/Fileinstall-source fixture layered over the reader. No hardware-only path.
Related
kernel/src/iso/mod.rs— the ATAPI PIO reader, the ISO 9660IsoFsdriver, theboot_isoboot_sourceregistry, and the boot-time proofs.kernel/src/cap/installable_image.rs— the read-onlyDirectory/Filecap surface layered over this reader.kernel/src/main.rs—run_initISO boot-binary registry build and on-demand ELF load underboot_iso.
Azure MANA (Microsoft Azure Network Adapter)
This is a provenance map for the MANA / GDMA wire logic in
capos-lib/src/mana.rs: it cites the spec basis, summarizes only the
wire-format subset the code actually implements, and points into the
implementation by symbol name. It is not a re-spec.
Maturity caveat. This page documents protocol encode/decode logic with a
host-side conformance suite, not a bound driver. There is no MANA device in
QEMU, so this logic is a deliberate QEMU-exception gated by cargo test-lib
plus a warning-free cargo build --features qemu, not a make run-* smoke.
End-to-end MANA bind / send / receive / teardown on real Azure hardware –
including SR-IOV VF revocation with fallback-to-synthetic and DMA/MMIO/IRQ
teardown – is future work
(tracked as cloud-azure-mana-nic-live-proof), blocked until Azure access is
provisioned. The ## 3. capOS mapping section below therefore describes the
planned binding, not landed authority.
1. Spec basis
- Device: Microsoft Azure Network Adapter (MANA), the modern Azure NIC for
Dv5/Ev5 and later VM families. Exposed to the guest as a PCI SR-IOV Virtual
Function. PCI vendor
0x1414(Microsoft); device0x00ba(VF, the guest-bound function) /0x00b9(PF). IDs atcapos-lib/src/mana.rs(MANA_PCI_VENDOR_ID,MANA_VF_DEVICE_ID,MANA_PF_DEVICE_ID). The device is fronted by GDMA (Generic DMA), Microsoft’s queue/DMA abstraction; MANA is the network client riding on GDMA queues. - Authoritative spec: MANA has no freely published register specification.
The basis of record is the upstream open-source MANA Linux driver, whose “HW
DATA” structures are the documented wire contract:
include/net/mana/gdma.h– GDMA registers, doorbells, message headers, WQE/CQE/EQE, request-type space, device/queue enums.include/net/mana/mana.h– MANA TX/RX OOB descriptors, completion OOBs,mana_cqe_type,mana_command_code.include/net/mana/hw_channel.h,include/net/mana/shm_channel.h– the HWC management channel and the shared-memory bootstrap aperture.- Reference snapshot:
torvalds/linuxmaster at commitd60ec36cab338dfe2ae40d73e9c8d6c4af70d2b8(thegdma.hstructures are stable across recent kernels).
- Reference driver: the same MANA Linux driver
(
drivers/net/ethernet/microsoft/mana/) is the behavior cross-check;mana_gd_init_req_hdrdefines the standard request-header construction mirrored byGdmaReqHdr::standard.
2. Wire format (implemented subset)
All multi-byte words are little-endian; GDMA “HW DATA” structures are naturally
aligned (not packed). Every decoder validates buffer length, rejects unknown
enum members, and enforces must-be-zero (MBZ) reserved fields; every encoder
range-checks its bitfields. Symbols below are in capos-lib/src/mana.rs.
- Registers / BAR: single register BAR (BAR0). VF doorbell-page and
shared-memory aperture offsets (
GDMA_REG_*) and PF offsets (GDMA_PF_REG_*), the SR-IOV config base, and the fixed CQE/EQE/WQE-BU and max SQE/RQE sizes are in theregsmodule (REG_DB_PAGE_OFFSET,REG_SHM_OFFSET,PF_REG_*,SRIOV_REG_CFG_BASE_OFF,CQE_SIZE,EQE_SIZE,MAX_SQE_SIZE,MAX_RQE_SIZE,WQE_BU_SIZE). - Doorbells: the four-variant
union gdma_doorbell_entryis modeled by theDoorbellEntryenum (Cq/Rq/Sq/Eq), encoding the 24- or 16-bit queue id, the 31- or 32-bit tail pointer, the RQwqe_cnt, and the CQ/EQarmbit, with kind-specific reserved MBZ enforcement on decode. - Admin (HWC) messages:
GdmaMsgHdr(gdma_msg_hdr),GdmaDevId(gdma_dev_id),GdmaReqHdr(gdma_req_hdr, withstandardmirroringmana_gd_init_req_hdr), andGdmaRespHdr(gdma_resp_hdr, reserved-word MBZ). The request-type space isGdmaRequestType(gdma_request_type, fail-closed); the GDMA admin status is the openGdmaStatusspace (success /MoreEntries/CmdUnsupported/ preservedOther, since GDMA status is a firmware error space, not a closed enum). - Work queue:
GdmaSge(gdma_sge, 16-byte SGE with 64-bit address) andGdmaWqeHeader(gdma_wqe, the 8-byte WQE header:num_sge,inline_oob_size_div4,client_oob_in_sgl,client_data_unit, with reserved MBZ). MANA TX OOB descriptors that prepend the SGL:ManaTxShortOob(mana_tx_short_oob, checksum-offload + completion-CQ + vSQ-frame selection) andManaTxLongOob(mana_tx_long_oob, encapsulation / VLAN / inner-offset fields). - Completion / event:
GdmaCqeInfo(gdma_cqe.cqe_info:wq_num,is_sq, 3-bitowner_bits) andGdmaEqeInfo(union gdma_eqe_info: eventtypeviaGdmaEqeType,client_id,owner_bits). MANA completion OOBs:ManaCqeHeader(mana_cqe_header,cqe_typevia the fail-closedManaCqeTypeenum),ManaRxcompOob(mana_rxcomp_oob, RX flags +MANA_RXCOMP_OOB_NUM_PPIper-packetManaRxcompPerpktInfo+ RX WQE offset), andManaTxCompOob(mana_tx_comp_oob, TX data/SGL/WQE offsets + reserved-padding MBZ). - Capability / feature negotiation: the verify-version surface
(
GdmaRequestType::VerifyVfDriverVersion,GDMA_PROTOCOL_V1,GdmaOsType) and the MANA control command spaceManaCommandCode(mana_command_code, fail-closed) includingQueryDevConfig/QueryVportConfig/ConfigVportTx/Rx/CreateWqObj.
3. capOS mapping (planned – not yet implemented)
MANA is a vendor-custom cloud NIC behind SR-IOV. The intended binding, when the live-proof work is unblocked, follows the same userspace-driver authority gate the other DDF device classes use; none of the grants below are exercised by the host conformance logic.
- Authority gate: the MANA VF would be enumerated over PCI, claimed through the reviewed userspace-driver hardware-authority gate, and tracked in the device-manager ownership ledger, exactly as the cloud NIC/storage drivers are planned to bind. The current implementation grants nothing.
DeviceMmio: BAR0 (the GDMA register block, doorbell page, and SHM aperture) would be mapped device-uncacheable / NX, with doorbell writes scoped to the owning driver’s BAR window. The 64-bitDoorbellEntryvalues are the writes that path would emit.Interrupt: GDMA EQs deliver completions via MSI-X; the live driver would bind oneInterruptper EQ vector and arm it through the EQ doorbellarmbit. Theowner_bitsphase mechanism (GdmaCqeInfo/GdmaEqeInfo) is how the driver detects new entries without a tail register.DMAPool: GDMA queues and TX/RX buffers would be allocated from a labeled DMA pool through the selected DMA backend (cloud-dma-backend-selection: direct IOMMU vs labeled bounce buffer), with quiesce/scrub-before-reuse and host-physical-address / IOVA non-exposure. TheGdmaSgeaddress fields are IOVAs from that pool; the current implementation does not allocate or program any DMA.- Fail-closed / validation rules: the encode/decode logic is the fail-closed boundary capOS implements today – unknown request/queue/event/completion types and command codes are rejected, reserved fields are MBZ-enforced, and bitfields are range-checked. Stale-generation rejection, BAR bounds, doorbell scoping, and release/reset/VF-revocation teardown are the live driver’s responsibility and are future work.
- QEMU-emulable vs hardware-only: none of MANA is QEMU-emulable – QEMU
has no MANA device model. The wire logic here is provable only by the host
conformance suite (
cargo test-lib); SR-IOV VF revocation/hot-remove semantics in particular cannot be reproduced even by a hypothetical QEMU MANA device model and remain a live-hardware concern.
GCE gVNIC (Google Virtual Ethernet)
This is a provenance map for gVNIC, the Google Virtual NIC presented to Compute Engine guests. It cites the public specification basis, summarizes only the wire-format subset a capOS driver would implement, and maps the device onto capOS’s userspace-driver hardware-authority gate. It is not a re-spec: where the behavior is defined in the upstream driver or the public docs, it links rather than transcribing register tables.
Maturity caveat. This page remains primarily a grounding map. capOS has
landed live-GCE proofs that request the GVNIC image/instance posture, record
the gVNIC PCI function (1ae0:0042) with BAR and MSI-X metadata, map BAR0
through DeviceMmio, use manager-owned DMA pages for the admin queue and
descriptor buffer, and bring up one GQI/QPL TX/RX queue pair far enough to send
one DHCP DISCOVER raw Ethernet frame and receive one inbound IPv4 frame before
teardown. capOS also has a bounded hardware-only typed Nic adaptation proof
over that same queue path: the proof marker records Nic.transmit,
Nic.receive, Nic.macAddress, and Nic.linkStatus semantics with inline
frame transfer and no host-physical/IOVA export. capOS still has no reusable
gVNIC provider service and no host conformance suite. There is no gVNIC device
model in QEMU, so unlike the virtio-net path there is no local make run-*
smoke that can execute the device. The ## 3. capOS mapping section
distinguishes the landed inventory/admin-queue/raw-frame proof and typed
Nic-adaptation proof from future productionization work. The bounded
implementation lane that consumes this map is decomposed in
Hardware, Boot, and Storage.
gVNIC is a separate GCE portability lane, not a blocker for the first public
Web UI proof. GCE exposes a selectable VIRTIO_NET NIC type on supported
first/second-generation machine families, and capOS already drives modern
virtio-net (see virtio-net). A first public Web UI proof
scoped to a virtio-compatible GCE machine type needs no gVNIC support. gVNIC
matters because Google documents it as the Compute Engine NIC alternative to
virtio, with third-generation-and-later machine series supporting only gVNIC
for virtual network interfaces; it is the portability lane for those shapes, not
a precondition for the virtio-net Web UI proof.
1. Spec basis
- Device: Google Virtual NIC (gVNIC), the modern Compute Engine virtual
network interface. Exposed to the guest as a PCI function with vendor
0x1ae0(Google) and device0x0042. The same vendor/device pair is recorded for the GCP NIC path in Cloud Deployment (“PCI Device IDs for Cloud Hardware”). The upstream Linux driver names the device family GVE (Google Virtual Ethernet). - Authoritative spec: gVNIC has no freely published register specification.
The basis of record is the combination of:
- Google Cloud’s “Using Google Virtual NIC” Compute Engine documentation,
which defines the supported machine families, the
GVNICguest-OS image feature, thenic-type=GVNICinstance network-interface selection, and the virtio-net-versus-gVNIC machine-family matrix (https://cloud.google.com/compute/docs/networking/using-gvnic). - The Google Compute Virtual Ethernet (GVE) Linux driver, whose headers
are the documented wire contract: the device register block
(
gve_register.h), the admin-queue command space (gve_adminq.h), and the GQI / DQO descriptor formats (gve_desc.h,gve_desc_dqo.h). Source: https://github.com/torvalds/linux/tree/master/drivers/net/ethernet/google/gve. - The Linux GVE device-driver documentation, which is the closest thing to a published interface description: BAR layout, admin queue, interrupt classes, the GQI/DQO queue formats, QPL/RDA addressing, and the reset handshake (https://docs.kernel.org/networking/device_drivers/ethernet/google/gve.html).
- Google Cloud’s “Using Google Virtual NIC” Compute Engine documentation,
which defines the supported machine families, the
- Reference driver: the upstream GVE Linux driver
(
drivers/net/ethernet/google/gve/) is the behavior cross-check for the admin-queue handshake, queue creation, and the two descriptor formats.
2. Wire format (subset a capOS driver would implement)
The subset below is the slow-path bring-up plus one traffic-queue format a minimal capOS gVNIC driver would need. Exact register offsets, opcode numbers, and descriptor bit layouts are defined in the GVE headers cited above and are not transcribed here — this is a map, not a re-spec. Endianness is not uniform on this device: admin-queue messages and GQI descriptors are big-endian, while DQO descriptors are little-endian (per the GVE driver docs), so a capOS decoder/encoder must select endianness per structure.
- Registers / BARs: three 32-bit memory BARs.
- BAR0 — device configuration and status registers (the
gve_register.hblock):GVE_DEVICE_STATUS/ driver-status handshake, max TX/RX queue counts, the admin-queue PFN and doorbell, the admin-queue event counter, and the reset trigger. - BAR1 — the MSI-X vector table.
- BAR2 — the IRQ doorbells plus the per-queue RX and TX doorbells.
- BAR0 — device configuration and status registers (the
- Admin queue (AQ): a single page-sized command array. The driver writes a
command into a free slot, advances its submission counter, rings the
admin-queue doorbell in BAR0, and polls the admin-queue event counter until
the device marks the command executed and writes back its status. The
gve_adminq.hopcode space covers device description and resource lifecycle (describe device, configure/deconfigure device resources, register/unregister page list, create/destroy TX queue, create/destroy RX queue, and feature/option negotiation). The landed capOS proofs register the AQ page, issueDESCRIBE_DEVICE, parse the returned descriptor and GQI/QPL option, configure device resources with two notification blocks, register TX/RX queue page lists, create one TX and one RX queue, then destroy/unregister/deconfigure and release the admin queue before emitting evidence. - Interrupt classes: MSI-X only, in two roles.
- A management interrupt that tells the driver to re-examine
GVE_DEVICE_STATUS(link / device-state changes). - Notification-block interrupts, one block servicing a set of traffic queues; a block firing tells the driver to poll the associated queues. The notification blocks are the per-queue completion-signal path.
- A management interrupt that tells the driver to re-examine
- Queue formats (GQI vs DQO): gVNIC defines two mutually incompatible
descriptor formats; a device instance negotiates one.
- GQI (“Google Queue Interface”): fixed-size, power-of-two descriptor rings; the classic format. Big-endian descriptors.
- DQO (“Descriptor Queue, Out-of-order”): split descriptor and completion queues with per-completion generation bits for ownership tracking and 16-bit tags identifying which posted buffer a completion refers to, allowing out-of-order completion. Little-endian descriptors. DQO is the format the newer machine families use.
- Addressing modes (QPL vs RDA): independent of the descriptor format, each
queue uses one of two buffer-addressing modes.
- QPL (“queue page list”): the driver pre-registers a fixed set of guest pages with the device through the admin queue, and descriptors reference offsets into that registered page list rather than arbitrary guest physical addresses. The device only ever DMAs into pages the driver explicitly registered.
- RDA (“raw DMA addressing”): descriptors carry guest DMA addresses directly, so the device can DMA to dynamically allocated guest memory.
- Descriptor / ring ownership: the driver owns descriptor production and doorbell rings; the device owns completions. In GQI the device advances a completion/used position the driver reads; in DQO the device writes completion entries whose generation bit flips when the entry is the device’s to consume, so the driver detects new completions without a separate tail register.
- Reset / link-up sequence: bring-up drives the BAR0 device-status /
driver-status handshake, sets up the admin queue (legacy revision: program the
AQ PFN; newer revisions: program AQ length/base and set driver-status RUN),
issues the admin commands above to describe the device and create queues, and
arms the notification-block interrupts. Teardown follows the upstream driver:
legacy revision writes
0x0to the AQ PFN and waits for it to read back zero; newer revisions write driver-status RESET and wait forDEVICE_IS_RESET. - Known unsupported / out-of-scope features: offloads (checksum, TSO/LRO, RSS hashing), jumbo frames, multi-queue scaling beyond a single TX/RX pair, and the RDA addressing mode are out of scope for an initial bring-up. The first capOS lane targets QPL addressing with one TX and one RX queue (see §3).
3. capOS mapping
gVNIC is a vendor-custom cloud NIC. capOS now exercises inventory,
admin-queue/register, bounded raw-frame GQI/QPL TX/RX, and a bounded typed
Nic-adaptation proof in private GCE runs. Productionization remains future
work: there is no reusable gVNIC provider service, local device model, DQO/RDA
support, or host conformance suite yet.
- Authority gate: the gVNIC PCI function is inventoried over the production
PCI enumeration source. The admin-queue proof binds BAR0 and a manager-owned
DMA pool for one
DESCRIBE_DEVICEcommand (kernel/src/cap/gvnic_adminq_register_proof.rs). The raw-frame proof (kernel/src/cap/gvnic_raw_frame_proof.rs) then uses the same device-manager authority model to configure one GQI/QPL TX/RX queue pair, transmit one DHCP DISCOVER, poll a bounded RX descriptor completion, and tear the queues down. Thecloud_gce_gvnic_nic_cap_adaptation_proofbuild reuses that module’sreport_nic_cap_adaptationpath to prove the existingNicABI semantics over the same GQI/QPL data path: the marker records inline-frameNic.transmit/Nic.receive,Nic.macAddress, andNic.linkStatusevidence without exposing queue addresses or emitting the broader provider bind claim. Both proofs usekernel/src/pci.rsfind_driver_bind_devicefor resolved-source driver enumeration andkernel/src/device_manager/stub.rsdevicemmio_kernel_window_for_prooffor the live BAR0DeviceMmiowindow. They do not issue a reusable userspace gVNIC provider service and do not claimprovider-nic-bound. DeviceMmio: the landed proof stages BAR0 as a device-managerDeviceMmiorecord, bounds all big-endian register accesses to the staged window, rings the admin-queue doorbell, and detaches the record with a stale-handle assertion. The raw-frame proof also maps a bounded 64 KiB BAR2 kernel-only doorbell window and validates returned TX/RX doorbell indexes before ringing them. BAR1 MSI-X remains unprogrammed in this polling proof.Interrupt: the management interrupt and each notification-block vector would each bind oneInterruptcap over an MSI-X table entry, with the same mask-first / deferred-LAPIC-EOI lifecycle the landed production interrupt path uses (kernel/src/device_interrupt.rs, exercised by the virtio-net userspace IRQ-ownership slice). gVNIC uses MSI-X exclusively — there is no legacy-IRQ fallback. The admin-queue proof does not program MSI-X.DMAPool/DMABuffer: the admin-queue pages come from the manager-owned bounce-buffer pool throughstage_bounce_buffer_dmapool_recordandissue_manager_attached_dmabuffer_handle_with_request. The raw-frame proof keeps larger queue resources and QPL pages manager/proof-owned, publishes device-visible addresses only internally to the hardware, and never grants userspace aDMABuffercap or raw host-physical/IOVA value. It assertsDmaBufferCap::info_for_handlereportshost_physical_user_visible=0,device_iova=0, andiova_export=disabled-future-only. Teardown destroys queues, unregisters both QPLs, deconfigures device resources, releases/resets the admin queue, scrubs/frees traffic frames, requires scrub/ledger removal/frame-free labels for manager buffers, and checks stale pool/buffer/MMIO handles. Future reusable gVNIC provider integration must use the same selected DMA backend model documented in DMA Isolation.- Fail-closed / validation rules: the landed proof emits
cloudboot-evidence: gvnic-adminq-register <token>orcloudboot-evidence: gvnic-raw-frame-tx-rx <token>only after the bounded command/traffic sequence passes, the release/reset handshake completes, the PCI command register is restored, and staleDeviceMmio/DMAPool/DMABufferhandles all fail closed. The typed adaptation proof emitscloudboot-evidence: gvnic-nic-cap-adaptation <token>only after the same teardown and stale-handle checks plusNic-semantic TX/RX evidence. If queue or admin-queue release times out, the proof intentionally leaves still-owned DMA pages live and emits no success marker rather than freeing memory the device may still own. - QEMU-emulable vs hardware-only: none of gVNIC is QEMU-emulable — QEMU
has no gVNIC/GVE device model. Every bind step is therefore hardware-only and
requires a private, explicitly billable GCE instance launched with the
GVNICguest-OS feature andnic-type=GVNIC. The lane is gated accordingly: the landed inventory proof (cloud-gce-gvnic-image-launch-inventory-proof), the landed admin-queue/register proof (cloud-gce-gvnic-adminq-register-proof), the landed bounded raw-frame TX/RX proof (cloud-gce-gvnic-raw-frame-tx-rx-proof), and the landed typedNicadaptation proof (cloud-gce-gvnic-nic-cap-adaptation-proof). Each is decomposed in Hardware, Boot, and Storage and requires a private, explicitly billable GCE run for hardware evidence.
Documentation Workflow
The published documentation is organized as a system manual first. The top of
docs/SUMMARY.md should lead with pages that explain how to understand, build,
boot, configure, operate, and review the current capOS implementation.
The mdBook site may keep the wider project corpus reachable for maintainers: roadmap, changelog, backlog, proposal, paper, and research files can remain under the lower archive section. Those pages should not shape the primary reader path, and they should not be treated as part of the generated PDF manual unless they become current system documentation.
PDF Manual Pipeline
The PDF is a Typst-authored manual shell plus generated body content:
docs/manual.typexplicitly lists the Markdown pages that belong in the generated manual with{{CAPOS_MANUAL_PAGE:...}}placeholders. The mdBook site navigation indocs/SUMMARY.mdcan point at a different landing page or archive structure without changing the PDF contents.tools/docs-bundle.jsreads that explicit page list, rewrites bundled-doc links to PDF-local heading anchors, emits the aggregate generated Markdown attarget/docs-bundle/manual.md, and emits one Markdown file per manual page undertarget/docs-bundle/.mdbook-mermaidchecks Mermaid syntax, andmermaid-cliconverts Mermaid blocks in the generated Markdown to 2x PNG artifacts undertarget/docs-bundle/.uv tool run --constraints tools/md2typst-constraints.txt --from md2typst==0.3.3 md2typstconverts each generated Markdown page to Typst with the converter dependency set pinned.tools/build-typst-manual.jsnormalizes the converted pages, fillsdocs/manual.typwith generated version/date/source metadata and the selected page include paths, and writestarget/docs-bundle/manual.pdf.typ. The normalizer also collapses Markdown source-wrap line breaks outside code blocks so PDF prose and list items use normal paragraph layout, demotes generated page headings so manual parts remain the only top-level outline entries, and scales selected tall Mermaid diagrams so they fit with their surrounding manual context instead of becoming orphaned figure pages.- The pinned Typst binary compiles the final PDF.
docs/manual.typ owns the PDF document structure: title page, version block,
table of contents, page setup, base typography, and the explicit
manual page order. Manual part dividers are top-level headings; generated page
titles are demoted during PDF normalization so chapters sit below those parts
instead of appearing as peers.
Most manual pages are generated from Markdown through md2typst. A page can be
overridden for the PDF only by adding a checked-in Typst file at
docs/manual-overrides/<page-id>.typ, where <page-id> is the source path
with non-alphanumeric characters collapsed to hyphens, for example
docs/manual-overrides/architecture-memory.typ. Overrides replace the
generated Typst page in the PDF but do not change the mdBook page or
target/docs-bundle/manual.md. Override files are copied into target/docs-bundle/
before Typst compilation and should be self-contained Typst fragments.
Benchmark result tables stay in their source Markdown pages. If a wide
benchmark table needs PDF-specific layout, mark that Markdown table with
<!-- capos-benchmark-results:<id> start --> and matching end comments.
The mdBook site renders the source table, while tools/build-typst-manual.js
parses the marked table and replaces only that table region with a compact
Typst rendering. Keep interpretation, caveats, and conclusions in normal prose
around the table rather than encoding them in the table parser.
Generated files under docs/topics.md, target/docs-bundle/, and
target/docs-site/ must remain untracked.
The mdBook metadata preprocessor and PDF bundler normalize default cross-document link labels. When a link label is only the target Markdown path or filename, rendered site and manual output use the target document title instead. Keep explicit prose labels in source when the surrounding sentence needs a more specific phrase than the document title.
PDF Typography Rules
The manual and the schema paper should share a conservative typographic base: letter paper, readable serif body text, a restrained heading scale, consistent link color, consistent code styling, and predictable figure/table captions. They do not need identical layouts. The paper can remain citation-oriented and formal; the manual should favor scanning, command lookup, and dense technical reference pages.
For the manual PDF:
- Keep body text readable before optimizing page count. Avoid global spacing changes that create worse page breaks or orphaned callouts.
- Use headings as navigation markers: leave more room before a heading than after it, and keep headings with the first paragraph or code block whenever practical. In the manual PDF, the below-heading gap must be visibly larger than ordinary line leading, while the above-heading gap remains larger than the below gap so the heading belongs to the content that follows.
- Treat long bullets as structure problems. Prefer short bullets, definition lists, or command/proof tables over paragraph-length list items.
- Use framed code blocks for commands and transcripts. Give them visible internal padding, a very light background, and enough surrounding whitespace to read as intentional panels.
- Keep inline code sparse in prose. When a sentence accumulates several commands, paths, or target names, prefer a code block or table.
- Use one callout style consistently. A left rule or light box is acceptable, but the callout needs enough padding that it does not look like accidental indentation.
- Avoid visual changes without checking rendered pages. Review at least one command-heavy page and one dense prose/list page after each PDF style change.
Scope Rules
The PDF manual includes current system documentation: introduction, status, build and boot workflow, configuration, repository map, runnable demos, architecture, and security/verification pages.
Project archives stay on the mdBook site but are excluded from the PDF manual: proposals, backlogs, research notes, whitepaper planning, and other planning records are useful context for maintainers, not the operator-facing manual.
Topics Index
This page is generated from document front matter fields during mdbook builds:
statusdescriptiontopics
Quick Orientation
- Backlog — Detailed task decompositions.
- Benchmarks — Current benchmark policy and results.
- Build, Boot, and Test — Build, ISO, QEMU, host-test commands.
- Capability-Based and Microkernel Operating Systems Survey — Design consequences pulled from the survey.
- capOS Agentic Development Experiment — Longitudinal study design for using capOS development sessions, subagents, reviews, and recap tooling as an agentic software-engineering experiment.
- capOS Repository Harness Engineering — Repository-local harness engineering for making capOS legible, checkable, and safer for long-running coding agents.
- Changelog — Historical milestone reports.
- Current Design Authority — Current-design authority map and proposal lifecycle rule for keeping implemented behavior out of archival proposal records.
- Current Status — What works, what is partial.
- Design Risks and Open Questions — Consolidated index of long-horizon design risks.
- Introduction — Top-level documentation site entry.
- Proposal Index — Proposal status table.
- Repository Map — Source-tree subsystem index.
- Research and Design Gaps — Research/design gap triage backlog.
- Roadmap — Long-term architectural plan.
- What capOS Is — One-page system model.
Capabilities, IPC, and Authority
- ABI Evolution Policy — Compatibility policy for capOS schema and ring ABIs.
- Authority Accounting — Authority accounting rules for capability transfer and resource charges.
- Cap’n Proto Error Handling — Prior-art on capnp-rpc error semantics.
- Capability Model — Core capability object model, cap tables, schema interface IDs, grants, receiver metadata, and transfer.
- Capability Ring — Shared-memory capability ring ABI, dispatch paths, and completion semantics.
- Capability-Infrastructure Cluster — Decomposition of the near-term capability-infrastructure cluster: matured proposals and Stage 6 remainder that share the schema serial surface.
- Cloudflare, Cap’n Proto, Workers RPC, and Cap’n Web — Cloudflare Workers, workerd, Durable Objects, Workers RPC, Cap’n Web, and Cloudflare’s production use of Cap’n Proto/KJ.
- Crash Recovery and Supervision — Unplanned-failure detection, stale-cap propagation, structured crash records, watchdog liveness, and bounded restart policy for capOS services.
- Debug and Trace Authority — Capability-scoped debug session attach, read-only cap-table inspection, ring-trace replay, and sampler authority without ambient process inspection.
- Delegated Subject Context — Future delegated-subject and act-on-behalf-of capability model.
- Error Handling — Current error model for capability ring CQE status, CapException payloads, endpoint RETURN exceptions, and ordinary schema result unions.
- Error Handling — Transport and application error model for capability calls and CQE results.
- Genode — Genode OS Framework: capability-based component model, session routing, VFS plugin architecture, POSIX compatibility, and Sculpt OS – with lessons for capOS.
- IPC and Endpoints — Endpoint IPC, capability transfer, direct handoff, and shared-memory data paths.
- Memory Authority Model — Memory authority, residency classes, mapping consistency, OOM boundaries, and proof obligations.
- OS Error Handling — Cross-OS error-model comparison.
- Rejected: Cap’n Proto SQE Envelope — Rationale for keeping ring SQEs fixed-layout instead of Cap’n Proto envelopes.
- Rejected: Endpoint Badges as Service Identity — Post-mortem of the rejected seL4-style endpoint badge service identity model.
- Remote Session CapSet Clients — Remote host app model for authenticated capOS sessions, broker-issued CapSet views, and typed capability calls over Cap’n Proto RPC.
- Resource Accounting and Quotas — Resource profiles, quota ledgers, donation, reservation, and fail-closed accounting semantics.
- Schema Registry — A SchemaRegistry capability that serves Cap’n Proto reflection metadata – interface IDs, method names and ordinals, parameter/result layouts, and doc comments – at runtime, as the machine-readable twin of the System Manual.
- Service Architecture — Capability-based service composition, authority-at-spawn, exports, and service graph policy.
- Service Object Identity Migration — Superseded large-chunk migration plan for service object identity, retained as historical context after the active direction changed to session-bound invocation context.
- Session Context — Current session-bound invocation context, endpoint caller-session metadata, disclosure, transfer-scope, and liveness rules.
- Session-Bound Invocation Context — Implementation plan for one-session-per-process invocation context and session-keyed shared services.
- Session-Bound Invocation Context — Session-bound invocation context and privacy-aware disclosure model replacing service-object identity migration.
- Spritely, OCapN, and CapTP — Spritely, OCapN, CapTP, netlayers, locators, Syrup, promise pipelining, handoffs, and capability-network lessons for capOS.
- Stage 6 Capability Semantics — Stage 6 capability work.
- Standard App Capabilities — Per-app AppData storage, a user-mediated powerbox/file-picker grant, and attenuated capability sharing as standard app-facing capabilities.
- Superseded: Service Object Capabilities — Superseded service-minted object capability model that was replaced by session-bound invocation context.
- System Info Capability — SystemInfo capability for MOTD, hostname, host metadata, help topics, and shell bundle integration.
- System Manual Capability — A built-in man-pages analog: the Manual capability serves Unix-style reference pages, schema-derived interface manuals, and a man-shaped reference corpus through the shell, the self-served web UI, and a typed capnp API.
- Time and Clock Authority — Capability-native wall-clock authority with provenance labeling, clock discipline, and trusted timestamps for audit and TLS.
- Userspace Authority Broker — Userspace shell-bundle broker and lifecycle-control authority model.
- Zircon — Fuchsia Zircon kernel: handle-based capability model, channels, VMARs/VMOs, async ports, and FIDL – with lessons for capOS capability dispatch, IPC, and memory design.
Boot, Manifests, and Init
- Boot Flow — Kernel boot, manifest handoff, init launch, and QEMU boot-proof flow.
- Boot to Shell — Login, setup, session, credential, and broker path from boot into the native shell.
- Cloud Image Import and Serial-Console Boot — Cloud provider disk-image import and serial-console-boot notes.
- Cloud Metadata — Cloud metadata and config-drive bootstrap through scoped configuration capabilities.
- Configuration — How operators extend the default capOS boot manifest with a gitignored
system.local.cueoverlay and convert CUE-authored data to specified Cap’n Proto schemas. - Hardware, Boot, and Storage — Hardware bring-up backlog.
- Installable System — Ordered implementation track turning the installable-system proposal into work grounded in the landed BlockDevice/filesystem/Store/writable-persistence/disk-image contracts.
- Installable System — Design for an installed, persistent capOS that boots from disk and keeps mutable system configuration across reboots, composed with the immutable boot manifest.
- Manifest and Service Startup — Manifest encoding, service graph validation, bootstrap grants, and init-side spawning.
- Run Targets, Init Mandate, and Default-Run Integration — Run-target governance.
- Stateful Task and Job Graphs — Durable stateful task and job graphs for init orchestration, package builds, operator work, and notebook-style run stories without creating a god object.
- System Configuration and Operator Extensibility — Layered CUE configuration model for operator boot-manifest overlays, host-user injection, and per-user toolchain caches.
Process Model, Threading, and Scheduling
- Completion Rings And Threaded Runtimes — Io_uring-style transports under threaded runtimes.
- Crash Recovery and Supervision — Unplanned-failure detection, stale-cap propagation, structured crash records, watchdog liveness, and bounded restart policy for capOS services.
- Future Scheduler Architecture — Survey of modern scheduler algorithms and architectures for capOS scheduler evolution.
- HPC Parallel Patterns — HPC benchmark and programming-model grounding for generic parallel processing patterns.
- HPC Parallel Processing Patterns — Generic single-node and multi-node parallel processing patterns for HPC-style benchmark coverage.
- In-Process Threading — In-process thread lifecycle, scheduler references, ThreadControl, and ParkSpace integration.
- Linux Sandboxes and Virtualization for Workloads — Linux sandbox, container, gVisor, KVM, microVM, and CPU-isolation prior art for generic Linux workload execution.
- NO_HZ, SQPOLL, and Realtime Scheduling — Linux NO_HZ, io_uring SQPOLL, CPU isolation, PREEMPT_RT, SCHED_DEADLINE, and seL4 MCS grounding for capOS timer and realtime design.
- Out-of-Kernel Scheduling — Prior art survey on kernel versus userspace CPU scheduling policy split, with capOS design implications.
- Park Authority — ParkSpace wait/wake authority, ABI, and shared park-word constraints.
- Process Model — Process isolation, ELF loading, bootstrap ABI, lifecycle, and spawn authority.
- Rejected: Sleep(INF) Process Termination — Rationale for explicit process termination instead of infinite-sleep lifecycle semantics.
- Ring v2 For Full SMP — Per-thread ring, completion routing, SQPOLL ownership, and full-SMP transport model.
- Scheduler Evolution — Detailed task decomposition for future capOS scheduler evolution.
- Scheduler Evolution — Layered scheduler evolution from bootstrap round-robin to per-CPU fair scheduling, scheduling contexts, CPU leases, and user-space policy.
- Scheduling — Preemption, run queues, blocking waits, timer wakeups, and SMP scheduler proof points.
- SMP — Per-CPU state, AP startup, scheduler ownership, TLB shootdown, and multi-core roadmap.
- SMP Phase C — SMP backlog.
- Tickless and Realtime Scheduling — Tickless idle, SQPOLL nohz CPU isolation, request deadlines, scheduling contexts, and realtime islands.
- x2APIC And APIC Virtualization — Primary-source grounding for xAPIC/x2APIC backend selection and APIC virtualization constraints.
Memory and Resource Accounting
- Cloud DMA Provider Evidence Inventory — Official AWS/Azure/GCP device-surface facts, an evidence-matrix schema, a live guest-probe checklist, and classification rules for the cloud DMA backend decision.
- Cloud Driver Foundation Gap Analysis — Gap analysis between the existing userspace virtio driver foundation and the blocked cloud NIC/storage driver tasks: what is already proven, the narrow per-task remaining work, and the superseded live-NIC runnable-now claim.
- Device Manager Refactor — Refactor direction for separating the kernel device authority ledger from QEMU proof scaffolding.
- DMA Assurance Model — Assurance model for DMA authority, backend selection, and proof obligations.
- DMA Isolation — DMA isolation model for device memory, IOMMU policy, and capability-scoped hardware access.
- DMA User-Space Driver Isolation — DMA, user-space driver, vIOMMU, and no-IOMMU bounce-buffer design consequences for capOS device authority.
- Go VirtualMemory Contract — VirtualMemory cap contract for Go.
- IOMMU Remapping Grounding — Primary-source grounding for Intel VT-d (landed under cfg(qemu)), AMD-Vi, and QEMU IOMMU remapping work.
- Memory Authority Model — Memory authority model backlog.
- Memory Authority Model — Memory authority, residency classes, mapping consistency, OOM boundaries, and proof obligations.
- Memory Management — Physical frames, address spaces, user buffers, MemoryObject, and VirtualMemory contracts.
- NVMe Model B Doorbell DMA Validator — Conditional DMA-address ownership model for the userspace NVMe storage provider: provider-written queue-base and PRP/SGL addresses require a non-host-physical device-visible namespace; no-IOMMU GCP planning must use brokered bounce address publication instead.
- OOM Handling and Swap — Memory-pressure, OOM, anonymous-memory budgeting, and optional encrypted swap policy.
- Resource Accounting and Quotas — Resource profiles, quota ledgers, donation, reservation, and fail-closed accounting semantics.
- virtio-rng — Provenance map for the in-tree virtio-rng entropy device - spec basis, implemented wire-format subset, and its role as a QEMU-only DDF metadata and IOMMU-remapping hardware-DMA proof fixture (no userspace-facing capability, not a production driver).
Userspace Runtime, Languages, and Binaries
- Browser Capability and Agent Web Sessions — Browser profiles, cap-native document engines, visual browsing, and agent/shell browser sessions as capability-scoped services.
- Browser Engines, Document Engines, and Agent Browsers — Browser engine portability, cap-native document-engine options, and agent-browser patterns for capOS browser capabilities.
- Browser/WASM — Browser-hosted capOS experiment using WebAssembly and worker-per-process isolation.
- capOS SDK and Dual Transport — capOS front-door SDK crate with a transport abstraction for in-system and remote clients, plus crate-namespace publication.
- capos-service — Userspace service framework (Rust crate
capos-service) for lifecycle, endpoint loops, readiness, shutdown, metrics, context, and resource hooks. - Cloudflare, Cap’n Proto, Workers RPC, and Cap’n Web — Cloudflare Workers, workerd, Durable Objects, Workers RPC, Cap’n Web, and Cloudflare’s production use of Cap’n Proto/KJ.
- Go Runtime — Go runtime plan for GOOS=capos, memory growth, TLS, scheduling, and networking.
- IX-on-capOS Hosting — IX as a package corpus, content-addressed build/store model, and a capability-native build-service surface for capOS.
- Language Support Status and Plans — Current and planned programming-language support on capOS.
- Linux Sandboxes and Virtualization for Workloads — Linux sandbox, container, gVisor, KVM, microVM, and CPU-isolation prior art for generic Linux workload execution.
- LLVM Target — Custom LLVM target triple requirements: kernel on x86_64-unknown-none, userspace on x86_64-unknown-capos; calling conventions, TLS, relocations, and Go/C runtime porting.
- Lua Scripting — Capability-scoped Lua runner with curated libraries and explicit grants.
- POSIX Adapter — POSIX compatibility adapter (libcapos-posix) over the libcapos C-ABI substrate, with smallest-deps POSIX shell and DNS resolver as the first ports.
- POSIX Adapter Dash Port — POSIX adapter Phase P1.4 (dash port) backlog – libcapos-posix file/dir/stdio/env/printf surface, dash vendoring + per-call-site patch, and the run-posix-shell-smoke harness.
- Runtime, Networking, and Shell — Runtime/network/shell backlog.
- Scientific Agent-Lab Software Stack — Scientific computing, solver, proof-assistant, notebook, and reproducible-package prior art for a capOS-hosted LLM research lab.
- Scientific Standard Package and Agent Lab Capabilities — Scientific standard package and agent-lab capability services for CAS, solvers, proof assistants, notebooks, and reproducible research environments.
- Userspace Binaries — Native userspace binary model, capos-rt authority handling, language runtimes, and compatibility adapters.
- Userspace Runtime — capos-rt entry ABI, heap, CapSet lookup, ring client, and typed userspace capability clients.
- WASI Host Adapter — WASI host adapter as a userspace process whose imports are backed by typed capOS capabilities. Phase W.1 host-runtime scaffold landed 2026-05-05 19:12 UTC; Phase W.2 sub-slice 1 (wasm-host binary + empty-instantiation smoke + userspace-image budget bump) landed 2026-05-06 20:19 UTC; Phase W.2 sub-slice 2 (Preview 1 stdout-only imports plus probe-driven nosys=52 proof) landed 2026-05-07 08:03 UTC; Phase W.2 sub-slice 3 (Rust
hello, wasismoke + manifest-payload load path) landed 2026-05-07 09:36 UTC; Phase W.2 sub-slice 4 (Chello, wasismoke) landed 2026-05-07 10:53 UTC and closes Phase W.2; Phase W.3 (per-instance CapSet plumbing + LaunchParameters bounded-text argv grant + wasi-cli-args smoke) landed 2026-05-07 18:25 UTC; Phase W.4 (random_getproduction-ready against the kernelEntropySourcecap + wasi-random granted/ungranted smokes) landed 2026-05-07 20:09 UTC. A 2026-05-13 compatibility-import smoke promotes authority-free Preview 1 imports (clock_res_get(MONOTONIC),sched_yield, and stdio fd metadata/seek behavior); a 2026-05-13 bounded environment grant reflectsinitConfig.init.wasiEnvthroughenviron_get/environ_sizes_get, withmake wasi-env-negative-checkcovering count, per-entry, total-byte, and interior-NUL rejection; the refusal smoke (make run-wasi-preview1-refusals) proves nine representative blocked filesystem/socket imports fail closed withERRNO_NOSYS = 52(extended 2026-05-13 21:15 UTC to coverfd_pread,fd_pwrite,path_create_directory,sock_shutdownin addition to the original five). Open Questions §1 (per-instance vs per-process) and §3 (poll_oneoffsemantics) resolved 2026-05-13 16:46 UTC; §6 (environ_getsource) and §7 (args_getsource) reclassified as resolved by Phase W.3 with the bounded manifest-text grants. W.5 (filesystem) closed 2026-05-17 05:42 UTC: the wasm-host installs the manifest-granted rootDirectorycap (CapSet slotroot) as a single Preview 1 preopen at fd 3 (/preopen-0) and implementspath_open,fd_read,fd_write,fd_seek,fd_close,fd_filestat_get,fd_prestat_get, andfd_prestat_dir_nameagainst the kernelDirectory/Filecap interface incapos-wasm/src/wasi/fs.rs(POSIX P1.4 Slice 4 resolver shape);fd_readdirover the preopenDirectory.listlanded 2026-05-24 08:44 UTC;fd_tell(host-side position read) andfd_filestat_set_size(overFile.truncate) landed 2026-05-24 09:34 UTC, completing the File-cap method triad with no schema change;path_create_directoryandpath_remove_directory(overDirectory.mkdir/remove, same preopen sandbox, no schema change) landed 2026-05-24 10:09 UTC;fd_preadandfd_pwritelanded 2026-05-30 14:49 UTC as positional I/O over the hostFilecap (no schema change –File.read/File.writealready carry an explicit offset), using the WASI-supplied offset and leaving the fd’s stream position untouched (the positional-I/O invariant).path_filestat_getandpath_unlink_filelanded 2026-05-30 as path-resolved metadata/removal over the hostFile.stat/Directory.removecaps (no schema change), leaving onlypath_filestat_set_times,path_rename, and the symlink/link family fail-closed. Themake run-wasi-fssmoke (system-wasi-fs.cue,demos/wasi-fs/,tools/qemu-wasi-fs-smoke.sh) completes a fullpath_open(CREAT+TRUNC)/fd_write/fd_close/ re-open /fd_filestat_get/fd_seek/fd_readround trip, asserts the preopen sandbox refuses absolute paths and..segments withERRNO_NOTCAPABLE = 76, proves the positionalfd_pwrite/fd_preadround trip leaves the offset unchanged plus the negative-offset and stdio refusals, and statssmoke.txtby path (size 4, regular-file type) before unlinking it; the existingmake run-wasi-preview1-refusalssmoke continues to pass with W.5-split errnos (path_open/fd_prestat_get/fd_read/path_create_directory/fd_pread/fd_pwrite/path_filestat_get/path_unlink_filenow returnERRNO_BADF = 8against an absent preopen, only the socket imports stay atERRNO_NOSYS = 52).Store/Namespaceintegration remains deferred. W.6 (sockets) remains blocked on the userspace network stack. W.7 (Component Model) and W.8 (TinyGo / Go-on-WASI CUE evaluator) remain blocked on the std-userspace decision.
Shells and Interactive Surfaces
- Boot to Shell — Login, setup, session, credential, and broker path from boot into the native shell.
- Browser Capability and Agent Web Sessions — Browser profiles, cap-native document engines, visual browsing, and agent/shell browser sessions as capability-scoped services.
- Browser Engines, Document Engines, and Agent Browsers — Browser engine portability, cap-native document-engine options, and agent-browser patterns for capOS browser capabilities.
- capOS-Hosted Agent Swarms — capOS-hosted OpenClaw-like personal agents, agent swarms, harness controls, memory, retrieval, and research agenda.
- Chat As Multimedia Substrate — Chat as unified text/audio/video multimedia transport across human, agent, and service participants, with listener-cap delivery and a clean WebRTC mapping.
- Default User Avatar — Deterministic default user avatar derived from a stable account identifier, with explicit user override.
- Interactive Command Surfaces — Structured command-session model for native interactive applications over typed invocations.
- Language Models and Agent Runtime — Language-model, embedder, agent-runner, and browser-agent capability interfaces.
- Realtime Voice Agent Shell — Realtime audio agent shell model across browser media, provider sessions, and brokered tools.
- Remote Session CapSet Clients — Remote host app model for authenticated capOS sessions, broker-issued CapSet views, and typed capability calls over Cap’n Proto RPC.
- Schema Registry — A SchemaRegistry capability that serves Cap’n Proto reflection metadata – interface IDs, method names and ordinals, parameter/result layouts, and doc comments – at runtime, as the machine-readable twin of the System Manual.
- Shell — Native, agent-oriented, and POSIX shell models over explicit capability grants.
- SSH Shell Gateway — SSH terminal gateway design preserving TerminalSession and broker-issued shell boundaries.
- Stateful Task and Job Graphs — Durable stateful task and job graphs for init orchestration, package builds, operator work, and notebook-style run stories without creating a god object.
- System Info Capability — SystemInfo capability for MOTD, hostname, host metadata, help topics, and shell bundle integration.
- System Manual Capability — A built-in man-pages analog: the Manual capability serves Unix-style reference pages, schema-derived interface manuals, and a man-shaped reference corpus through the shell, the self-served web UI, and a typed capnp API.
- Telnet over TLS Shell — Optional TLS-protected Telnet TerminalSession gateway with client certificates and credential fallback.
Networking
- Azure MANA — Provenance map for the Azure MANA NIC / GDMA wire logic - spec basis, implemented host-conformance wire-format subset, and capOS authority mapping.
- Browser Capability and Agent Web Sessions — Browser profiles, cap-native document engines, visual browsing, and agent/shell browser sessions as capability-scoped services.
- capOS SDK and Dual Transport — capOS front-door SDK crate with a transport abstraction for in-system and remote clients, plus crate-namespace publication.
- capos-service — Userspace service framework (Rust crate
capos-service) for lifecycle, endpoint loops, readiness, shutdown, metrics, context, and resource hooks. - Chat As Multimedia Substrate — Chat as unified text/audio/video multimedia transport across human, agent, and service participants, with listener-cap delivery and a clean WebRTC mapping.
- Cloud DMA Provider Evidence Inventory — Official AWS/Azure/GCP device-surface facts, an evidence-matrix schema, a live guest-probe checklist, and classification rules for the cloud DMA backend decision.
- Cloudflare, Cap’n Proto, Workers RPC, and Cap’n Web — Cloudflare Workers, workerd, Durable Objects, Workers RPC, Cap’n Web, and Cloudflare’s production use of Cap’n Proto/KJ.
- GCE gVNIC — Provenance map for the GCE gVNIC (Google Virtual Ethernet) NIC - spec basis from the public gVNIC docs and the GVE Linux driver, the wire-format subset capOS exercises today, and the bounded Nic-cap adaptation proof. capOS has live-GCE inventory, admin-queue/register, raw-frame GQI/QPL TX/RX, and typed Nic-adaptation proofs, but no reusable gVNIC provider service or host conformance suite yet.
- Google Drive Storage Backend — Use a Google-authenticated user’s Drive as a capOS storage backend behind the standard storage caps, via a browser-transport near-term path and a native OAuth2/HTTP/TLS backend later.
- Network Usability and Post-smoltcp — Network usability, resolver, diagnostics, and post-smoltcp backlog.
- Network-Reachable Datapath Scope Decision — Scope decision recording that the real-GCE-boot milestone’s reachable-network-stack requirement means raw-frame TX/RX (Option A), not L4 sockets, grounded in what the billable cloudboot harness actually gates on.
- Networking — Network capability architecture from virtio-net smoke to TCP sockets and terminal handoff.
- Phase C Userspace NIC Driver Relocation — Phase C design for relocating the virtio-net driver into userspace: the cap-surface delta, the inline-
DataNic ABI (matching the networking-proposal draft), the writable selected-write common-config window (an extension of the accepted notify-doorbell discipline; slice 1 landed 2026-06-02 20:30 UTC at c9518b2d), the userspace-vring slice that reuses the landed production DMA isolation (bounce policy + dma_backend probe + IOMMU IOVA-export), the sustained-receiveNicABI design used by the multi-frame TCP path, the selected serve-from-userspace 7c-ii(b) socket-authority proof, and retirement of the non-qemu legacy kernel socket grant path. - Pingora — Proxy/server framework as a userspace runtime case study.
- Remote Session CapSet Client — Remote session CapSet client backlog.
- Remote Session CapSet Clients — Remote host app model for authenticated capOS sessions, broker-issued CapSet views, and typed capability calls over Cap’n Proto RPC.
- Spritely, OCapN, and CapTP — Spritely, OCapN, CapTP, netlayers, locators, Syrup, promise pipelining, handoffs, and capability-network lessons for capOS.
- SSH Shell Gateway — SSH terminal gateway design preserving TerminalSession and broker-issued shell boundaries.
- Telnet over TLS Shell — Optional TLS-protected Telnet TerminalSession gateway with client certificates and credential fallback.
- virtio-net — Provenance map for the in-tree modern virtio-net PCI NIC - spec basis, implemented wire-format subset, and capOS authority binding.
Storage, Persistence, and Naming
- Cloud DMA Provider Evidence Inventory — Official AWS/Azure/GCP device-surface facts, an evidence-matrix schema, a live guest-probe checklist, and classification rules for the cloud DMA backend decision.
- Google Drive Storage Backend — Use a Google-authenticated user’s Drive as a capOS storage backend behind the standard storage caps, via a browser-transport near-term path and a native OAuth2/HTTP/TLS backend later.
- Hardware Audit Log Persistence — Durable, tamper-evident persistence and admission policy for the hardware audit log.
- Hardware, Boot, and Storage — Hardware bring-up backlog.
- Installable System — Ordered implementation track turning the installable-system proposal into work grounded in the landed BlockDevice/filesystem/Store/writable-persistence/disk-image contracts.
- Installable System — Design for an installed, persistent capOS that boots from disk and keeps mutable system configuration across reboots, composed with the immutable boot manifest.
- IX-on-capOS Hosting — IX as a package corpus, content-addressed build/store model, and a capability-native build-service surface for capOS.
- Standard App Capabilities — Per-app AppData storage, a user-mediated powerbox/file-picker grant, and attenuated capability sharing as standard app-facing capabilities.
- Stateful Task and Job Graphs — Durable stateful task and job graphs for init orchestration, package builds, operator work, and notebook-style run stories without creating a god object.
- Storage and Naming — Capability-native storage, namespaces, boot packages, volumes, and persistence model.
- Volume Encryption — Encryption-at-rest model for system and user volumes with recovery and KMS options.
Identity, Policy, and User Accounts
- Configuration — How operators extend the default capOS boot manifest with a gitignored
system.local.cueoverlay and convert CUE-authored data to specified Cap’n Proto schemas. - Default User Avatar — Deterministic default user avatar derived from a stable account identifier, with explicit user override.
- Delegated Subject Context — Future delegated-subject and act-on-behalf-of capability model.
- Formal MAC/MIC — Formal mandatory access and integrity model for future policy and proof work.
- Google Drive Storage Backend — Use a Google-authenticated user’s Drive as a capOS storage backend behind the standard storage caps, via a browser-transport near-term path and a native OAuth2/HTTP/TLS backend later.
- Local Users, Storage, and Policy — Identity/local-user backlog.
- OIDC and OAuth2 — Federated login, OAuth2 clients, token capabilities, JWKS, DPoP, and broker integration.
- Rejected: Endpoint Badges as Service Identity — Post-mortem of the rejected seL4-style endpoint badge service identity model.
- Remote Session CapSet Client — Remote session CapSet client backlog.
- Remote Session CapSet Clients — Remote host app model for authenticated capOS sessions, broker-issued CapSet views, and typed capability calls over Cap’n Proto RPC.
- Service Object Identity Migration — Superseded large-chunk migration plan for service object identity, retained as historical context after the active direction changed to session-bound invocation context.
- Session Context — Current session-bound invocation context, endpoint caller-session metadata, disclosure, transfer-scope, and liveness rules.
- Session-Bound Invocation Context — Implementation plan for one-session-per-process invocation context and session-keyed shared services.
- Session-Bound Invocation Context — Session-bound invocation context and privacy-aware disclosure model replacing service-object identity migration.
- Standard App Capabilities — Per-app AppData storage, a user-mediated powerbox/file-picker grant, and attenuated capability sharing as standard app-facing capabilities.
- System Configuration and Operator Extensibility — Layered CUE configuration model for operator boot-manifest overlays, host-user injection, and per-user toolchain caches.
- User Identity and Policy — User, session, profile, RBAC/ABAC/MAC, and policy-layer model for capability grants.
Cryptography, Certificates, and Trust
- Certificates / TLS — Bounded implementation slice chain for the certificates/TLS track, from vendored verifier crates to a capOS-terminated Web UI endpoint.
- Certificates and TLS — Capability-native X.509, trust store, ACME, pinning, and TLS configuration model.
- Cryptography and Key Management — Capability model for keys, signing, encryption, vaults, entropy, and cryptographic policy.
- Google Drive Storage Backend — Use a Google-authenticated user’s Drive as a capOS storage backend behind the standard storage caps, via a browser-transport near-term path and a native OAuth2/HTTP/TLS backend later.
- Hardware Audit Log Persistence — Durable, tamper-evident persistence and admission policy for the hardware audit log.
- OIDC and OAuth2 — Federated login, OAuth2 clients, token capabilities, JWKS, DPoP, and broker integration.
- Telnet over TLS Shell — Optional TLS-protected Telnet TerminalSession gateway with client certificates and credential fallback.
- Time and Clock Authority — Capability-native wall-clock authority with provenance labeling, clock discipline, and trusted timestamps for audit and TLS.
- Volume Encryption — Encryption-at-rest model for system and user volumes with recovery and KMS options.
Security and Verification
- ABI Evolution Policy — Compatibility policy for capOS schema and ring ABIs.
- AWS Nitro EBS (NVMe storage) — Provenance map for the AWS Nitro EBS NVMe storage shape - spec basis, the standard-NVMe wire subset it shares with docs/devices/nvme.md, and the capOS cloud-shape classification plus DMA-backend policy it binds onto.
- Azure managed disk (NVMe storage) — Provenance map for the Azure managed-disk NVMe storage shape - spec basis, the standard-NVMe wire subset it shares with docs/devices/nvme.md, why the older-family virtio-scsi path is out of scope, and the capOS cloud-shape classification plus DMA-backend policy it binds onto.
- Cloud DMA Provider Evidence Inventory — Official AWS/Azure/GCP device-surface facts, an evidence-matrix schema, a live guest-probe checklist, and classification rules for the cloud DMA backend decision.
- Cloud Driver Foundation Gap Analysis — Gap analysis between the existing userspace virtio driver foundation and the blocked cloud NIC/storage driver tasks: what is already proven, the narrow per-task remaining work, and the superseded live-NIC runnable-now claim.
- Debug and Trace Authority — Capability-scoped debug session attach, read-only cap-table inspection, ring-trace replay, and sampler authority without ambient process inspection.
- Device Manager Refactor — Refactor direction for separating the kernel device authority ledger from QEMU proof scaffolding.
- DMA Assurance Model — Assurance model for DMA authority, backend selection, and proof obligations.
- DMA Isolation — DMA isolation model for device memory, IOMMU policy, and capability-scoped hardware access.
- DMA User-Space Driver Isolation — DMA, user-space driver, vIOMMU, and no-IOMMU bounce-buffer design consequences for capOS device authority.
- Error Handling — Current error model for capability ring CQE status, CapException payloads, endpoint RETURN exceptions, and ordinary schema result unions.
- Formal MAC/MIC — Formal mandatory access and integrity model for future policy and proof work.
- Full-Scope Review 2026-06-09 — Findings ledger and decomposition source for the 2026-06-09 full-scope review of the tree at 50e8eaba (review base bb776326e, 2026-05-23).
- GCP Persistent Disk (storage) — Provenance map for the GCP Persistent Disk storage shape - virtio-scsi vs NVMe families, the standard-NVMe wire subset it shares with docs/devices/nvme.md, the capOS cloud-shape classification, the DMA-backend policy on no-IOMMU GCE shapes, the local production brokered NVMe provider chain, and the bounded live-GCE NVMe Persistent Disk read proof.
- IOMMU Remapping Grounding — Primary-source grounding for Intel VT-d (landed under cfg(qemu)), AMD-Vi, and QEMU IOMMU remapping work.
- Memory Authority Model — Memory authority model backlog.
- Memory Authority Model — Memory authority, residency classes, mapping consistency, OOM boundaries, and proof obligations.
- NVMe — Provenance map for the NVMe controller wire subset capOS touches - conditional Model B validator scan targets, the read-only userspace bind, the reset-only CC selected-write claim, the no-IOMMU manager-op controller enable through the brokeredNvmeControllerEnable @6 verb, the no-IOMMU manager-op admin IDENTIFY through the brokeredNvmeAdminIdentify @7 verb, the brokered admin SQ/CQ doorbell + IDENTIFY command, the split admin SUBMIT @8 / COMPLETE @9 verbs whose completion handoff runs through a cap-waiter Interrupt.wait/acknowledge MSI-X route, the brokered I/O queue pair + bounded READ including one live-GCE Persistent Disk proof, and the dedicated BlockDevice data-completion Interrupt route - with spec basis and capOS authority mapping.
- NVMe Model B Doorbell DMA Validator — Conditional DMA-address ownership model for the userspace NVMe storage provider: provider-written queue-base and PRP/SGL addresses require a non-host-physical device-visible namespace; no-IOMMU GCP planning must use brokered bounce address publication instead.
- Panic Surface Inventory — Panic/unwrap/expect inventory.
- Public Release and Maintainer Boundaries — Public release posture, maintainer boundaries, issue intake, and repository hygiene gates.
- Remote Session UI Security — Web-security hardening posture for the trusted local remote-session-ui bridge, the capOS-served Web UI, public-origin carry-over policy, and the Tauri desktop wrapper.
- Repository Composition — Repository scope, sibling project split criteria, and cross-repository organization plan.
- Security and Verification — Security/verification backlog.
- Security and Verification — Security review vocabulary, trust-boundary checklist, and verification tracks for capOS.
- Security Verification Track Registry — Manual reference for Security Verification Track labels.
- Session Archive & Gantt Effort — A pipeline to collect, normalize, and archive per-task effort data from the run-telemetry log and agent session transcripts, enabling development timeline visualization and task-duration prediction.
- Trust Boundaries — The reviewer’s authority-boundary inventory.
- Trusted Build Inputs — Trusted toolchain inventory.
- Verification Workflow — The verification gates used by capOS.
Services, Operations, and Monitoring
- Benchmarks — Current benchmark policy and results.
- Capability-Infrastructure Cluster — Decomposition of the near-term capability-infrastructure cluster: matured proposals and Stage 6 remainder that share the schema serial surface.
- capos-service — Userspace service framework (Rust crate
capos-service) for lifecycle, endpoint loops, readiness, shutdown, metrics, context, and resource hooks. - Cloud Deployment — Cloud VM deployment plan covering hardware abstraction, storage, networking, and aarch64.
- Cloud Metadata — Cloud metadata and config-drive bootstrap through scoped configuration capabilities.
- Configuration — How operators extend the default capOS boot manifest with a gitignored
system.local.cueoverlay and convert CUE-authored data to specified Cap’n Proto schemas. - Crash Recovery and Supervision — Unplanned-failure detection, stale-cap propagation, structured crash records, watchdog liveness, and bounded restart policy for capOS services.
- Debug and Trace Authority — Capability-scoped debug session attach, read-only cap-table inspection, ring-trace replay, and sampler authority without ambient process inspection.
- Hardware Audit Log Persistence — Durable, tamper-evident persistence and admission policy for the hardware audit log.
- HPC Parallel Processing Patterns — Generic single-node and multi-node parallel processing patterns for HPC-style benchmark coverage.
- Live Upgrade — Service replacement, capability retargeting, quiesce/resume, and in-flight call handling.
- Rejected: Endpoint Badges as Service Identity — Post-mortem of the rejected seL4-style endpoint badge service identity model.
- Scientific Standard Package and Agent Lab Capabilities — Scientific standard package and agent-lab capability services for CAS, solvers, proof assistants, notebooks, and reproducible research environments.
- Service Architecture — Capability-based service composition, authority-at-spawn, exports, and service graph policy.
- Session Context — Current session-bound invocation context, endpoint caller-session metadata, disclosure, transfer-scope, and liveness rules.
- Session-Bound Invocation Context — Session-bound invocation context and privacy-aware disclosure model replacing service-object identity migration.
- Stateful Task and Job Graphs — Durable stateful task and job graphs for init orchestration, package builds, operator work, and notebook-style run stories without creating a god object.
- Superseded: Service Object Capabilities — Superseded service-minted object capability model that was replaced by session-bound invocation context.
- System Configuration and Operator Extensibility — Layered CUE configuration model for operator boot-manifest overlays, host-user injection, and per-user toolchain caches.
- System Monitoring — Capability-scoped logs, metrics, health checks, traces, crash records, and status views.
- System Performance Benchmarks — Correctness-gated benchmark model for primitives, workloads, and user stories.
- Time and Clock Authority — Capability-native wall-clock authority with provenance labeling, clock discipline, and trusted timestamps for audit and TLS.
AI, Agents, GPU, and Robotics
- Browser Capability and Agent Web Sessions — Browser profiles, cap-native document engines, visual browsing, and agent/shell browser sessions as capability-scoped services.
- Browser Engines, Document Engines, and Agent Browsers — Browser engine portability, cap-native document-engine options, and agent-browser patterns for capOS browser capabilities.
- capOS Agentic Development Experiment — Longitudinal study design for using capOS development sessions, subagents, reviews, and recap tooling as an agentic software-engineering experiment.
- capOS As A Robot Brain — Robotics service graph, actuator gateway, safety monitor, realtime island, and ROS bridge model.
- capOS Repository Harness Engineering — Repository-local harness engineering for making capOS legible, checkable, and safer for long-running coding agents.
- capOS-Hosted Agent Swarms — capOS-hosted OpenClaw-like personal agents, agent swarms, harness controls, memory, retrieval, and research agenda.
- Chat As Multimedia Substrate — Chat as unified text/audio/video multimedia transport across human, agent, and service participants, with listener-cap delivery and a clean WebRTC mapping.
- Enterprise Agent Game Showcase — Enterprise agent-management showcase through a capability-scoped business simulation game.
- GPU Capability — Capability-oriented GPU access, driver isolation, memory sharing, and CUDA-style compute model.
- Hosted Agent Harnesses — OpenClaw-like harnesses, swarms, memory/wiki systems, and agent orchestration research for capOS-hosted agents.
- Language Models and Agent Runtime — Language-model, embedder, agent-runner, and browser-agent capability interfaces.
- Linux Sandboxes and Virtualization for Workloads — Linux sandbox, container, gVisor, KVM, microVM, and CPU-isolation prior art for generic Linux workload execution.
- Multimedia Pipeline Latency — Research note.
- NO_HZ, SQPOLL, and Realtime Scheduling — Linux NO_HZ, io_uring SQPOLL, CPU isolation, PREEMPT_RT, SCHED_DEADLINE, and seL4 MCS grounding for capOS timer and realtime design.
- Realtime Multimodal Agent APIs — Research note.
- Realtime Voice Agent Shell — Realtime audio agent shell model across browser media, provider sessions, and brokered tools.
- Robotics Realtime Control — Research note.
- Scientific Agent-Lab Software Stack — Scientific computing, solver, proof-assistant, notebook, and reproducible-package prior art for a capOS-hosted LLM research lab.
- Scientific Standard Package and Agent Lab Capabilities — Scientific standard package and agent-lab capability services for CAS, solvers, proof assistants, notebooks, and reproducible research environments.
- Small LLM Survey — Model candidates for the on-ISO local LLM.
- Tickless and Realtime Scheduling — Tickless idle, SQPOLL nohz CPU isolation, request deadlines, scheduling contexts, and realtime islands.
Demos, Onboarding, and Contributor Surfaces
- Aurelian Frontier — Aurelian Frontier game-depth backlog.
- Aurelian Frontier — Capability-native Aurelian Frontier game design, mission model, content pipeline, and QEMU proof slice.
- Aurelian Frontier (proof slice) — Multi-process Aurelian Frontier smoke proof.
- Contributor Quest Mechanics — Contributor reward mechanics layered on Aurelian Frontier without granting repository authority.
- Enterprise Agent Game Showcase — Enterprise agent-management showcase through a capability-scoped business simulation game.
- First Chat Demo — Smallest resident-service proof.
- Game Mechanics Prior Art — Grounded mechanics research for Aurelian Frontier seasonal play, markets, construction, and tactical combat.
- Paperclips Terminal Demo — Clean-room incremental terminal demo.
- Paperclips Terminal Demo — Paperclips terminal demo backlog and content migration notes.
- Shared-Service Demos — Demo backlog.
Build, Tooling, and Documentation Site
- ABI Evolution Policy — Compatibility policy for capOS schema and ring ABIs.
- Build, Boot, and Test — Build, ISO, QEMU, host-test commands.
- capOS Agentic Development Experiment — Longitudinal study design for using capOS development sessions, subagents, reviews, and recap tooling as an agentic software-engineering experiment.
- capOS Repository Harness Engineering — Repository-local harness engineering for making capOS legible, checkable, and safer for long-running coding agents.
- Current Design Authority — Current-design authority map and proposal lifecycle rule for keeping implemented behavior out of archival proposal records.
- Documentation Workflow — How the mdBook site and generated PDF manual are positioned and built.
- mdBook Documentation Site — Documentation-site structure, metadata, status vocabulary, and curation workflow.
- Repository Composition — Repository scope, sibling project split criteria, and cross-repository organization plan.
- Repository Map — Source-tree subsystem index.
- Schema Registry — A SchemaRegistry capability that serves Cap’n Proto reflection metadata – interface IDs, method names and ordinals, parameter/result layouts, and doc comments – at runtime, as the machine-readable twin of the System Manual.
- System Manual Capability — A built-in man-pages analog: the Manual capability serves Unix-style reference pages, schema-derived interface manuals, and a man-shaped reference corpus through the shell, the self-served web UI, and a typed capnp API.
- Trusted Build Inputs — Trusted toolchain inventory.
Research and Papers
- Crash Recovery and Supervision — Prior-art survey of crash recovery and supervision for the Crash Recovery proposal.
- Debug, Trace, and Profiling Authority — Prior-art survey of debug/trace/profile authority for the Debug and Trace proposal.
- Papers — Long-form research write-ups.
- Research — Index of research deep-dive reports informing capOS design.
- seL4 HAMR — Evaluation of seL4 HAMR (AADL/Slang/CAmkES) versus the capOS Cap’n Proto schema-as-contract model.
- Time and Clock Authority — Prior-art survey of OS time/clock authority for the Time and Clock proposal.
Prior Art and Comparative OS Research
- Capability-Based and Microkernel Operating Systems Survey — Design consequences pulled from the survey.
- Cloudflare, Cap’n Proto, Workers RPC, and Cap’n Web — Cloudflare Workers, workerd, Durable Objects, Workers RPC, Cap’n Web, and Cloudflare’s production use of Cap’n Proto/KJ.
- EROS, CapROS, Coyotos — Persistent capability-system lineage.
- Future Scheduler Architecture — Survey of modern scheduler algorithms and architectures for capOS scheduler evolution.
- Game Mechanics Prior Art — Grounded mechanics research for Aurelian Frontier seasonal play, markets, construction, and tactical combat.
- Genode — Genode OS Framework: capability-based component model, session routing, VFS plugin architecture, POSIX compatibility, and Sculpt OS – with lessons for capOS.
- HPC Parallel Patterns — HPC benchmark and programming-model grounding for generic parallel processing patterns.
- Linux Sandboxes and Virtualization for Workloads — Linux sandbox, container, gVisor, KVM, microVM, and CPU-isolation prior art for generic Linux workload execution.
- Out-of-Kernel Scheduling — Prior art survey on kernel versus userspace CPU scheduling policy split, with capOS design implications.
- Plan 9 and Inferno — Plan 9 and Inferno: per-process namespaces, 9P protocol, file-server-as-service pattern, Dis VM, and Limbo concurrency — applied to capOS capability composition and IPC design.
- Scientific Agent-Lab Software Stack — Scientific computing, solver, proof-assistant, notebook, and reproducible-package prior art for a capOS-hosted LLM research lab.
- seL4 — Microkernel and capability reference.
- Spritely, OCapN, and CapTP — Spritely, OCapN, CapTP, netlayers, locators, Syrup, promise pipelining, handoffs, and capability-network lessons for capOS.
- Zircon — Fuchsia Zircon kernel: handle-based capability model, channels, VMARs/VMOs, async ports, and FIDL – with lessons for capOS capability dispatch, IPC, and memory design.
Stage Backlogs and Long-Form Planning
- Aurelian Frontier — Aurelian Frontier game-depth backlog.
- Capability-Infrastructure Cluster — Decomposition of the near-term capability-infrastructure cluster: matured proposals and Stage 6 remainder that share the schema serial surface.
- capOS SDK and Dual Transport — capOS front-door SDK crate with a transport abstraction for in-system and remote clients, plus crate-namespace publication.
- Certificates / TLS — Bounded implementation slice chain for the certificates/TLS track, from vendored verifier crates to a capOS-terminated Web UI endpoint.
- Cloud Driver Foundation Gap Analysis — Gap analysis between the existing userspace virtio driver foundation and the blocked cloud NIC/storage driver tasks: what is already proven, the narrow per-task remaining work, and the superseded live-NIC runnable-now claim.
- Cloud Image Import and Serial-Console Boot — Cloud provider disk-image import and serial-console-boot notes.
- Device Manager Refactor — Refactor direction for separating the kernel device authority ledger from QEMU proof scaffolding.
- Full-Scope Review 2026-06-09 — Findings ledger and decomposition source for the 2026-06-09 full-scope review of the tree at 50e8eaba (review base bb776326e, 2026-05-23).
- Go VirtualMemory Contract — VirtualMemory cap contract for Go.
- Hardware, Boot, and Storage — Hardware bring-up backlog.
- Installable System — Ordered implementation track turning the installable-system proposal into work grounded in the landed BlockDevice/filesystem/Store/writable-persistence/disk-image contracts.
- Local Users, Storage, and Policy — Identity/local-user backlog.
- Network Usability and Post-smoltcp — Network usability, resolver, diagnostics, and post-smoltcp backlog.
- NVMe Model B Doorbell DMA Validator — Conditional DMA-address ownership model for the userspace NVMe storage provider: provider-written queue-base and PRP/SGL addresses require a non-host-physical device-visible namespace; no-IOMMU GCP planning must use brokered bounce address publication instead.
- Paperclips Terminal Demo — Paperclips terminal demo backlog and content migration notes.
- POSIX Adapter Dash Port — POSIX adapter Phase P1.4 (dash port) backlog – libcapos-posix file/dir/stdio/env/printf surface, dash vendoring + per-call-site patch, and the run-posix-shell-smoke harness.
- Proposal Group Archive — Archived proposal cluster.
- Remote Session CapSet Client — Remote session CapSet client backlog.
- Research and Design Gaps — Research/design gap triage backlog.
- Run Targets, Init Mandate, and Default-Run Integration — Run-target governance.
- Runtime, Networking, and Shell — Runtime/network/shell backlog.
- Scheduler Evolution — Detailed task decomposition for future capOS scheduler evolution.
- Security and Verification — Security/verification backlog.
- Service Object Identity Migration — Superseded large-chunk migration plan for service object identity, retained as historical context after the active direction changed to session-bound invocation context.
- Session Archive & Gantt Effort — A pipeline to collect, normalize, and archive per-task effort data from the run-telemetry log and agent session transcripts, enabling development timeline visualization and task-duration prediction.
- Session-Bound Invocation Context — Implementation plan for one-session-per-process invocation context and session-keyed shared services.
- Shared-Service Demos — Demo backlog.
- SMP Phase C — SMP backlog.
- Stage 6 Capability Semantics — Stage 6 capability work.
Capabilities And Security
- POSIX fork/execve fd Inheritance — Target POSIX fork/execve full-fd-table inheritance for the recording shim, reconciled with the capability model, so unmodified POSIX software inherits stdio/cwd without bespoke per-app dup2 patches.
Hardware
- Network-Reachable Datapath Scope Decision — Scope decision recording that the real-GCE-boot milestone’s reachable-network-stack requirement means raw-frame TX/RX (Option A), not L4 sockets, grounded in what the billable cloudboot harness actually gates on.
- Phase C Userspace NIC Driver Relocation — Phase C design for relocating the virtio-net driver into userspace: the cap-surface delta, the inline-
DataNic ABI (matching the networking-proposal draft), the writable selected-write common-config window (an extension of the accepted notify-doorbell discipline; slice 1 landed 2026-06-02 20:30 UTC at c9518b2d), the userspace-vring slice that reuses the landed production DMA isolation (bounce policy + dma_backend probe + IOMMU IOVA-export), the sustained-receiveNicABI design used by the multi-frame TCP path, the selected serve-from-userspace 7c-ii(b) socket-authority proof, and retirement of the non-qemu legacy kernel socket grant path. - Real-Filesystem Decision — Real-filesystem direction for capOS: a role-split between capnp-native managed state and read-only FAT32 for host-populated/interop images, with ext4-read deferred and FAT write rejected, grounded in the existing Directory/File/Store cap surface and the storage layouts already in tree.
Hardware And Drivers
- ATAPI CD-ROM + ISO 9660 — Provenance map for the planned CD-ROM boot/install ATAPI PIO reader and read-only ISO 9660 driver - spec basis, implemented wire-format subset, and boot-only kernel-owned capOS mapping.
- AWS Nitro EBS (NVMe storage) — Provenance map for the AWS Nitro EBS NVMe storage shape - spec basis, the standard-NVMe wire subset it shares with docs/devices/nvme.md, and the capOS cloud-shape classification plus DMA-backend policy it binds onto.
- Azure MANA — Provenance map for the Azure MANA NIC / GDMA wire logic - spec basis, implemented host-conformance wire-format subset, and capOS authority mapping.
- Azure managed disk (NVMe storage) — Provenance map for the Azure managed-disk NVMe storage shape - spec basis, the standard-NVMe wire subset it shares with docs/devices/nvme.md, why the older-family virtio-scsi path is out of scope, and the capOS cloud-shape classification plus DMA-backend policy it binds onto.
- Device Driver Specifications — Per-device driver specs - cited authoritative spec, implemented wire-format subset, and capOS authority mapping.
- Device Spec Template — Blank three-part device-spec template - copy to docs/devices/
.md when starting a driver. - DMA User-Space Driver Isolation — DMA, user-space driver, vIOMMU, and no-IOMMU bounce-buffer design consequences for capOS device authority.
- FAT32 (read-only backer) — Provenance map for the read-only FAT32 Directory/File backer over virtio-blk and NVMe - spec basis, the vendored fatfs read subset used, timestamp provenance limits, and the capOS cap mapping.
- GCE gVNIC — Provenance map for the GCE gVNIC (Google Virtual Ethernet) NIC - spec basis from the public gVNIC docs and the GVE Linux driver, the wire-format subset capOS exercises today, and the bounded Nic-cap adaptation proof. capOS has live-GCE inventory, admin-queue/register, raw-frame GQI/QPL TX/RX, and typed Nic-adaptation proofs, but no reusable gVNIC provider service or host conformance suite yet.
- GCP Persistent Disk (storage) — Provenance map for the GCP Persistent Disk storage shape - virtio-scsi vs NVMe families, the standard-NVMe wire subset it shares with docs/devices/nvme.md, the capOS cloud-shape classification, the DMA-backend policy on no-IOMMU GCE shapes, the local production brokered NVMe provider chain, and the bounded live-GCE NVMe Persistent Disk read proof.
- NVMe — Provenance map for the NVMe controller wire subset capOS touches - conditional Model B validator scan targets, the read-only userspace bind, the reset-only CC selected-write claim, the no-IOMMU manager-op controller enable through the brokeredNvmeControllerEnable @6 verb, the no-IOMMU manager-op admin IDENTIFY through the brokeredNvmeAdminIdentify @7 verb, the brokered admin SQ/CQ doorbell + IDENTIFY command, the split admin SUBMIT @8 / COMPLETE @9 verbs whose completion handoff runs through a cap-waiter Interrupt.wait/acknowledge MSI-X route, the brokered I/O queue pair + bounded READ including one live-GCE Persistent Disk proof, and the dedicated BlockDevice data-completion Interrupt route - with spec basis and capOS authority mapping.
- virtio-blk — Provenance map for the QEMU-fixture virtio-blk BlockDevice driver - spec basis, implemented wire-format subset, capOS authority binding, and why it is a qemu-gated fixture rather than the production storage route.
- virtio-net — Provenance map for the in-tree modern virtio-net PCI NIC - spec basis, implemented wire-format subset, and capOS authority binding.
- virtio-rng — Provenance map for the in-tree virtio-rng entropy device - spec basis, implemented wire-format subset, and its role as a QEMU-only DDF metadata and IOMMU-remapping hardware-DMA proof fixture (no userspace-facing capability, not a production driver).
Programming Languages And Runtimes
- POSIX fork/execve fd Inheritance — Target POSIX fork/execve full-fd-table inheritance for the recording shim, reconciled with the capability model, so unmodified POSIX software inherits stdio/cwd without bespoke per-app dup2 patches.
Remote Session
- Remote Session CapSet Clients — Remote host app model for authenticated capOS sessions, broker-issued CapSet views, and typed capability calls over Cap’n Proto RPC.
- Remote Session UI Security — Web-security hardening posture for the trusted local remote-session-ui bridge, the capOS-served Web UI, public-origin carry-over policy, and the Tauri desktop wrapper.
Security
- Phase C Userspace NIC Driver Relocation — Phase C design for relocating the virtio-net driver into userspace: the cap-surface delta, the inline-
DataNic ABI (matching the networking-proposal draft), the writable selected-write common-config window (an extension of the accepted notify-doorbell discipline; slice 1 landed 2026-06-02 20:30 UTC at c9518b2d), the userspace-vring slice that reuses the landed production DMA isolation (bounce policy + dma_backend probe + IOMMU IOVA-export), the sustained-receiveNicABI design used by the multi-frame TCP path, the selected serve-from-userspace 7c-ii(b) socket-authority proof, and retirement of the non-qemu legacy kernel socket grant path.
Storage
- FAT32 (read-only backer) — Provenance map for the read-only FAT32 Directory/File backer over virtio-blk and NVMe - spec basis, the vendored fatfs read subset used, timestamp provenance limits, and the capOS cap mapping.
- Real-Filesystem Decision — Real-filesystem direction for capOS: a role-split between capnp-native managed state and read-only FAT32 for host-populated/interop images, with ext4-read deferred and FAT write rejected, grounded in the existing Directory/File/Store cap surface and the storage layouts already in tree.
- virtio-blk — Provenance map for the QEMU-fixture virtio-blk BlockDevice driver - spec basis, implemented wire-format subset, capOS authority binding, and why it is a qemu-gated fixture rather than the production storage route.
Roadmap
Long-term direction for capOS. Related material lives elsewhere: detailed task
decomposition in docs/backlog/, selected-milestone state in
docs/tasks/state.toml, current execution order in root task records under
docs/tasks/, and shipped-milestone reports in docs/changelog.md.
Current Direction
Current selected milestone: GCE Self-Hosted Web UI.
The next visible goal is a self-hosted capOS Web UI reachable through the
Phase C userspace network stack, then proved on private GCE reachability before
any public endpoint. The userspace smoltcp-backed TcpListenAuthority local
path is proved by
cloud-prod-userspace-network-stack-smoltcp-local-proof.
The local DHCP/IPv4 configuration proof is done by
cloud-prod-network-stack-dhcp-ipv4-config-local-proof:
the userspace stack acquires a QEMU SLIRP DHCPv4 lease, installs the default
route, resolves gateway and same-subnet ARP neighbors, and serves
NetworkManager.getConfig before public or live GCE exposure. The
cloudboot-local Web UI authority inventory is done by
remote-session-webui-cloudboot-authority-inventory:
it records the required and forbidden remote-session-web-ui grants, trusted
listener/source metadata, browser-visible forbidden markers, and local L4 proof
markers for the completed cloudboot proof. Server-side session hardening is done by
remote-session-web-ui-session-hardening
(Review C high closed: unpredictable rotated server-side session ids, idle/absolute
expiry enforced before dispatch, Host/Origin/double-submit-CSRF gates, and a
Secure-when-HTTPS cookie posture). Web UI connection bounds are done by
remote-session-web-ui-connection-bounds
(per-connection request-read/response-send deadlines in the Web UI client over
the bounded network-stack listener, with a drip-feed abandon proof).
The legacy kernel socket-path retirement is done by
cloud-prod-legacy-kernel-network-socket-path-retirement:
non-qemu production manifests reject kernel network_manager /
tcp_listen_authority grants, leaving those sources as qemu-only fixtures.
The local
cloud-prod-remote-session-web-ui-l4-local-proof
is the done service-level L4 proof on top of the userspace L4 and DHCP/IPv4
substrate. The legacy-virtio serving gap is closed locally by
cloud-gce-legacy-virtio-webui-serving-local-proof
(2026-06-11): a kernel-brokered legacy virtio 0.9 runtime backs the typed
Nic cap and a host HTTP peer fetches the byte-verified UI bundle under
disable-modern=on. A public-ingress hardening set is done on the L4 gate
(public-origin policy, IAP-aware SameSite cookie policy, JSON content-type
guard, security response headers and strict CSP, GFE-range-pinned
forwarded-scheme trust, the public /healthz contract, and in-guest login
peer-gate/backoff hardening), and a no-spend provider-harness fixture set is
done (private --preflight-only, private/public proof-evidence validators,
public ingress plan gate, journal-driven teardown engine, provider-command
allowlist gate) — all local QEMU/cloudboot or recording-stub fixture evidence
with no real provider invocation or mutation; the current ladder summary lives
in
Current Status.
cloud-gce-private-self-hosted-webui-proof
remains on hold: the cloudtest credential lacks the firewall IAM a private
same-VPC probe needs against GCE default-deny ingress, and the live run needs
per-run billable authorization. Public
GCE ingress and TLS remain under the explicit on-hold
cloud-gce-public-self-hosted-webui-ingress-tls
task and require separate authorization; the selected milestone does not grant
public exposure, broad firewall changes, TLS key custody, or production release
authority. The capOS-terminated TLS successor remains a separate later
evidence class behind the provider-terminated first public proof.
The previous selected milestone, Installable System, is complete through
commit 12b8334a (commit timestamp 2026-06-07 18:19 UTC; task closeout
2026-06-07 18:20 UTC) for the bounded local/QEMU contract: persistent
data-region mount, config-overlay compose/merge fallback, generation/rollback
machinery, integrated installable disk packaging, target-disk install
(make run-installable-install), first-boot provision
(make run-installable-provision), update/rollback
(make run-installable-update), and structural proposal/body wording reconcile
are landed. The closeout preserves the RAM-only Namespace caveat and does not
claim secure boot/signing, production release authority, public ingress,
AWS/Azure live support, direct-remapping production hardware, userspace
smoltcp/L4 readiness, or full durable account policy. Detailed decomposition
lives in docs/backlog/installable-system.md.
The preceding selected milestone, Device Driver Foundation, is complete by
the 2026-06-07 08:23 UTC production-authority closeout recorded in
ddf-production-authority-closeout.
That closeout ties together the landed provider-driver, interrupt, audit, and
DMA-policy prerequisites and preserves the runtime fail-closed DMA backend
baseline: remapping only when capOS can validate it, otherwise brokered bounce
buffers or unsupported. The related GCP-first provider NIC/storage rollup is
also closed by
cloud-usable-instance-provider-nic-storage
(2026-06-07 05:26 UTC), but only for the recorded operator serial path,
selected raw-frame NIC/storage evidence, and gVNIC portability evidence. Public
L4 ingress, AWS/Azure live support, direct-remapping production hardware,
device-autonomous MSI-X delivery, userspace smoltcp/L4 readiness, and
high-throughput or multiqueue NIC readiness remain explicit future follow-ups,
not part of the closed DDF selected milestone.
The previous selected milestone, In-Process Threading Scalability, is
complete at commit 136b72de (2026-05-01 14:58 UTC) after repairing the
benchmark validity issue found on 2026-05-01: the old 1 MiB/spinning-parent
workload was not a valid four-core scaling reference because the matching Linux
pthread baseline also stayed flat at four workers. The repaired shape now uses a
blocking parent join, 262,144 blocks (16 MiB), and work_rounds=64. The
controlled capOS/Linux pair on capos-bench 2026-05-02 21:38 UTC against
main commit 374f8556 (5 runs each, both pinned to physical-core logical
CPUs 0,1,2,3) recorded capOS 1-to-2 work/total speedups 1.883x /
1.787x and matching Linux pthread baseline 1.988x/1.987x. Its
1-to-4 row became the diagnostic that justified Phase D’s fair-share enqueue
policy: capOS sat at 1.566x/1.538x while Linux scaled to
3.963x/3.858x on the same physical-core pin set. Phase D WFQ has now
closed that diagnostic gap as a scheduler-evolution milestone, recording capOS
3.088x/2.700x and Linux 3.974x/3.850x on 2026-05-10. These rows are
summarized in docs/benchmarks.md and docs/changelog.md. Historical
pre-collapse 1-to-2
(1.828x/1.687x) and the post-collapse 3-run diagnostic remain in
docs/benchmarks.md for reference. Ordinary -smp 2 regression coverage
also passed.
The previous selected milestone, Multi-Process SMP Concurrency, is
complete at commit 3fb89923 (2026-04-30 09:45 UTC):
make run-smp-process-scale has repeated KVM-backed evidence for independent
CPU-bound worker processes with 1.608x 1-to-2 speedup, and the ordinary
run-smoke/run-spawn coverage passed under -smp 2.
The previous selected milestone, Session-Bound Invocation Context, is
complete: normal workload processes have one immutable live session context,
endpoint calls reveal only privacy-preserving caller-session metadata by
default, explicit subject disclosure is gated by request and scope, and
chat/adventure/terminal/stdio paths no longer derive ordinary caller identity
from caller-selected service-visible metadata. Gate 4 verification is recorded
at commit faeff80 (2026-04-29 21:39 UTC), and paper/status closeout is
merged at commit 503abc9. Follow-up session lifecycle work remains outside
that completed milestone: production interactive shells need mutable session
liveness cells, explicit logout/close propagation, and renewal/recovery paths
so fixed short expiry is not the only way to bound stale authority.
Username-aware local password login is prioritized ad-hoc implementation work, not the selected milestone, unless explicitly selected later.
Current priority ladder, reflecting user direction (2026-05-05 17:56 UTC redirect supersedes the earlier SMP/threading-first ladder; the previous ordering is retained as background only at the end of this section):
- Userspace driver transition prerequisites – the S.11.2
hostile-smoke gate items in
docs/dma-isolation-design.mdand the matching open items ofdocs/backlog/hardware-boot-storage.mdTask 3 are now closed. S.11.2.7 stale IRQ after revoke/reset closed2026-05-05 18:17 UTCvia real-INT $vectorcross-reset injection inmake run-net. S.11.2.8 stale DMA completion after revoke/reset closed2026-05-05 19:37 UTCvia the device-managerprove_qemu_stale_dma_completion_handoffproof inmake run-net: real virtio-net DMA page free + reallocate cycle bumps the live ledger’s page generation at three boundaries (after revoke, after detach, after reset/reuse), then a synthesized staleDeviceDmaAllocationis fed to the productiondevice_dma::record_virtio_net_completion_for_allocationpath and rejected asstale-dma-handlewith side-effect blocking. S.11.2.9 hostile-smoke gate-wiring closed2026-05-05 20:49 UTCby aggregating every hostile-smoke acceptance matrix proof line into themake run-net->tools/qemu-net-smoke.shgate, including the newly wireddevice-manager: devicemmio driver crash hook proofanddevice-manager: interrupt driver crash hook proofassertions. The manifest-grantedDMAPoolpath currently exposes eight fixed manager-owned bounce-bufferDMABufferresult caps with typed allocate/free/map/unmap/submit/complete surfaces;DMABuffer.unmapremoves only the caller’s borrowed userspace VMA and preserves pool/page and descriptor accounting, and acceptedsubmitDescriptornow writes a bounded provider-owned queue entry plus submit marker after authority validation and the submit scrub. The manifest-grantedDeviceMmiopath now exposes a read-only borrowed userspace VMA over boot-preseeded BAR pages, with explicitDeviceMmio.unmap, duplicate-map/no-op-unmap denials, revoke-before-detach cleanup, brokered read-onlyread32, and one boundedwrite32effect for the provider-scoped PCI MSI-X metadata-derived virtio-rng vector-control mask dword, while arbitrary register writes, doorbells, host physical/IOVA exposure, and production provider-driver consumers remain blocked. The remaining gating prerequisites for moving NIC/block drivers out of the kernel are production userspaceDMAPool/DeviceMmio/Interrupthandles, real device-manager page quiesce/scrub/release hooks, real userspaceInterruptwaiter objects, and durable/signed production audit consumption beyond the first volatileHardwareAuditLog.snapshotcap. IOMMU domain programming has landed for the bounded QEMU Intel remapping path (umbrella closed2026-05-23 23:35 UTC); production-hardware IOMMU programming, AMD-Vi, and trusted sharing groups remain future work. The device-manager refactor proposal is already onmainat commit77358400; treat its proof/handles/domain/transaction-helper splits as high-priority, behavior-preserving risk reduction only when they unblock or lower risk for those DDF authority gates. It remains subordinate to behavior-moving DDF slices and the scheduler SMP/nohz prerequisite chain. - Scheduler evolution in
docs/backlog/scheduler-evolution.md: Phase D best-effort fair scheduling closed at commit77caafc0(2026-05-10 19:39 UTC) and docs commit1a08ec23(2026-05-10 21:47 UTC). The WFQ slice uses per-thread vruntime accounting,SchedulingPolicyCapweight/latency-class authority, per-CPU WFQ run queues, and bounded steal/migration invariants. The controlled Task 6 benchmark pair materially closed the 1-to-4 thread-scale diagnostic gap: capOS recorded work/total speedups3.088x/2.700xversus the prior1.566x/1.538xbaseline, while Linux on the same host/pin set recorded3.974x/3.850x. Phase ESchedulingContextcapability follow-ups are now closed: endpoint donation/return and the scheduler-observableUserSession.logout()hook are merged; timeout/depletion notifications use fixed per-context cells plus drain observer results; ordinary non-donated session-logout stale-context coverage is proven; donated receiver logout keeps the conservative counted/skipped policy until endpoint return restores only reduced donor budget; and clean local owner-shell exit calls the sameUserSession.logout()path before process exit. Phase F auto-nohz / SQPOLL / tickless idle follows Phase E; the one-SQ-consumer ring ownership prerequisite,CpuIsolationLeasescaffold, nohz activation/deactivation telemetry child, and explicit housekeeping/deferred-work placement, bounded SQPOLL ring mode, the clockevent/deadline substrate, and bounded producer-wake SQPOLL progress are complete. The telemetry proof records accepted active candidates, rejected activation decisions, stale/revoked rollback labels, ready and selected housekeeping CPUs, selected deferred-work placement or fail-closed reasons, target runnable entity counts, monotonic clocksource/accounting readiness, and explicit disabled tick/SQPOLL/full-nohz guardrails. The first two automatic nohz activation increments have since landed: theCpuIsolationLeasepreflight performs real per-CPU periodic-tick suppression for the narrow single-runnable-entity window with fail-closed rollback (docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md), and a ring-coupledkernelSqpolllease whose bound ring is in SQPOLL running/sleeping mode with a live owner is admitted for tick suppression with the SQPOLL ring-state re-check as the decisive rollback gate (docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md). Timeout-based auto-revoke, generic full-nohz for explicitly budgeted compute leases, and generic SQPOLL nohz for explicitly leased caller-thread rings have since landed; production policy-service issuance and broader userspace-poller/device-queue admission remain future work. The future full-SMP hardware scalability milestone is now recorded in the existing SMP/scheduler/benchmark/HPC proposal set anddocs/backlog/scheduler-evolution.mdPhase F.5. It targets direct high-core hardware/perf-runner rows at 1/2/4/8/16/32 workers, with QEMU kept for boot/regression and virtualization context rather than as the primary performance source. Phase G realtime islands follows Phase F. EEVDF is retained as a follow-on policy evaluation, not a Phase D blocker; generic full-nohz is landed for explicitly budgeted compute leases, with policy-service issuance still future. - Language-support tracks remain active high-priority parallel work
alongside the kernel/scheduler focus. POSIX adapter v0 P1.2 (UDP
cap + dns.c) and P1.3 (Pipe cap + fork-for-exec + recording-shim
posix_spawn) landed; the remaining v0 phase is P1.4 (dash port- libcapos-posix file/dir/stdio/env/printf surface + the
run-posix-shell-smokeharness), which is in flight against the Storage Phase 3 RAM-backedFile/Directory/Store/Namespacecaps. P1.4 Slice 3 (FdBacking File/Directory/Terminal variants +make run-posix-file-backing-smoke) landed atae58f936, and Slice 4 (absolute-path resolver + functionalopen()/opendir()over the bootstrap-granted root Directory cap with per-fd file position +make run-posix-open-smoke) landed at94b29177. The file/directory fd closeout landed at commitf97d9833(2026-05-23 06:23 UTC):make run-posix-fileprovesopen(),write(),lseek(),read(),opendir(),readdir(), andclosedir()through a live POSIX C process. Together these bring POSIX file I/O to functional end-to-end parity as the first non-shell POSIX subsystem. Identity stubs landed at commit1a8a9896(2026-05-23 06:51 UTC):make run-posix-identityproves parent and fork/exec childgetpidlines with hardcoded uid/gid0. The printf/string subset now hasmake run-posix-printf, which proves formatted output plus string/mem, numeric conversion, and ctype behavior from a live capOS C process. The signal/time surface landed at commit90e64011(2026-05-23 08:11 UTC):make run-posix-signal-timeproves Timer-backedtime,nanosleep, andsleepplus fail-closed signal-delivery stubs from a live capOS C process. Remaining P1.4 work is dash vendoring + smoke (Slices 11-13). Long-form decomposition lives indocs/backlog/posix-adapter-dash-port.md. WASI host adapter v0 W.1/W.2, Lua iteration follow-ons, libcapos / libcapos-posix successor work, and Go runtime stay in the parallel pool when selectable.
- libcapos-posix file/dir/stdio/env/printf surface + the
- Storage capability interfaces, starting with RAM-backed
Store/Namespace; proceed to local disk and a small read-only filesystem when the block path and the userspace-driver gate are ready. Phase 2 (schema-onlyBlockDevice/File/Directoryinterfaces), Phase 3 slice 1 (minimal RAM-backedFileCapObjectwith theKernelCapSource::filegrant source and themake run-file-server-smokeproof), Phase 3 slice 2 (minimal RAM-backedDirectoryCapObjectwith theKernelCapSource::directorygrant source, result-cap transfer ofFile/Directoryhandles, and themake run-directory-server-smokeproof), and Phase 3 slice 3 (theStore/Namespaceschema interfaces plus minimal RAM-backedStore/NamespaceCapObjects with theKernelCapSource::store/KernelCapSource::namespacegrant sources, content-addressed blob storage,Namespace.sub()result-cap transfer, and themake run-store-namespace-smokeproof) have landed. The local-disk path has also reached its first read-only milestone: the first virtio-blkBlockDeviceCapObject(make run-virtio-blk) and a read-only filesystem service overBlockDevice(kernel/src/cap/readonly_fs.rs, parsing a fixedCAPOSRO1on-disk layout and servingDirectory.list/open+File.read;make run-storage-fs) now serve a known on-disk tree to a userspace consumer. The Local Disk Storage Milestone’s final gate has also landed: a disk-backed persistentStore(kernel/src/cap/persistent_store.rs, aCAPOSST1on-disk layout written through the virtio-blk driver, granted via thepersistent_storeKernelCapSource) with a two-pass reboot proof (make run-storage-persist) that stores+commits a capnp object on the first boot and reads it back on a fresh boot of the same disk image. The Writable Local Storage Milestone has now landed: directory/file mutation, the fail-closed concurrent-writer policy, clean-reboot durability for both filesystem mutations and co-locatedStoreobjects on one disk (kernel/src/cap/writable_fs.rs, aCAPOSWF1sub-volume; two-pass proofmake run-storage-writable), and a bounded unclean-shutdown recovery proof (make run-storage-writable-recovery): an induced forced poweroff in the record-written / superblock-pending window proves the next mount recovers to a consistent tree with the interrupted allocation atomically absent. Seedocs/proposals/storage-and-naming-proposal.md. - Keep serial diagnostics as the first remote troubleshooting path for
cloud/hardware bring-up, then add SSH, Telnet development access, and
basic WebShell access when network and identity prerequisites are
credible. The host-served remote-session UI remains separate from the
self-served capOS web UI path. The old self-served proof target is retired
with the qemu-only kernel TCP listener; the replacement proof is the future
Phase C Web UI L4 gate. Ordinary
make runstill starts the host-local remote-session CapSet path, and the full boot-resource UI bundle is served with fixed names and integrity labeling. The host-servedmake remote-session-uibridge remains a separate trusted development path, not the self-hosted cloud Web UI proof. - Boot on GCP/AWS in staged provider tracks. The first GCP serial-console boot proof landed as run
1778230874-715a(2026-05-08 09:06 UTC, source commit3951e275). The GCP-first usable-instance provider rollup is also closed: serial-console operator access, live virtio-net raw-frameprovider-nic-bound, live NVMe Persistent Disk brokeredREAD, and separate gVNIC raw-frame / typed-Nic portability evidence are recorded undercloud-usable-instance-provider-nic-storage. AWS/Azure providers, public L4 ingress, SSH/WebShell productization, broader storage variants, and cloud benchmark reruns remain future gates.
Game/demo plans (Paperclips, Aurelian Frontier) are deprioritized
opportunistic-only per the same redirect; see docs/tasks/README.md Ad-Hoc
Planning / Research Tasks for the High / Normal / Low / Closed bands and
the dispatch ordering.
Earlier (pre-2026-05-05) priority ladder retained as background:
- Finish a reasonable SMP/threading milestone, including the current scheduler hot-lock bottleneck if the milestone still claims scalability.
- Build the device-driver foundation before cloud/network/storage expansion: ACPI/MADT/MCFG, PCI/PCIe, I/O APIC, MSI/MSI-X, DMA/MMIO/IRQ authority, and reusable virtio/device lifecycle code.
- Implement storage capability interfaces, starting with RAM-backed
Store/Namespace; proceed to local disk and a small read-only filesystem when the block path is ready. - Keep serial diagnostics as the first remote troubleshooting path for cloud/hardware bring-up, then add SSH, Telnet development access, and basic WebShell access when network and identity prerequisites are credible.
- Boot on GCP/AWS in two stages: first imported-image serial-console boot, then a usable cloud instance with provider storage/network drivers and network shell access.
The 2026-05-05 ladder above is the authoritative current ordering; the earlier ladder remains as background context only.
Details:
docs/tasks/README.mddocs/backlog/smp-phase-c.mddocs/backlog/session-bound-invocation-context.mddocs/proposals/session-bound-invocation-context-proposal.mddocs/proposals/user-identity-and-policy-proposal.mddocs/backlog/local-users-management.mddocs/proposals/boot-to-shell-proposal.mddocs/proposals/oidc-and-oauth2-proposal.md
Whitepaper Track
A future capOS whitepaper / technical report consumes – not duplicates –
work from the other tracks. The plan, outline, and live evidence-gap log
remain in docs/paper/ (plan.md, outline.md, evidence-gaps.md).
The paper itself is a Typst project at papers/schema-as-abi/ and is
built via make paper.
The paper’s Tier-1 evidence requirements pull these existing items into explicit paper-supporting roles. They are not new tracks; they are the selection lens this track applies:
- Stage 6 session-bound invocation context migration (closes the “interface IS the permission” claim).
- A measurement harness over
make run-measureproducing reproducible ring throughput,cap_enterlatency, IPC handoff, and schema-dispatch numbers (closes the ring-as-sufficient-boundary claim). - A paper-scoped persistence proof-of-concept narrower than the storage proposal (closes the wire-format-enables-persistence claim).
- A paper-scoped network-transparency proof-of-concept narrower than the general networking proposal (closes the wire-format-enables-network-transparency claim).
- At least one of {promise pipelining, notification objects} (closes capnp-rpc-shaped composition beyond CALL/RECV).
Tier-2 strengtheners: ring-protocol Kani proof, full concurrent SMP scheduling, end-to-end SSH Shell Gateway, one non-toy demo beyond Adventure or First Chat.
Out of scope for the first paper (acknowledge in Future Work only): aarch64, GPU, live upgrade, formal MAC/MIC, Go/WASI, cloud metadata, production volume encryption.
When workplan slices close a paper-evidence gap they should reference
docs/paper/evidence-gaps.md and update it in the same task, including
the matching #todo block in papers/schema-as-abi/main.typ. A
structural pre-evidence draft already exists at
papers/schema-as-abi/main.typ; the abstract, the Evaluation section,
the Conclusion, and any contribution claim that depends on missing
Tier-1 evidence stay deferred until that evidence lands. New paper
content that does not depend on missing artifacts may be drafted at
any time and lives next to the existing #todo blocks.
Completed Foundation
- Stage 0: Foundations: bitmap physical frame allocator, heap for
alloc, IDT exception handling, and initial Cap’n Proto schema scaffolding. - Stage 1: Virtual Memory: kernel and per-process address spaces, page table abstraction, HHDM preservation, and user-half cleanup.
- Stage 2: User-Space Transition: GDT/TSS/syscall setup and Ring 3 round-trip path.
- Stage 3: Process Abstraction: ELF loading, process ownership of address
spaces and cap tables, process exit cleanup, and the current
exit/cap_entersyscall surface. - Stage 4: Capability Syscalls / Ring Transport: Console capability,
shared-memory submission/completion rings,
cap_enter, CQE transport errors, and alloc-free dispatch paths. - Stage 5: Scheduling Core: PIT/PIC timer preemption, round-robin scheduler, context switching, generation-tagged caps, and VirtualMemory cap.
- Kernel Networking Smoke: in-kernel QEMU virtio-net lower-layer fixture evidence for PCI/device discovery, descriptor-accounting guards, ARP, and ICMP. TCP/UDP socket proof has moved to the Phase C userspace network-stack gates.
- Boot To Shell / Native Shell: shell-led boot flow, split debug/terminal UARTs, local setup/login, anonymous/operator sessions, and shell REPL.
- Verified Core: bounded local/GitHub Kani model-checking gate plus high-memory proof gate for selected cap-table, frame-bitmap, transfer rollback, and resource accounting invariants. These are bounded model checks (small input sizes such as <=8 frames and 63 ELF bytes), not unbounded proofs; they hold within the harness bounds, not for all inputs.
- Shared-Service Demo Base: chat, adventure, NPC-as-process, and shared service harness prototypes.
Historical completion reports live in docs/changelog.md.
Stage 6: IPC And Capability Transfer
Outcome: cross-process capability calls, capability transfer, revocation, and process spawning are capability-shaped and usable by init-owned service graphs. Caller-selected service-visible identity is being replaced by session-bound invocation context: each normal process has one immutable session context, endpoint calls expose privacy-preserving caller-session metadata, and broker-granted service roots/facets carry service access.
Implemented:
cap_enterblocking wait- Endpoint kernel object
- RECV/RETURN ring opcodes
- cross-process IPC
- direct-switch IPC handoff
- legacy endpoint receiver metadata as transitional IPC machinery
- copy/move capability transfer
CAP_OP_RELEASE- runtime handle release integration
- epoch revocation and Revocable Read proof
- MemoryObject substrate – the kernel-level mapping mechanism that backs
zero-copy IPC. Demonstrated end-to-end by
make run-memoryobject-shared(single-shot transfer) andmake run-ipc-zerocopy(multi-message shared point-to-point buffer with metadata-only endpoint CALLs). The typedSharedBuffersurface and service APIs that consume it (File.readBuf,BlockDevice.readBlocks, NIC RX/TX rings) are still pending. - ProcessSpawner / ProcessHandle
- init-owned manifest execution and boot package boundary cleanup
- immutable per-process
SessionContextownership, default child-session inheritance, and trusted broker-selected child sessions, demonstrated bymake run-session-context
Remaining themes:
- typed
SharedBuffercapability and consuming service APIs (storage, block, network, GPU) on top of the existingMemoryObjectsubstrate - notification objects (so zero-copy producers/consumers can signal each other without per-record endpoint CALLs)
- promise pipelining
- CapabilityManager list/grant interface
- stable service-audit identity for endpoint caller-session references across intentional service replacement or upgrade
- scheduling context and resource donation
- init ELF embedding
Details:
docs/backlog/session-bound-invocation-context.mddocs/backlog/service-object-identity-migration.md(superseded)docs/backlog/stage-6-capability-semantics.mddocs/proposals/service-architecture-proposal.mddocs/proposals/storage-and-naming-proposal.mddocs/proposals/error-handling-proposal.md
Stage 7: SMP, Runtime, Networking, And Shell
Outcome: capOS moves from single-CPU scheduling and local-only shell access to multi-CPU execution, thread-aware runtime behavior, socket-shaped network capabilities, and agent/web shell entry points.
SMP status:
- Phase A complete: BSP per-CPU syscall stack/current-thread state and unified kernel-entry stack hook.
- Phase B complete: APs start through Limine MP, switch to capOS kernel paging/stacks, initialize AP-local CPU state, and park.
- Phase C selected AP scheduler-owner proof complete: GS/
swapgs, LAPIC timer/IPI, TLB shootdown, and first AP scheduler-owner proof are complete. Commitd88bca7at2026-04-25 11:31 UTCproves AP cpu=1 can run scheduler-owned user contexts under-smp 2while a scheduler-owner latch keeps the BSP in kernel idle. Per-CPU scheduler ownership, the narrow idle-to-runnable reschedule-IPI wake path, and the focused process-scale proof harness are now present. - Multi-Process SMP Concurrency is complete at commit
3fb89923(2026-04-30 09:45 UTC).make run-smp-process-scalerecords repeated raw QEMU serial logs plus per-case medians and fails closed below the1.6xspeedup threshold. The accepted KVM-backed run recorded1.608x1-to-2 speedup, and ordinaryrun-smoke/run-spawncoverage passed under-smp 2. - In-Process Threading Scalability has the formal capOS+Linux
thread-scale evidence pair on
capos-bench2026-05-02 21:38 UTC againstmaincommit374f8556: capOS work1.883xand total1.787xclear the configured 1-to-2 gates against the then-current single-global-queue scheduler; matching Linux pthread baseline1.988x/1.987xvalidates the workload shape. Its 1-to-4 row became the diagnostic that justified Phase D’s fair-share enqueue policy (capOS1.566x/1.538xvs Linux3.963x/3.858xon the same physical-core pin set). Phase D WFQ later manually accepted the recorded 1-to-4 diagnostic with capOS3.088x/2.700xand matching Linux3.974x/3.850x.
Runtime/network/shell themes:
- reconcile in-process threading implementation status and any follow-on work
- scheduler evolution after the accepted Phase D WFQ closeout: Phase E
SchedulingContextcapability authority is closed; CPU isolation housekeeping/deferred-work placement is closed; bounded SQPOLL ring mode and the clockevent/deadline substrate are closed; bounded non-periodic SQPOLL producer-wake progress is closed. The narrow single-runnable-entity and SQPOLL-coupled automatic nohz activation increments are closed (scheduler-phase-f-auto-nohz-activation,scheduler-phase-f-auto-nohz-sqpollunderdocs/tasks/done/2026/); generic full-nohz for explicitly budgeted compute leases and generic SQPOLL nohz for explicitly leased caller-thread rings have since landed, while policy issuance remains future work. Keep EEVDF as a follow-on best-effort ordering evaluation and keep stateful task/job graph coordinators above CPU dispatch rather than turning them into global schedulers. Userspace policy-service AutoNoHz placement for ordinary “capable of saturating a CPU core” threads sits in Phase H ofdocs/backlog/scheduler-evolution.mdand the “Policy-Service Userstories” section ofdocs/proposals/tickless-realtime-scheduling-proposal.md: the policy-service-issuedCpuIsolationLeaseadds placement isolation only and never mints CPU-time authority, with bounded lifetime, revocation, accounting target, and operator-declared auto-claim pool - session lifecycle for production shell UX: mutable session liveness cells,
UserSession.logout, owner-shell/gateway close propagation, and narrow renewal/recovery paths that mint fresh grants without reviving stale ordinary caps; clean local owner-shell exit now reaches the logout path, while renewal/recovery remains future work - Telnet Shell Demo as first TCP-backed
TerminalSessionproof. Plaintext, loopback-only research demo; not a shippable Telnet service. - Tickless idle as the near-term timer cleanup: split clocksource from
clockevent, convert timeout waiters to absolute deadlines (done), migrate
the scheduler idle path to a CPL0 per-CPU kernel idle thread (done), then
stop the periodic tick only when no runnable work exists. After the
one-SQ-consumer, CPU-isolation authority, nohz telemetry, and housekeeping
placement prerequisites, bounded SQPOLL ring mode and the clockevent/deadline
substrate closed, and bounded non-periodic SQPOLL progress was proven; the
periodic tick is now suppressed for the narrow single-runnable-entity window
and for the ring-coupled
kernelSqpolllease (scheduler-phase-f-auto-nohz-activation,scheduler-phase-f-auto-nohz-sqpoll), with the periodic tick as the fail-closed fallback everywhere else. Timeout-based auto-revoke, generic full-nohz for explicitly budgeted compute leases, and generic SQPOLL nohz for explicitly leased caller-thread rings have since landed. Seedocs/proposals/tickless-realtime-scheduling-proposal.mdanddocs/research/nohz-sqpoll-realtime.md. - SSH Shell Gateway as the production remote CLI successor to plaintext Telnet after host-key, authorized-key, audit, and persistence prerequisites exist
- remote session CapSet clients as the programmatic/UI counterpart to shells:
regular host apps, desktop GUI/Tauri front ends, and server-side webapp
gateways authenticate through the same session/admission path, receive
broker-issued remote capability views, and call granted services over
Cap’n Proto RPC without turning chat, Paperclips, agent tools, or future
command surfaces into shell-only protocols. The first default-run development
endpoint and focused interop harness now prove this shape with
schema-framed Cap’n Proto DTOs; standard
capnp-rpcproxy transport remains future work. Later UI-composition caps let capOS-side services or agents propose bounded session workspace changes without receiving arbitrary browser or desktop authority. - self-served capOS web UI has historical focused proof evidence, but the old
make run-remote-session-self-served-web-uitarget is retired with the qemu-only kernel TCP listener. The replacement proof belongs to the future Phase C Web UI L4 gate.make runforwarding the guest remote-session CapSet endpoint is still not the same as capOS serving the web UI, andmake remote-session-uiremains the host-side trusted development bridge. The blockedremote-session-self-served-web-ui-default-runtask records the future decision and wiring gate if self-served UI should become part of ordinarymake run. - Telnet over TLS as an optional compatibility/service-terminal transport after certificate/TLS, durable identity, and session lifecycle work exists. It should not be a default main access interface ahead of SSH/WebShell.
- decomposed userspace NIC/network-stack milestone after driver authority gates
- native shell agent runner
- WebShellGateway using the same broker-issued shell/agent authority model
Remote shell priority: do not treat Agent Shell or WebShellGateway as the next
default visible milestone before the driver/storage foundation unless the user
explicitly redirects. SSH/WebShell production access is more useful after
session lifecycle, durable account/key material, network listener authority,
and serial/cloud diagnostics have credible proofs. Plaintext Telnet remains a
loopback/local development proof and a simple transport for exercising
TerminalSession; it is not a production cloud access target. Telnet over TLS
may remain as a later optional transport, but SSH and WebShell are the main
production access tracks.
Details:
docs/backlog/smp-phase-c.mddocs/backlog/scheduler-evolution.mddocs/backlog/runtime-network-shell.mddocs/backlog/remote-session-capset-client.mddocs/proposals/smp-proposal.mddocs/proposals/scheduler-evolution-proposal.mddocs/research/future-scheduler-architecture.mddocs/proposals/tickless-realtime-scheduling-proposal.mddocs/proposals/networking-proposal.mddocs/proposals/shell-proposal.mddocs/proposals/remote-session-capset-client-proposal.mddocs/proposals/llm-and-agent-proposal.mddocs/proposals/boot-to-shell-proposal.md
Hardware, Boot, And Storage
Outcome: capOS boots beyond the current ISO/QEMU manifest path, discovers real hardware, supports block devices, and exposes local persistent storage through typed capabilities.
Tracks:
- hybrid BIOS+UEFI raw disk image and
make run-disk - serial diagnostics console for cloud/hardware bring-up
- ACPI/MADT/MCFG discovery
- reusable interrupt and PCI/PCIe infrastructure
- virtio-blk and NVMe block-device paths
- boot binary ISO layout that moves ELF payloads out of the manifest blob
- RAM-backed
Store/Namespace - read-only local filesystem proof
- writable local storage with recovery policy
- installable system: boot from disk with persistent, mutable system configuration composed over the immutable boot manifest (own milestone, sequenced after the writable-local-storage milestone it builds on)
- staged cloud boot: first serial-console boot, then provider block/NIC drivers and network shell access
Details:
docs/backlog/hardware-boot-storage.mddocs/proposals/cloud-deployment-proposal.mddocs/proposals/storage-and-naming-proposal.mddocs/proposals/installable-system-proposal.mddocs/dma-isolation-design.md
User Identity, Sessions, And Policy
Outcome: shell, service, and future web sessions receive narrow capability bundles based on explicit identity, freshness, policy, and audit context.
Implemented base:
- anonymous/operator shell sessions
- password setup/login proof
- broker-issued shell bundles
- redacted auth/session audit records
Remaining themes:
- manifest-seeded local accounts, recovery identities, service identities, and initial role/resource profiles
- disk-backed local account store over capability-native storage
- default per-account, guest, anonymous, external, and service-account resource bundles
- explicit external identity bindings for OIDC/passkey/cloud/certificate principals
- durable verifier/passkey records
- WebAuthn and passkey-only setup path
- broader AuditLog completion
- ABAC context such as auth freshness, session age, source, and claims
- mandatory-policy labels and wrapper caps
- guest and anonymous workload demos
- POSIX profile adapter metadata
- OIDC/OAuth2 integration
Details:
docs/proposals/user-identity-and-policy-proposal.mddocs/backlog/local-users-management.mddocs/proposals/oidc-and-oauth2-proposal.mddocs/proposals/certificates-and-tls-proposal.mddocs/proposals/cryptography-and-key-management-proposal.mddocs/security/trust-boundaries.md
Security And Verification
Outcome: trust boundaries fail closed, proof gates stay practical, and trusted build inputs remain review-visible.
Implemented base:
- host tests for pure logic
- Loom ring model (a bounded concurrency model of the ring protocol, not the
shipped
kernel/src/cap/ring.rs) - Miri/proptest/bounded Kani model-checking paths
- dependency policy checks
- pinned Limine and Cap’n Proto tooling
- DMA isolation design gate
- panic-surface inventory
Remaining themes:
- Stage-6 trust-boundary refresh
- untrusted-service hardening and quota/exhaustion smokes
- Kani harness bounds refresh when new proof obligations are concrete
- DMA assurance model operationalization: turn the v0 TLA+/Alloy skeletons into
checked run targets (
make model-dma-tla/model-dma-alloy/kani-dma-authority+ aDeferredCompletionQueueLoom) reconciled with landed DMA code and wired to CI - Scheduler & IRQ assurance models: first formal coverage for the densest
unmodeled race surface – nohz activation/rollback (TLA+ + Loom), the LAPIC
one-shot timer fix (Kani + TLA+),
CpuIsolationLeaseauthority (Alloy + TLA+), and the MSI-X waiter determinism ordering (TLA+)
Details:
docs/backlog/security-verification.mdREVIEW.mddocs/tasks/README.mddocs/proposals/security-and-verification-proposal.mddocs/security/verification-workflow.mddocs/trusted-build-inputs.md
Shared-Service Demos
Outcome: multi-process demos prove resident services, shell-spawned clients, session-bound invocation context, shared harnesses, and eventually network-transparent federation.
Implemented:
- First Chat MVP
- Local MUD/adventure prototype
- NPC-as-process fleet
- shared service harness extraction
- session-bound chat/adventure state keyed by live caller-session metadata
Remaining themes:
- per-principal chat state and audit
- Aurelian Frontier game-depth work after the first deterministic mission slice
- native command-surface replacement for prototype
StdIO - federated chat after network transparency
Details:
docs/backlog/shared-service-demos.mddocs/backlog/aurelian-frontier.mddocs/demos/adventure.mddocs/proposals/aurelian-frontier-proposal.mddocs/proposals/interactive-command-surface-proposal.md
aarch64 Support
Outcome: port the architecture layer after x86_64 hardware abstraction stabilizes.
Shared code expected to carry over:
- capability model and schema
- ring structs and transport contracts
- userspace runtime model
- process/capability abstractions above
arch/
Architecture-specific work:
- EL0/EL1 syscall entry/exit
- GICv3 interrupts
- ARM generic timer
- PL011 UART
- TTBR0/TTBR1 MMU setup
- TPIDR_EL1 per-CPU data
kernel/linker-aarch64.ld
Future Tracks
These are not selected unless docs/tasks/state.toml or explicit user direction
pulls them into active selected-milestone scope. Add root task records and
backlog/proposal decomposition only when one of these tracks becomes the
selected visible outcome:
- regular Rust runtime support
- C
libcapos - Go
GOOS=capos - Python runtime adapters
- Lua scripting (Phase 0 capability-aware Lua-subset interpreter
shipped in
demos/lua-smoke/; PUC Lua dialect compatibility remains future, awaiting C/libcapos) - POSIX compatibility adapters
- WASI runtime
- C++ experiments
- GPU/CUDA capability integration
- system monitoring
- network transparency
- process persistence/checkpoint-restore
- live upgrade
- cloud metadata
- volume encryption
- formal MAC/MIC modeling
- browser/WASM support
- robotics realtime control
- trusted time and clock authority
- crash recovery and supervision
- debug and trace authority
Use proposal files under docs/proposals/ and research notes under
docs/research/ before promoting any future track into docs/tasks/README.md.
Lua scripting should arrive as an ordinary capability-scoped userspace runner,
not as kernel scripting or ambient shell authority.
seL4 HAMR (model-based high-assurance engineering)
Evaluated HAMR (High Assurance Modeling and Rapid engineering): AADL component
models, Slang/GUMBO contracts, and seL4/CAmkES backend generation, and how that
model-to-capability-system pipeline compares with capOS’s “the Cap’n Proto
schema is the contract” model, capability partitioning, and the schema-as-ABI
story. Findings: docs/research/sel4-hamr.md (reference talk:
https://youtu.be/gP1klZJi04U).
Crate publication
Publish capOS’s reusable no_std crates – capos-abi, capos-lib,
capos-config, and the capos/capos-rt runtime/facade – to crates.io with
stable versioning, rendered docs, and license/metadata, so the ELF parser,
capability table, ring/SQE wire validation, manifest/CUE loader, and typed
clients can be reused and cited independently of the kernel tree. The
publish-set decision is pinned in docs/backlog/capos-sdk-dual-transport.md:
publish capos-abi, the capos-capnp-build build helper, capos-config, and
capos-lib first; publish capos-rt and the bare capos facade with the
transport seam; ship the libcapos/libcapos-posix C substrate as release
artifacts only (not crates.io – their consumers link .a archives, decision
2026-06-02 16:10 UTC); the publish-set MSRV
is the stable Rust 1.88.0 proven by the slice-2 dry-run (the Rust 2024 floor
1.85.0 cannot build capos-config’s let chains); and keep generated
Cap’n Proto bindings inside capos-config rather than publishing a separate
bindings crate. The versioning policy (pre-1.0 SemVer, schema/ABI changes as
breaking bumps, lockstep across the set) and the repeatable
make sdk-publish-dry-run gate are recorded in
docs/backlog/capos-sdk-dual-transport.md.
This track now also covers the front-door capos SDK crate: one published
crate whose typed capability clients run unchanged against two transports – the
in-process capability ring (an application running inside capOS) and a remote
connection (a host-side RPC client) – behind a Transport seam. The bare
capos name is the facade; capos-rt provides the ring transport and the
remote feature provides the host transport. The seam and facade have landed:
capos-rt defines the Transport trait and the in-system RingTransport, the
typed clients are transport-generic, and the standalone capos facade crate
re-exports the runtime, clients, and entry_point! macro behind the default
ring feature (proved in-system by make run-spawn). The remote transport
backend remains ahead. Crates.io remains a flat, first-come namespace; the
exact crate names were verified free before the 2026-06-05 upload and are now
claimed by the capOS 0.1.0 release, while the adjacent capos-bitstruct
crate from an unrelated cap-os/rust-tools repository shows the namespace
contention risk. The near-term reservation work is closed: existing reusable
layers were published with real content, the bare capos facade was reserved
with transport-seam content, and the seam landed early. The repository-wide
license file required by the public-release boundary is recorded (LICENSE-APACHE /
LICENSE-MIT, MIT OR Apache-2.0 on the SDK crates). The first six-crate
0.1.0 publish completed on 2026-06-05 after the final crates.io name
re-check, the custom-target SDK gate, and the local Cargo API-token upload. The
capos-config docs.rs accommodation is implemented through the packaged
generated-binding fallback, and the GitHub Actions trusted-publishing workflow
is present for subsequent releases from refs/heads/main after a current
explicit user release instruction and crates.io trusted publishers are
configured for the six crates. Decomposition and publication ordering are in
docs/backlog/capos-sdk-dual-transport.md; the transitional host-backend
remote transport (slice 4a) can ship now, while the live-proxy capnp-rpc
upgrade (slice 4b) remains gated on the remote-session async-runtime rewrite.
Observable Milestones
Completed visible milestones:
- 2026-04-22 16:35 UTC, commit
d4016ab: Unprivileged Stranger - 2026-04-23 08:41 UTC, commit
f554e88: Native Cap Shell - 2026-04-23 13:39 UTC, commit
e5adafb: Boot to Shell - 2026-04-23 16:15 UTC, commit
7f19af2: Revocable Read - 2026-04-23 16:34 UTC, commit
8b66c13: split UART shell session - 2026-04-23 22:09 UTC, commit
d43b691: Verified Core - 2026-04-24 00:13 UTC, commit
2cd85a8: First Chat MVP - 2026-04-24 01:40 UTC, commit
add7f9b: Local MUD/adventure prototype - 2026-04-24 03:13 UTC, commit
da5f5e9: Ring as Black Box - 2026-04-24 15:37 UTC, commit
b56a5c1: First Packet - 2026-04-24 16:47 UTC, commit
a4f1722: First HTTP - 2026-04-25 05:36 UTC, commit
0b79054: SMP Phase A: per-CPU data on BSP - 2026-04-25 06:59 UTC, commit
d3c30c6: SMP Phase B: APs running - 2026-04-25 11:31 UTC, commit
d88bca7: First AP Scheduler - 2026-04-25 20:25 UTC, commit
2834bfc: Telnet Shell Demo - 2026-04-30 09:45 UTC, commit
3fb89923: Multi-Process SMP Concurrency - 2026-05-01 14:23 UTC, commit
fb102828: Remote Session CapSet Web UI Proof - 2026-05-11 14:38 UTC, branch commit
28db3277: Self-Served capOS Remote Session Web UI Proof. The now-retiredmake run-remote-session-self-served-web-uitarget booted the focused manifest, loaded browser assets from the capOSremote-session-web-uiservice over its scoped listener, denied no-cookie browser commands, called backend-heldSystemInfo, logged out, and then attempted the retained backend-heldSystemInfocapability to prove expired-session stale failure. The hostmake remote-session-uibridge remains a development tool. - 2026-05-13 11:05 UTC, branch commit
5f5028e7: WASI bounded environment grant smoke.make run-wasi-envboots the focused wasm-host manifest, reads the boundedinitConfig.init.wasiEnvtext grant, reflects it through Preview 1environ_get/environ_sizes_get, and the Rustwasm32-wasip1payload prints[wasi-env] CAPOS_WASI_ENV_SENTINEL=capos-wasi-env-sentinel. MissingwasiEnvremains the empty-environment behavior. - 2026-05-01 16:13 UTC, commit
5198e255: Remote Session Adventure Launch - Cloudboot run
1778230874-715a(2026-05-08 09:06 UTC), source commit3951e275(2026-05-08 08:50 UTC): GCP Imported-Image Serial Boot.make cloudboot-testbooted the GCE imported disk image to thecapos kernel startingserial landmark on a temporary no-public-IP, no-service-accounte2-smallinstance, captured serial output, and tore down the temporary cloud resources. This is a boot-path portability milestone, not provider NIC/storage driver readiness. - GCP-first usable-instance provider rollup, closed
2026-06-07 05:26 UTCby commitb5fdcc3eandcloud-usable-instance-provider-nic-storage: serial-console operator access run1779868872-2424(source commitc92c8bc1), live legacy virtio-net raw-frameprovider-nic-boundrun1780412056-e1cb(source commit1fb65683), live NVMe Persistent Disk brokeredREADrun1780806087-bf69(source commit28518165), and separate live gVNIC raw-frame / typed-Nic portability runs1780794927-1aa9(source commit3ef8997a) and1780796615-decc(source commit2a0857d). This closes the selected GCP provider NIC/storage bar while leaving public L4 ingress, SSH/WebShell productization, AWS/Azure providers, broader storage, high-throughput/multiqueue NIC, and direct-remapping DMA for future tracks. - Device Driver Foundation (DDF) bounded-authority proof series,
2026-05-08through2026-05-23: read-only hardware-audit snapshots (make run-hardware-audit*), boundedDMAPool/DMABufferresult caps with parent-first release and proof-slot reuse (make run-dmapool-grant),DeviceMmiobrokered read/write andInterruptwait/ack/mask/unmask grant proofs (make run-devicemmio-grant,make run-interrupt-grant,make run-hardware-grant-cycle), a device-manager-ownedDMAPoolbudget ledger, and the userspace provider-consumer TX/RX path (make run-ddf-provider-consumer): bounded selected-route descriptor/avail/ doorbell/used-ring/CQ handoffs, full selected TX queue-depth CQ ownership, bounded RX synthetic-token CQ identity, selected TX/RX MSI-X/LAPIC wait/ack/EOI, selected-route reset/reassignment, and teardown/stale-handle blocking. These are bounded-proof milestones, not live hardware RX used-ring ownership, full virtio-net ownership, direct DMA/IOMMU, cloud NIC/storage readiness, or production userspace driver readiness. The provider virtio-net closeout slice is commitc86374f8(2026-05-23 16:51 UTC); the executable decomposition and remaining gates live indocs/backlog/hardware-boot-storage.mdand the DDF task files underdocs/tasks/. Visible demo follow-ups: - Adventure/shared-service follow-ups after the Local MUD prototype:
73d83aa,da51dc7,353c8bc,e20cf07,948c96e, andca6300c. These refine discoverability, room context, expedition map, relic custody, explicit resume, and chat-only named actors; detailed reports live in commit history. - 2026-04-26 04:10 UTC, commit
5480304: Scoped Telnet Gateway Authority.telnet-gatewaynow uses manifest-forwarded scoped listener authority plusRestrictedShellLauncher; detailed verification history lives in commit history. - 2026-04-26 23:12 EEST, commit
4304b0e: Default run Telnet wiring. The default manifest startstelnet-gateway, andmake runattaches host-local127.0.0.1:2323 -> guest :23forwarding. - 2026-05-01 16:54 UTC, branch commit
367117be: Default run Telnet wiring retired. The default manifest no longer startstelnet-gateway, andmake runnow forwards only the remote-session CapSet endpoint. The plaintext Telnet research fixture was later retired with the qemu-only kernel TCP listener;make run-telnetnow exits before QEMU with a retirement diagnostic. - 2026-05-02 02:24 UTC, branch commit
84f5ac61: Remote Session Gate 3 auth-denial proof. Focused backend/account-store coverage rejects inactive accounts, unknown principals, and missing or retired resource profiles before remote-client bundle authority exists. The live CLI/QEMU proof now drives bad password proof, unknown account, wrong requested profile, and anonymous profile mismatch denials before any session, CapSet, or service-launch activity; denied re-login clears prior gateway/client/UI session state. - 2026-05-02 06:23 UTC, branch commit
482e5e07: Remote Session Adventure mutable control proof. The remote Adventure fixture and trusted web bridge now call boundedAdventure.go(direction)through the same session-bound worker/client path as status, look, and inventory, then verify movement text, changed room state, redacted transcripts, and visible-button UI automation without exposing raw capOS authority. - 2026-04-27 00:02 EEST, commit
7a155f4: Telnet IAC handoff fix and repeat-connect support. Telnet handoff no longer consumes raw socket input beforeintoTerminalSession, repeated host connections succeed, and the harness drives two consecutive sessions. - 2026-04-28 17:46 UTC, commit
d09243d: Aurelian Phase 9 competency gates. The adventure proof now has host-testable rank/star/circle policy, status output for rank marks and standing, signifer skill gates, first-mission spell gates, and QEMU assertions for rank denial plus debrief reward. - 2026-04-28 18:12 UTC, commit
47dbfc5: Aurelian Phase 10 market logistics. Adventure now has typed quote/buy/sell/trade/repair calls, bounded market roles, a deterministic Maro route purchase, and QEMU assertions for market quote, successful exchange, and clean-custody trade refusal. - 2026-04-28 19:36 UTC, commit
e204454: Aurelian Phase 11a calendar foundation. Generated content now carries fixed-smoke season/day/weather and hazard state plus bounded seasonal resources, Adventure status prints that state, and the real scenario process asserts it throughAdventure.status. - 2026-04-30 08:56 UTC, commit
4045576: Aurelian Phase 11a calendar event metadata. Generated content now carries a fixed-smoke active festival and later military event with pure Rust validation; Adventure status prints the active event metadata, and the real scenario process asserts it throughAdventure.status. Actor movement, shop mutation, witness blocking, route mutation, debrief branching, quests, gifts, and affection remain future work. - 2026-04-30 13:09 UTC, commit
64933131: Aurelian Phase 11a seasonal shop-stock purchase.adventure-contentowns the bounded active-stock, standing-gate, remaining-stock, and depletion decision for seasonal shop purchases. The quartermasterfield-rationsbuy path now spends audited Aurelian standing, records service-owned per-expedition seasonal stock usage, adds the ration to inventory, and the real scenario process asserts both the pre-debrief refusal and post-debrief purchase throughAdventure.buy. Broader seasonal economy mutation, persistence, seeded normal-play calendars, and automatic world advancement remain future work. - 2026-04-28 20:08 UTC, commit
48c62db: Aurelian Phase 11b regional foundation. Generated content now carries settlement, outpost, and route metadata with validation and stable ordering; Adventure status prints a regional summary, and the real scenario process asserts it throughAdventure.status. - 2026-04-30 12:07 UTC, commit
6afd87aa: Aurelian Phase 11b regional market transaction proof.adventure-contentowns bounded reserve, commit, cancel/release, stale-version rejection, idempotent replay from ordered receipt facts, and terminal-receipt-capacity checks for one generated order-book match at a time.adventure-serverkeeps transaction state inside each expeditionPlayerState, so fresh and resumed expeditions do not share market idempotency history. The real scenario process asserts regional quote/reserve/retry/commit/stale/release/cancel flows through existingAdventure.quote,Adventure.buy, andAdventure.sellcalls. - 2026-04-30 13:39 UTC, commit
6605ee6a: Aurelian Phase 11b regional market delivery proof. Fresh committedfield-rationreceipt facts now produce a bounded player-local supply delivery into expedition inventory, while commit replay and errors do not duplicate items. The real scenario process asserts delivery of the committed quantity and no replay duplication through existingAdventure.buyandAdventure.inventorycalls. NPC stores, outpost stock, currency, durable ledgers, profile balances, and crash recovery remain future work. - 2026-04-30 14:15 UTC, commit
b1c98eb1: Aurelian ordinary inventory capacity proof.adventure-contentnow owns a deterministic admission helper for bounded ordinary inventory, andadventure-serverroutes room takes, seasonal harvests, quartermaster field-ration purchases, and regional market delivery through one helper. Regional committed delivery fails closed when the full quantity cannot fit, avoids partial duplication, and remains replayable after items are dropped. - 2026-04-30 14:51 UTC, commit
f06aa732: Aurelian capacity replay proof. The capacity-denial path now uses authored/generated resources only, keeps transfer on the same ordinary inventory admission helper, exposes bounded repair-material collection at resource sites, and proves through the real scenario process that held regional delivery mutates no partial items and later delivers the full quantity afterbuy commit-field-ration from regional-marketis replayed. - 2026-04-30 15:14 UTC, commit
fd432147: Aurelian regional market currency debit proof. Fresh committed regionalfield-rationbuys now spend two player-local Aurelian chits exactly once, expose the balance in inventory, reject insufficient balances before transaction mutation, and keep held item delivery replay independent from debit replay. NPC stores, outpost stock, durable currency ledgers, profile balances, fees, expiry advancement, and crash recovery remain future work. - 2026-04-30 15:53 UTC, commit
7a9a4af5: Aurelian regional outpost stock proof. Fresh committed regionalfield-rationbuys now decrement sellerash_farmstock from six to two exactly once, expose that stock in status, reject insufficient seller stock before mutation, and keep committed replay plus held item delivery replay from decrementing again. NPC stores, broader outpost inventories, durable stock ledgers, profile balances, fees, expiry advancement, and crash recovery remain future work. - 2026-04-30 16:23 UTC, commit
00b18598: Aurelian regional market fee accrual proof. Fresh committed regionalfield-rationbuys now accrue the generated buy and sell order fees into a service-owned regional-market pool exactly once, expose that pool in status, ignore release/no-cross and non-ration facts, and keep committed replay plus held item delivery replay from accruing again. NPC stores, broader outpost inventories, durable stock and currency ledgers, profile balances, durable fee ledgers, expiry advancement, and crash recovery remain future work. - 2026-04-30 16:57 UTC, commit
bdcc23ed: Aurelian regional seller proceeds proof. Fresh committed regionalfield-rationbuys now credit the service-ownedash_farmproceeds pool two chits exactly once, expose that pool in status, ignore release/no-cross, stale, mismatched, and non-ration facts, and keep committed replay plus held item delivery replay from crediting proceeds again. NPC stores, broader outpost inventories, durable stock and currency ledgers, durable seller-proceeds ledgers, profile balances, durable fee ledgers, expiry advancement, and crash recovery remain future work. - 2026-04-30 17:41 UTC, commit
29c065a9: Aurelian regional market order expiry proof.adventure-contentnow has pure order activity and day-aware deterministic matching;adventure-serveruses the fixed smoke day for live regional-market reserve and quote, and the scenario process proves a day-73 expired field-ration reserve releases without status, inventory, currency, outpost stock, fee, seller-proceeds, or delivery mutation. Durable calendar advancement, durable order books, profile ledgers, durable fee ledgers, and crash recovery remain future work. - 2026-04-30 18:40 UTC, commit
205fd6a0: Aurelian regional market fee withdrawal proof.adventure-contentnow has a pure resolver for bounded regional-market fee withdrawal from the current pool plus applied withdrawal ids;adventure-serverowns the live fee pool, applied withdrawal ids, and service treasury balance; and the scenario process provessell withdraw-fees to regional-marketmoves the two accrued fee chits exactly once without mutating inventory, currency, outpost stock, seller proceeds, or delivery state. - 2026-04-30 19:43 UTC, commit
a547db3d: Aurelian regional market receipt snapshot proof.adventure-contentreconstructsRegionalMarketTransactionStatefrom ordered receipt facts with bounded validation, andadventure-serverexposesbuy receipt-snapshot from regional-marketto prove the old field-ration commit still replays after reconstruction without mutating live market, inventory, fee, treasury, seller-proceeds, stock, or delivery state. Durable restart loading remains future work. - 2026-04-30 20:07 UTC, commit
4b44b32: Aurelian regional market settlement snapshot-view proof.adventure-contentchecks the settlement side-effect snapshot view from applied delivery, currency debit, outpost stock decrement, fee accrual, fee withdrawal, and seller proceeds ids plus the current balances, rejects over-capacity id snapshots, and proves the already committed field-ration fact plus fee withdrawal replay as already applied.adventure-serverexposesbuy settlement-snapshot from regional-market, and the real scenario process proves the command leaves live status and inventory unchanged. Durable restart loading remains future work. - 2026-04-28 21:08 UTC, commit
0b7db05: Aurelian Phase 11c construction foundation. Generated content now carries material, facility, blueprint, artifact, and enchantment-slot metadata with pure Rust validation and deterministic property derivation; Adventure status prints a construction summary, and the real scenario process asserts it throughAdventure.status. Service-mediated construction jobs are tracked by the later Phase 11c construction-job proof; escrow, durable stock ledgers, output/currency inventory, and full artifact crafting gameplay remain future work. - 2026-04-30 13:01 UTC, commit
9f8cfb6c: Aurelian Phase 11c construction-job proof.adventure-contentowns bounded reserve/start, completion, cancel/release, stale-version rejection, idempotent replay, service-owned material hold/release facts, older terminal replay, and fact capacity checks on top of existing construction metadata.adventure-serverowns per-player construction material stock and applies holds/restores only for new successfulrepairoutcomes; completion consumes the held materials, while replay and denial paths do not mutate stock. The real scenario process asserts denial, reserve/retry, open-reserve conflict, complete/replay, stale rejection, release/replay, and reserve-after-release through existingAdventure.repaircalls. Durable persistence, broad stock ledgers, outpost replenishment, output/currency inventory, job-time advancement, and general crafting remain future work. - 2026-04-30 22:46 UTC, commit
fd57de6b: the Aurelian construction receipt snapshot follow-on is scoped to pure Rust construction receipt snapshot semantics plus a size-constrained QEMU no-mutation probe. Pureadventure-contenttests reconstruct a separate construction job state from ordered facts and reject malformed, over-capacity, and non-closed snapshot shapes. The QEMU scenario drivesrepair receipt-snapshot with field-engineeronly to confirm status, inventory, live construction state, and material stock are not mutated. The runtime command is not a proof that receipts replay into the live service, and this is not durable restart loading or a general construction persistence layer. - 2026-04-28 21:36 UTC, commit
f53d044: Aurelian Phase 11d agent NPC budget foundation. Generated content now carries disabled-by-default optional NPC agent budget metadata with model profiles, per-session/day input/output token limits, tool-call limits, cooldown, fatigue, sleep, refusal, and audit visibility. Pure Rust fake-model tests cover spending, refusals, disabled transcript stability, bounded output, and no authority mutation from model text; Adventure status prints an aggregate budget line asserted throughAdventure.status. Live LLM integration, hosted-agent execution, durable memory, autonomous NPC actions, and authority mutation from model output remain future work. - 2026-04-30 08:22 UTC, commit
c6d887: Aurelian Phase 11d fake-agent purpose expansion. Deterministic fake-agent responses now cover personal routines, nonbinding shop negotiation flavor, and festival reactions as dialogue/proposed-action data only. Pure Rust tests cover quota spending, quota refusal, bounded lines, and no authority mutation; Adventure status prints the supported purpose count and the real scenario process asserts it throughAdventure.status. - 2026-04-28 22:22 UTC, commit
335a9ee: Aurelian Phase 12 party foundation. Adventure now has typed local party create/invite/accept/leave/delegate calls andassist, keyed by service-created local player labels derived from live caller-session keys. The server uses the unit-testedadventure-contentparty transition state for invite, accept, scoped delegation, assist, and leave cleanup; the scenario process asserts the one-client cap surface and party status line. Two-client QEMU proof, transfer escrow, duel/spar/contest authority, and cross-device multiplayer remain future work. - 2026-04-29 06:43 UTC, commit
ac49375: Aurelian Phase 12 physical-item transfer foundation. Adventure adds typedtransferfor same-party service-local player labels, with ordinary inventory mutation kept atomic inside the existing service and backed by pure Rust transfer tests. The scenario process asserts one-client refusal paths without faking a second live session. Currency escrow, broad market/trade coordination, and successful two-client QEMU transfer proof remain future work. - 2026-04-29 18:07 UTC, commit
f4a7fdb: Aurelian authority-combat verb foundation. Adventure adds the boundedchallenge-authorityskill andchallenge authority <target>text alias for the ward-wraith proof slice: acceptedward-writattacks hostile ward authority instead of hp, records success-only evidence/effects, and QEMU coverage exercises wrong-target, missing-authority, success, and shell-alias paths. Broader authority-combat verbs, hostile authority enemy variants, writ affixes, and rank/base reach unlocks remain future work. - Merged on main at commit
6678d40(2026-04-30 03:55 UTC): Paperclips Terminal Demo follow-up. The default manifest advertises the clean-roompaperclipsterminal game, andsystem-paperclips.cueplusmake run-paperclipsprovide the focused QEMU proof for one-at-a-time manual production, representative refusal output, explicit sales, repeatable marketing, autoclipper unlock, real-time automation, generated Cap’n Proto content loading, scaled business-phase production,precision-rollers,design-search,forecast-engine,survey-drones, and the visible== autonomous phase ==transition. The demo remains outside the current SMP process scaling milestone because it exercises a standaloneStdIOplusTimerterminal process rather than SMP process-count or scheduler behavior. - Task branch commit
88536a9e(2026-04-30 17:38 UTC): Paperclips client/server showcase first slice. The focused manifest now boots Paperclips server services plus a terminal client; the server owns generated content, game state, regular timer cadence, unlock checks, game-rule mutation, and proof-command gating, while the client receives explicitStdIOplus aPaperclipsGameendpoint. - Task branch commit
532207c1(2026-04-30 20:54 UTC): Paperclips structured command-list slice. The server exposes current command specs for terminalhelpwithout changing the raw text command execution path. Normal and proof sessions use separate server endpoints, preserving proof-onlyrun <ms>andstatus --jsonauthority. - Task branch commit
e9ae4e97(2026-04-30 22:02 UTC): Paperclips structured plain-status snapshot slice. The server exposesPaperclipsStatusSnapshotfields for terminal-rendered plainstatus, whilestatus --jsonremains proof-only and server-gated. - Task branch commit
32462e9f(2026-04-30 22:32 UTC): Paperclips structured project-list slice. The server exposes unlocked project entries for terminal-rendered plainprojects, whileproject <id>remains raw text execution against server-owned mutable state. Remaining Paperclips showcase work includes broader structured state/events, command facets, capability transfer/revocation ergonomics, and the later web-shell client path. - Commit
5ef16c3(2026-04-30 04:17 UTC): Paperclips autonomous scaling follow-up. The CUE-authored generated content now owns millisecond drone matter-conversion, factory production, probe harvest, and probe replication caps; host tests cover the bounded transitions and completion gating. The focused QEMU proof continues after== autonomous phase ==throughmaterial-harvestersandfoundry-lines, then asserts lower local matter, increased autonomous production, and clean process exit. - Commit
65f9d2c(2026-04-30 07:36 UTC): Paperclips cosmic/completion transcript follow-up. The focused QEMU proof now continues throughmesh-coordination,seed-probes,== cosmic phase ==, a bounded probe interval with visible replication, cosmic-matter conversion, and clip production, thenfinal-conversionand== complete phase ==. That proof used compact clean-room values for the cosmic matter grant and terminal conversion clip cost so the run remained representative rather than an exhaustive full playthrough. - Commit
52d30d2b(2026-04-30 12:00 UTC): Paperclips completion rebalance. The late-game matter and final conversion costs now prevent normal play from reaching== complete phase ==within one real-time hour. The focused QEMU proof stops at the cosmic production milestone withfinal-conversionstill locked instead of scripting a compact full win. - Commit
9262938b(2026-04-30 12:26 UTC): Paperclips machine-readable status follow-up. The terminal demo now supportsstatus --jsonas a stable compact state snapshot, and the focused QEMU proof asserts that late-game JSON line after the cosmic milestone while preserving the human transcript checks. - Commit
119acaad(2026-04-30 12:53 UTC): Paperclips review-fix follow-up. Active schema, CUE content, Rust rules, generated-content guardrails, and focused smoke assertions now use clean-room Strategy internals. Purchase parsing keeps omitted counts as one but rejects explicit zero counts without mutating game state.
Recently completed visible milestone:
- Device Driver Foundation: the selected milestone is complete by the
production-authority closeout task
ddf-production-authority-closeoutat commitef8d98c2(2026-06-07 08:15 UTC; task completion recorded2026-06-07 08:23 UTC). The DDF closeout records the landedDeviceMmio/DMAPool/Interruptlifecycle status, the provider-driver local authority evidence, hardware-audit consumption for abort-held DMA mapping records, and the runtime fail-closed DMA backend baseline. The related GCP-first usable-instance rollupcloud-usable-instance-provider-nic-storage(2026-06-07 05:26 UTC) records live operator serial access, selected raw-frame NIC/storage evidence, and gVNIC portability, without claiming public L4 ingress, AWS/Azure support, direct-remapping production hardware, device-autonomous MSI-X delivery, full userspace smoltcp/L4 readiness, or high-throughput/multiqueue NIC readiness. - POSIX Adapter v0 – File/Directory fd closeout: commit
f97d9833(2026-05-23 06:23 UTC) closes the P1.4 file/directory fd surface over the existing RAM-backed rootDirectorycap.libcapos-posixnow exposes functionalopen,read,write,close,lseek,opendir,readdir, andclosedirfor the v0 Directory-backed path, withreaddirbacked by a lazyDirectory.listsnapshot andlseekbacked by the fd-table file position plusFile.statforSEEK_END.make run-posix-fileboots a C process that creates"/hostname", writes and seeks through it, reads the full payload and tail, lists the root directory to find the file, proves relative paths still fail closed, exits 0, and halts QEMU. - POSIX Adapter v0 – Identity stubs: commit
1a8a9896(2026-05-23 06:51 UTC) closes the P1.4 identity-stub surface.libcapos-posixnow exposesgetpid,getuid, andgetgidfrom the existing unistd-style header;getpidreturns the stable capos-rt bootstrap pid for the current process, whilegetuidandgetgidreturn the single-identity uid/gid0.make run-posix-identityboots a C process that prints its identity, fork/execs the same binary through the recording shim, proves the child observes a distinct pid, exits both processes cleanly, and halts QEMU. The latermake run-posix-printfproof closes the printf/string subset with live formatted output, string/mem, numeric conversion, and ctype markers. Commit90e64011(2026-05-23 08:11 UTC) closes the signal/time surface:make run-posix-signal-timeproves Timer-backed time/sleep observations plus fail-closedkill/raisesignal-delivery stubs. Remaining dash-port gates are dash vendoring/patching, the multi-translation-unit C build, andrun-posix-shell-smoke. - POSIX Adapter v0 – Pipe + fork-for-exec plus direct posix_spawn Smoke: POSIX adapter
Phase P1.3 first closed at commit
ceaf5475(2026-05-07 10:04 UTC) under an in-process x86_64 setjmp/longjmp recording-shim contract. A subsequent fix slice on top – spanning commits44838ad7(2026-05-07 11:07 UTC) through7c08501c(2026-05-07 14:24 UTC) and integrated into mainline-tracking history via merge commitb8c7fb43(2026-05-07 18:16 UTC) – replaced setjmp/longjmp with the return-the-pid contract because the longjmp re-entered fork()’s already-deallocated stack frame (undefined behaviour). An iter-15..iter-22 SMP-correctness hardening cycle followed, extending the fix slice through commit05b52873(2026-05-07 21:07 UTC); each iteration closed a distinct kernel pipe race surface (transport-error CQE on saturated waiter restore at iter-15, deferred-error retry queue + nested-fork reset at iter-16, write-overflow queue preserving partial-write CQE at iter-17, buffer-aware EOF + combined-cap waiters + child-order fd replay + EBADF on Moved at iter-18, close+write race + fd-recording precheck + Moved self-dup2 at iter-19, same-end waiter completion on close at iter-20, close_side publishing under the buffer lock at iter-21, and the matching in-lock close re-check in handle_write at iter-22).make run-posix-pipe-smokeboots the focused manifest, links thedemos/posix-pipe-shim/main.cparent anddemos/posix-pipe-child/main.cchild againstlibcapos.a+libcapos_posix.a, drivespipe(); pid_t child = fork(); if (child == 0) { dup2(); close(); child = execve(...); } close(); read(); waitpid(child);end to end through the kernelPipecapability and the recording-shim ProcessSpawner Move-grant path, and prints[posix-pipe] read 14 bytes: hello via pipefrom the parent. The parent and child both exit 0 cleanly and the QEMU scheduler halts. fork() returns 0 unconditionally; dup2/close between fork and execve record into a TLS window without mutating the parent fd table; execve() drains the recording and returns the synthetic child pid as its own return value (a deliberate v0 deviation from POSIX). The direct publicposix_spawn()successor proof landed at commitb8fb3131(2026-05-13 10:15 UTC):libcapos-posixexposesposix_spawn()plusposix_spawn_file_actions_init/destroy/adddup2/addclose, andmake run-posix-spawn-smokecreates a pipe, uses file actions to move the existingposix-pipe-childstdout onto the pipe, reads[posix-spawn] read 14 bytes: hello via pipe, waitpid()s the child, and halts after both processes exit 0.argvandenvpare accepted for source compatibility but remain undelivered until LaunchParameters / environment support lands. The Console-backed stdio successor proof landed at commitaa6a56d7(2026-05-13 11:03 UTC):libcapos-posixmaps POSIX fd 1/2 to the granted Console cap when nostdio_<N>Pipe grant already occupies the slot, keeps fd 0 closed without stdin backing, andmake run-posix-stdio-smokeprints distinct stdout/stderr markers through POSIXwritebefore proving the no-stdin refusal path. - WASI Host Adapter Phase W.4 –
random_getproduction wiring: Phase W.4 closed at commitb0f6939f(2026-05-07 20:09 UTC); Phase W.3 closed at commitca41ecc1(2026-05-07 18:29 UTC; the W.3 narrative stamps from2026-05-07 18:25 UTCpredate the feat commit by a few minutes); Phase W.2 closed at commit7bfcb1d8(2026-05-07 10:53 UTC) across four sub-slices. The bounded environment grant smoke landed at branch commit5f5028e7(2026-05-13 11:05 UTC). Sandboxedwasm32-wasiis now a booted language path on capOS; the W.2 slice delivered the first WASI-hosted, sandboxed portable-payload path (native C boots already existed via the libcapos C-substratemake run-c-helloand the historical POSIX-adapter DNS resolver); W.3 added the per-instance argv text grant; W.4 wires Preview 1random_getthrough the kernelEntropySourcecap; the 2026-05-13 follow-up adds the boundedinitConfig.init.wasiEnvtext grant as the v0 environment source.make run-wasi-hello-rust,make run-wasi-hello-c,make run-wasi-cli-args,make run-wasi-env,make run-wasi-random(granted), andmake run-wasi-random-ungranted(refusal) are the regression, environment-grant, and W.4 gates; the environment smoke proves one granted value reaches a Rustwasm32-wasip1payload through Preview 1environ_get/environ_sizes_get; the random granted variant reads N=64 bytes throughrandom_getand prints[wasi-random] entropy_bytes=64 entropy_bound_ok=true, and the ungranted variant observesERRNO_NOSYS = 52from the closed-fail refusal branch which never enters the kernel. Wall-clock support stays deferred:clock_time_get(CLOCKID_REALTIME)keeps the W.2 sentinelERRNO_NOSYSuntil capOS has a typedWallClock/RealTimeClockcap. The next selectable WASI work is Phase W.5 (Preview 1 filesystem), blocked on the missingNamespace/File/Storecap surface. - POSIX Adapter v0 – DNS Resolver Smoke: POSIX adapter Phase P1.2
Phase B completed at commit
b4f1a400(2026-05-05 21:21 UTC). The now-retiredmake run-posix-dns-smokebooted the focused manifest, linked thedemos/posix-dns-resolver/main.cC binary againstlibcapos.a+ the newlibcapos_posix.a, sent a DNS A query forexample.comthrough the kernelUdpSocketcapability to QEMU slirp’s resolver at 10.0.2.3:53, decoded the answer-section IN/A record, and printed[posix-dns-resolver] resolved example.com -> <ipv4>(e.g.104.20.23.154; the upstream resolver picks the value, the harness grepped loosely). The target now exits before QEMU because the qemu-only kernelUdpSocketowner was removed; rebuild the resolver on the Phase C userspace network stack before using it as validation. Thevendor/dns-c-wahern/snapshot atrel-20160808is in-tree as a structural reference but not yet compiled into the smoke; widening the POSIX surface so dns.c can build whole is follow-on work after P1.3. - In-Process Threading Scalability: completed at commit
136b72de(2026-05-01 14:58 UTC) after the benchmark repair replaced the invalid 1 MiB/spinning-parent four-worker shape with a blocking-parent 16 MiB/64-round shape. Reaffirmed against the then-current single-global-queue scheduler oncapos-bench2026-05-02 21:38 UTC againstmaincommit374f8556with the formal capOS+Linux 5-run pair pinned to physical-core logical CPUs0,1,2,3: capOS work1.883xand total1.787xclear the configured 1-to-2 gates; matching Linux pthread baseline1.988x/1.987xvalidates the shape. The 1-to-4 row became the diagnostic that justified Phase D’s fair-share enqueue policy (capOS1.566x/1.538xvs Linux3.963x/3.858x); Phase D WFQ later manually accepted the recorded 1-to-4 diagnostic with capOS3.088x/2.700xand matching Linux3.974x/3.850x. Four-worker capOS speedup remains evidence of material improvement, not a completed linear-scaling claim. - Multi-Process SMP Concurrency: completed at commit
3fb89923(2026-04-30 09:45 UTC), with repeated KVM-backed process-scale evidence intarget/smp-process-scale/cycle-balanced-default/(1.608x1-to-2 speedup) and ordinaryrun-smoke/run-spawncoverage under-smp 2. - Session-Bound Invocation Context: completed at commit
503abc9(2026-04-30 02:26 UTC), with Gate 4 implementation verification recorded at commitfaeff80(2026-04-29 21:39 UTC). The milestone includes one immutable process session, privacy-preserving endpoint caller metadata, explicit disclosure gating, session-aware transfer scopes, chat migration, terminal/stdio bridge liveness guards, adventure shared-service cleanup, and aligned paper evidence/status text. - Installable System: completed through commit
12b8334a(commit timestamp2026-06-07 18:19 UTC; task closeout2026-06-07 18:20 UTC) for the bounded local/QEMU contract. The milestone includes persistent data-region mount, config-overlay compose/merge fallback, generation/rollback machinery, integrated installable disk packaging, target-disk install, first-boot provision, update/rollback, and structural proposal/body wording reconcile. It preserves the RAM-onlyNamespacecaveat and does not claim secure boot/signing, production release authority, public ingress, AWS/Azure live support, direct-remapping production hardware, full userspace smoltcp/L4 readiness, or full durable account policy.
Active visible milestone:
- GCE Self-Hosted Web UI: serve the remote-session Web UI through the Phase C
userspace network stack, prove the local cloudboot L4 path, and then prove
private GCE reachability before any public endpoint. The selected milestone
now has the userspace smoltcp-backed
TcpListenAuthoritylocal path proved bycloud-prod-userspace-network-stack-smoltcp-local-proofand local DHCP/IPv4 address/default-route/ARP configuration proved bycloud-prod-network-stack-dhcp-ipv4-config-local-proof; the cloudboot authority inventory (remote-session-webui-cloudboot-authority-inventory) is done and records the Web UI service authority boundary for the local L4 proof. The local Web UI L4 proof (cloud-prod-remote-session-web-ui-l4-local-proof) is done: the Phase C userspace network-stack process servesremote-session-web-uion guest port 8080 with the full fixed-name bundle, login, a backend-heldSystemInfocall, logout/stale failure, and the manual viewer undermake run-cloud-prod-remote-session-web-ui-l4. Web UI session hardening (remote-session-web-ui-session-hardening) is done (2026-06-09), and Web UI connection bounds (remote-session-web-ui-connection-bounds) are done (2026-06-09): per-connection request-read/response-send deadlines in the Web UI client with a drip-feed abandon proof on the L4 gate. The narrow legacy kernel socket-path retirement is done; non-qemumanifests now reject kernelnetwork_manager/tcp_listen_authoritygrants and leave those sources as qemu-only fixtures. The broadercloud-prod-phase-c-kernel-smoltcp-virtio-net-removalcleanup is also done: the kernel no longer depends onsmoltcp, qemu-only kernel TCP/UDP socket entry points fail closed, and the remaining virtio-net code is lower-layer QEMU fixture evidence rather than production cloud socket ownership. The localcloud-prod-remote-session-web-ui-l4-local-proofgate consumed the done DHCP/IPv4 task and landed. Legacy GCE virtio-net Web UI serving is done locally (cloud-gce-legacy-virtio-webui-serving-local-proof, 2026-06-11), the public-ingress browser hardening set (public-origin policy, SameSite policy, JSON content-type guard, headers/CSP, forwarded-scheme trust,/healthz, in-guest login hardening) is done on the L4 gate, and the no-spend provider-harness gates (private preflight, private/public evidence validators, ingress plan, teardown engine, provider-command allowlist) are done as stub-fixture evidence.cloud-gce-private-self-hosted-webui-proofremains on hold on missing firewall IAM and per-run billable authorization. Public GCE ingress and TLS remain under the separate on-holdcloud-gce-public-self-hosted-webui-ingress-tlstask and require explicit authorization; the local fixture gates bound that future run but do not authorize exposure.
Paused visible milestone:
- SSH Shell Gateway:
sshreaches the capOS login/native shell flow through an SSH-backedTerminalSessionin QEMU, using host-local forwarding, public-key authentication, denied unsupported SSH features, and the same child shell capability boundary proven by Telnet. This remains planned Stage 7 work, but network-backed shell delegation should wait for durable remote-account/key prerequisites.
Candidate next visible milestones:
- Storage Capability Substrate: add RAM-backed
Store/Namespacefirst, thenBlockDevice, local disk, and a read-only filesystem proof if the block path is ready. - Serial Diagnostics And AWS Serial Boot: extend the current bounded COM1 diagnostics console with richer device dumps and prove the same imported image path on AWS. GCP imported-image serial boot is already recorded.
- Remote Shell Access: SSH, Telnet development access, and basic WebShell over the capability terminal model after session lifecycle, durable key/account, and network prerequisites are credible.
- Cloud follow-ups after the GCP-first provider rollup: public L4 ingress and
SSH/WebShell productization, AWS/Azure provider ports, broader storage
variants, high-throughput/multiqueue NIC readiness, and separate cloud
benchmark reruns. The completed GCP rollup record is
cloud-usable-instance-provider-nic-storage. - Agent Shell and federated chat remain future candidates, not the default next milestones ahead of the driver/storage/cloud bring-up ladder.
Select the next milestone in docs/tasks/state.toml only after the current
selected milestone is achieved and recorded, or when the user explicitly changes
the selected milestone. Update or add task records and linked backlog/proposal
decomposition in the same change when the new milestone needs different
execution context.
Backlog
Detailed task decompositions for work that is not useful in mandatory agent context.
Start from docs/tasks/state.toml for the selected milestone, then use
root-level task records under docs/tasks/ to choose dispatchable work and the
source links in those records to reach the relevant long-form decomposition.
- Scheduler Evolution
- Research And Design Gaps
- Session-Bound Invocation Context
- Stage 6 Capability Semantics
- Runtime, Networking, And Shell
- Network Usability And Post-smoltcp
- POSIX Adapter Dash Port
- Remote Session CapSet Client
- capOS SDK And Dual Transport
- Capability-Infrastructure Cluster
- Go VirtualMemory Contract
- Memory Authority Model
- Hardware, Boot, And Storage
- Certificates / TLS
- Security And Verification
- Local Users, Storage, And Policy
- Shared-Service Demos
- Paperclips Terminal Demo
- Aurelian Frontier
- Run Targets, Init Mandate, And Default-Run Integration
- Full-Scope Review 2026-06-09
Archived
Retained for historical context only – do not select work from these.
- SMP Phase C (archived; milestones complete, residual full-SMP in Scheduler Evolution Phase F.5)
- Service Object Identity Migration (archived; superseded by Session-Bound Invocation Context)
Runtime, Networking, And Shell Backlog
Detailed decompositions for runtime, networking, shell, agent, and web shell
work. docs/tasks/README.md links here but should not inline these subtasks.
Scheduler/Park Measurement
Pre-thread dispatch instrumentation and compact-vs-generic ParkBench comparison are historical context. In-process threading later closed the first blocked/resume measurement path with QEMU samples for private ParkSpace wait/wake. Future measurement work should be tied to a concrete runtime or SMP change, especially per-thread/per-CPU ring behavior.
In-Process Threading Implementation
Current implementation subgates recorded in the old workplan were all marked
complete, but the parent task still appeared unchecked. Before starting
follow-up work, reconcile this status against code, docs/roadmap.md, and
docs/changelog.md.
Completed subgates retained for context:
- Add
Threadstate with per-thread kernel stack, registers, and FS base. - Change scheduling from process-level to thread-level while preserving process-owned address spaces and cap tables.
- Add
ThreadSpawner/ThreadHandleand basic join/exit smoke. - Implement the first park authority capability and contended-path measurements.
Runtime Ring Reactor Bridge
The current kernel ABI still exposes one process-owned capability ring. A multithreaded runtime therefore needs a compatibility bridge until per-thread kernel rings land.
Ordered gates:
- Add one runtime-owned process-ring CQ drainer.
- Map
user_datacompletions back to ParkSpace-backed per-thread wait records. - Prove sibling threads can issue ordinary calls and receive out-of-order completions without both draining the process CQ.
- Retire the bridge when per-thread capability rings and completion routing
by generation-checked
ThreadRefbecome the kernel ABI.
Telnet Shell Demo
Historical, fully retired track (2026-06-10). The visible outcome below was
delivered and later retired together with the qemu-only kernel TCP listener
and socket owner: make run-telnet and its sibling kernel-socket smokes now
exit with retirement diagnostics, the telnet-gateway /
ssh-gateway-terminal-host / network-client demos and their manifests are
removed, the kernel SocketTerminalSession is deleted, and
TcpSocket.intoTerminalSession fails closed in every dispatch path. Remote
shell access belongs to the in-guest login surface (web UI over the Phase C
userspace network stack) and the future SSH Shell Gateway
(docs/proposals/ssh-shell-proposal.md); a network-backed TerminalSession
must be re-built as a userspace terminal-session service over the userspace
TCP stack if a byte-stream terminal transport is needed again.
Original visible outcome: make run-telnet boots capOS in QEMU with
hostfwd=tcp:127.0.0.1:2323-:23, a telnet-gateway boot service listens on
guest port 23 through the kernel TCP capability surface, and a scripted host
smoke runs telnet 127.0.0.1 2323, logs in through the existing credential
flow, issues one shell command, and sees a clean disconnect.
Ordered gates:
- Add the Phase B TCP interfaces to the canonical shared schema:
NetworkManager,TcpListener, andTcpSocket. Keep this milestone TCP-only;UdpSocket,DeviceMmio,DMAPool, andInterruptare decomposed-NIC / userspace-driver scope. - Replace the synthetic 10 ms smoltcp clock with scheduler-driven polling
on real
TICK_COUNT; the HTTP proof now persists as a retained smoltcp runtime polled from scheduler ticks. Depends onTimer. - Close the delegated endpoint relabeling gap before exposing shell launch
over Telnet. A remote shell user must not be able to type an arbitrary
endpoint identity such as
badge 200and spawn a child that acts as a different chat/adventure participant. Omitted shell syntax now preserves the delegated source identity, and the low-level spawn hardening proof keeps the legacy badge-zero encoding covered. The containment gates indocs/backlog/stage-6-capability-semantics.mdare complete; do not expose Telnet shell launch to any future badge-selection regression. Normal shell help and smoke-help expectations no longer advertise badge syntax. - Implement
NetworkManager,TcpListener, andTcpSocketas kernelCapObjects wrapping the existing smoltcp smoke path. Reuse ring dispatch; do not add syscalls.acceptandrecvmay be blocking calls for this milestone, with bounded result buffers and explicit close behavior. Initial implementation landed in commit7446e04at2026-04-25 14:48 UTC; follow-up review fixes removed timer-path allocation from deferred completion, hardened result-cap cleanup, and addedmake qemu-network-client-harnesscoverage for userspaceNetworkManagerClient,TcpListenerClient.accept, andTcpSocketClientsend/recv/close. - Complete the next endpoint-identity containment transition before unrelated Telnet gateway work: Gate 1 representation plus the minimum trusted mint path landed as the historical service-object routing proof. The selected follow-on is now Session-Bound Invocation Context: keep production remote shell launch blocked until one-session-per-process, privacy-preserving endpoint caller-session metadata, and shared-service migration settle.
- Add the socket-backed terminal handoff needed by the demo.
capos-shellmust still receive a cap namedterminalwithTerminalSessioninterface id, backed by the accepted TCP socket. Do not pass rawTcpSocket,ByteStream, orStdIOas a replacement for the login terminal boundary. Satisfy this either by adding typed service-export / grant support so a userspacetelnet-gatewayendpoint can be presented as aTerminalSession, or by implementing a real kernel socket-backedTerminalSessionCapObject. Implemented asTcpSocket.intoTerminalSession, which consumed a connected socket cap and returned a move-onlyTerminalSessionresult cap, backed by the kernelSocketTerminalSessioncooked-mode line-discipline shim. Retired 2026-06-10: the kernel socket owner behind it was removed by the Phase C userspace network-stack migration, so the shim and its handoff were deleted;TcpSocket.intoTerminalSessionnow fails closed with a retirement error in every dispatch path, and the consumer smokes (qemu-network-client-harness,run-telnet,run-ssh-gateway-terminal-host) exit with retirement diagnostics. - Add a
telnet-gatewaydemo binary andsystem-telnet.cuemanifest. The trusted demo gateway gets bootstrapNetworkManagerandProcessSpawnerauthority, plus pass-throughcreds,sessions,audit, andbrokercaps needed to spawncapos-shellwith the same login/session semantics as the UART shell. The spawned shell must not receive raw network or broad process-spawn authority. - Add
make run-telnetand a scriptedqemu-telnet-harnesshost smoke that drives the full login/command/exit sequence and requires a proof line. - Document in
docs/proposals/networking-proposal.mdanddocs/proposals/shell-proposal.mdthat telnet is demo-only plaintext, binds only to host loopback in the QEMU harness, preserves theTerminalSessionboundary, and will be replaced by the SSH gateway once host-key, user-key, account, audit, and persistence prerequisites land. Implemented by branch commit5d11b12at2026-04-25 20:06 UTC.make qemu-telnet-harnessproves127.0.0.1:2323 -> guest :23, password login,caps, thesessioncommand, and clean exit with no password, rawNetworkManager, rawProcessSpawner, raw TCP, or unknown-cap leakage in the host transcript. Replacing the gateway’s factory network/spawn authority with scoped listener and shell-launch caps is tracked in task records; it is not required for the host-local visible demo.
Telnet Over TLS Optional Track
Telnet over TLS is not a default main access interface. Keep it as an optional future transport for service terminals or certificate-heavy deployments after the certificate/TLS, durable identity, session lifecycle, audit, and scoped listener-authority prerequisites exist. SSH remains the main production operator CLI track, and WebShellGateway remains the main browser/agent access track.
Ordered gates before this can be considered production-shaped:
- Certificate/TLS server configuration, private-key custody, trust-store, and rotation primitives exist outside the kernel TCB.
- Client identity maps through durable account/session policy, preferably mTLS client certificates with password fallback only by explicit policy.
- Session lifecycle close propagation exists for terminal disconnect, process-tree exit, explicit logout, and administrator revocation.
- The gateway receives only scoped listener, TLS config, terminal-factory, session, broker, audit, and restricted-launch grants; no raw broad network or process-spawner authority.
- QEMU and host-network harnesses prove TLS handshake, failed client-auth behavior, terminal login, disconnect cleanup, and transcript redaction.
Remote Session CapSet Clients
Programmatic and GUI remote clients are a sibling track to terminal shells. A
regular host app – CLI, native GUI, Tauri backend, webapp gateway, desktop
tool, service client, or agent runner – should authenticate through the capOS
session/admission path, obtain a broker-issued remote CapSet view in its
trusted backend, and call provided capabilities over Cap’n Proto RPC. It should
not be forced to spawn capos-shell, and it should not be reduced to a
special-purpose chat proxy.
The first implementation slice exposes this path as a host-local development
endpoint. Default make run starts remote-session-capset-gateway and forwards
guest port 2327 to a loopback host port, preferring 127.0.0.1:2327 while
falling back to a free port when another QEMU run is already using it. The
focused make run-remote-session-capset-interop harness runs the Linux Rust
client, authenticates through SessionManager, lists a broker-shaped remote
CapSet, calls session/system-info DTO operations, and proves denial/stale
paths. This slice uses schema-framed Cap’n Proto DTOs; standard capnp-rpc
proxy transport and endpoint-backed service calls remain the next gates.
Detailed decomposition lives in Remote Session CapSet Client. Keep this track coordinated with SSH/WebShell work:
- SSH remains the production operator CLI terminal transport.
- WebShellGateway remains the browser/agent terminal and tool-proxy surface.
- Remote session CapSet clients are the programmatic and UI API surface for Linux host tools, desktop/Tauri apps, webapp gateways, service clients, and server-side agent runners.
- Optional UI-composition caps let capOS-side services and agents propose bounded panes, command palettes, visualizations, layout hints, and theme tokens through host-validated surfaces instead of treating “remote GUI” as only a window or terminal frame.
- All three paths consume the same
SessionManagerandAuthorityBrokermodel and must support non-password admission methods where policy enables them. - Browser JavaScript and model providers must not receive raw capOS caps; gateway-side workers hold the session CapSet and expose only terminal frames, command metadata, or bounded tool requests.
SSH Shell Gateway
Visible outcome: make run-ssh-shell boots capOS in QEMU with a host-local
forward to guest SSH, an ssh-gateway service authenticates a normal OpenSSH
client with a configured public key, launches capos-shell with an
SSH-backed TerminalSession, runs one shell command, and disconnects cleanly.
The shell must see the same terminal/session/broker boundary as the Telnet
demo, not raw TCP or SSH protocol authority.
Blocked by: Telnet Shell Demo for socket-backed TerminalSession,
cryptography/key-management for sign-only host keys, local account/key records
for authorized SSH keys, audit records for remote authentication decisions,
and persistent storage before production host or authorized keys are treated
as durable.
Closeout prerequisite: before this milestone closes, reconcile its target
name and host-harness placement with the run-target/init-mandate policy in
docs/backlog/run-targets-and-init-policy.md (Gate A naming split, Gate B
init mandate, Gate C test split, Gate D default-make run integration).
The current make run-ssh-shell working name and any scripted host harness
may need to become test-ssh-shell and be relocated, and default-run
exposure has to be addressed there, not as another run-ssh-* recipe.
Ordered gates:
- Document the first SSH gateway contract in
docs/proposals/ssh-shell-proposal.md: gateway authority, host-key custody, authorized-key mapping, accepted channel set, denied SSH features, terminal handoff, audit, resource limits, and teardown. - Close or explicitly preserve the scoped gateway authority gap for SSH
before implementation: the gateway must receive a manifest-declared
scoped listener or listener factory for only the configured SSH port, and
the spawned shell must receive no raw
NetworkManager,TcpListener,TcpSocket, or transport protocol authority. A temporary host-local demo compromise must stay documented in a task record and the harness must prove the child boundary withcaps. - [x] Scoped listener authority sub-slice:tcp_listen_authoritymanifest grants use the cap badge as a validated TCP port and mint a one-shotTcpListenAuthoritythat can create only that listener;make run-tcp-listen-authorityproves generic init can forward the scoped cap to a child without rawNetworkManager. - Terminal-host wiring sub-slice:
ssh-gateway-terminal-hostused manifest-scopedTcpListenAuthorityon the SSH development port andRestrictedShellLauncherto hand a socket-backedTerminalSessiontocapos-shellwhile proving the child lacked raw network, TCP, spawn, key-store, host-key, SSH gateway, terminal-factory, and launcher authority. This closed the scoped gateway authority gap for the bounded host-local proof. The demo and its smoke were retired 2026-06-10 with the kernel socket owner andSocketTerminalSession; the final OpenSSH transport must be rebuilt as a terminal host over the userspace network stack. - Add manifest-declared shell launch authority for the gateway. Prefer a
shell-only launcher or supervisor grant that can start only
capos-shellwith reviewed pass-through caps; do not grant broadProcessSpawnerauthority to the SSH gateway unless it is explicitly recorded as a host-local development compromise. - [x] Restricted shell launcher sub-slice:restricted_shell_launchermanifest grants forward an init-heldRestrictedShellLaunchercap to a child service.make run-restricted-shell-launcherproves the child service has no rawProcessSpawner,launchShellhas no binary selector and launches onlycapos-shell, session/profile mismatch and dangerous grant attempts fail closed, and the spawned shell uses the supplied session while lacking raw network, TCP, host-key, authorized-key-store, SSH gateway, and restricted-shell-launcher authority. - Add schema/design stubs for the minimum SSH support objects:
SshGatewayor equivalent service contract, sign-onlySshHostKeywrapper around aKeyVault/PrivateKey,AuthorizedKeyStore, and SSH-backedTerminalSessionconstruction. Do not expose private-key bytes, raw authorized-key storage, or vault administration to the spawned shell. Implemented as schema/type-surface stubs forSshGateway,SshHostKey,AuthorizedKeyStore,SshTerminalFactory,TcpListenAuthority, andRestrictedShellLauncher; no bootable kernel or userspace implementation is implied by this gate. - Add a development host-key path. Manifest-seeded keys may be used only
for QEMU proof and must be labeled non-production; production host keys
require the key-management and storage path. Implemented as
kernelParams.sshDevelopmentHostKeyplus the narrowssh_development_host_keykernel source. The focused proof ismake run-ssh-host-key; the development cap signs boundedssh-ed25519exchange hashes from the manifest seed, verifies against the configured public key in QEMU, denies wrong algorithms, and remains explicitly non-production. Persistent production host-key storage, rotation, and key management remain future work. - Add public-key user authentication. Accepted SSH keys map to principals
and allowed shell profiles;
SessionManagermints the session only after signature verification, andAuthorityBrokerstill decides the actual shell bundle. - [x] Public-key session bridge sub-slice:SessionManager.sshPublicKeychecks a configuredAuthorizedKeyStorerecord plus bounded fixture auth bytes/signature, mints aUserSessionwith the accepted principal/profile andpublicKeyauth strength, andmake run-ssh-public-key-authproves unknown, disabled, unsupported, and bad-signature paths fail closed before broker bundle minting. This is not full SSH transport authentication or shell launch wiring. - [x] AccountStore-bound session sub-slice:SessionManager.sshPublicKeyconsults the bootstrapRamAccountStoreafter signature verification (lookup_by_principal), so non-Activeaccount statuses (Disabled, Locked, RecoveryOnly) and missing principals fail closed before a session is minted. Each denial cause maps to a stable, principal-blankedauth=audit code (ssh-key-unknown,ssh-key-disabled,ssh-key-profile-not-allowed,ssh-bad-signature,ssh-account-missing,ssh-account-disabled,ssh-account-locked,ssh-account-recovery-only,ssh-account-lookup-failed,ssh-profile-kind-invalid,ssh-profile-not-interactive,ssh-auth-bytes-invalid).make run-ssh-public-key-authcovers the non-account-status codes; thessh-account-*codes need anAccountStoreManagerCapkernel cap source for runtime-mutated QEMU proofs (tracked indocs/backlog/local-users-management.mdGate 2). - Reject unsupported SSH features with protocol failures and audit reason
codes: password auth when disabled,
exec, SFTP/subsystems, port forwarding, agent forwarding, X11 forwarding, arbitrary environment import, and multiple active shell channels. - [x] Policy-surface sub-slice:capos-config::ssh_policyreturns allowed/denied decisions, SSH protocol failure classes, and stable audit reason codes for the narrow allowed path and the denied feature set, including second session-channel opens before any shell request. Password auth remains fail-closed until a real verifier/backoff path is part of the gateway policy.make run-ssh-feature-policyproves the table in QEMU. The full gateway item remains open until this policy is invoked byssh-gateway. - Implement the gateway as a terminal host. It owns SSH packet/channel
state and gives
capos-shellonly a cap namedterminalplus the normal scoped launch grants. The child must not receive raw network, host-key, authorized-key-store, key-vault, or broad spawn authority. - [x] Bounded terminal-host wiring sub-slice (retired 2026-06-10 with the kernel socket owner andSocketTerminalSession; the smoke now exits with a retirement diagnostic and a future terminal host must target the userspace network stack):make run-ssh-gateway-terminal-hostproved a generic-init child service can combine scopedTcpListenAuthority,AuthorizedKeyStore,SessionManager,AuthorityBroker, andRestrictedShellLaunchergrants to deny an unknown key, mint apublicKeysession from a configured key, reject a mismatched broker profile, accept the matching broker profile, convert one host-local TCP socket into aTerminalSession, and launchcapos-shellwithout giving the shell raw network, process-spawner, TCP listener/socket, host-key, authorized-key-store, SSH gateway, SSH terminal-factory, or restricted-shell-launcher authority. The proof keeps the listener service-live across shell exits, proves a second host TCP connection succeeds, and externally stops QEMU through the harness pidfile instead of treating service exit as success. This remains a bounded plain-TCP proof and does not complete full SSH packet/channel ownership or the OpenSSH harness gate. - Add
system-ssh-shell.cue,make run-ssh-shell, and a host harness usingsshagainst the forwarded port. The harness must prove one successful public-key login, one shell command, clean exit, unknown-key denial, disabled-password denial, denied forwarding/subsystem requests, and cleanup after client disconnect. - [ ] OpenSSH version-exchange slice: add a realssh-gatewayservice andsystem-ssh-shell.cueskeleton that accepts one host-local OpenSSH TCP connection, exchanges RFC 4253 identification strings, records the client software/version in bounded audit/proof output, and disconnects before key exchange without launching a shell. The normal compatibility harness should use/usr/bin/ssh; a separate low-level hostile TCP/banner fixture should prove malformed banners plus overlong identification strings fail closed. - [ ] KEXINIT and algorithm-selection slice: parse the unencrypted KEXINIT binary-packet exchange far enough to negotiate a pinned development algorithm set, reject unsupported algorithms with SSH disconnects, and keep the negotiated algorithm names out of any authority decision. The initial reviewed set should be exactly one modern KEX,ssh-ed25519host keys, one AEAD cipher/MAC pair, andnonecompression until rekey and broader algorithm policy exist. - [ ] Development key-exchange slice: complete the negotiated KEX, derive traffic keys from the shared secret, exchange hash, and session id per RFC 4253, callSshHostKey.signExchangeHashfor the SSH exchange hash, and complete the OpenSSH handshake without exposing private host-key bytes or raw entropy to the gateway’s child shell. Entropy is input for ephemeral KEX material, padding, and challenges; this remains non-production until host keys are durable and the entropy source has a reviewed production-quality policy. - [ ] OpenSSH public-key userauth slice: bind the OpenSSH userauth transcript toSessionManager.sshPublicKeyso the accepted key maps to the configured principal/profile, unknown keys are denied generically, and disabled password auth returns the expected SSH failure without invokingCredentialStore. - [ ] Channel policy slice: invokecapos-config::ssh_policyfor session-channel open, PTY, window-change, shell, exec, subsystem, forwarding, agent, X11, environment, and second-channel requests. The harness must prove the allowed shell path plus the denied feature requests with protocol-visible failures and sanitized audit reason codes. - [ ] SSH terminal launch slice: replace the plain-TCP terminal-host driver with the SSH channel-backed terminal path, launchcapos-shellthroughRestrictedShellLauncher, runsession,caps, andexitover OpenSSH, and prove disconnect cleanup for both client-close-before-shell and shell-exit-before-client-close. - Update
docs/proposals/shell-proposal.md,docs/proposals/boot-to-shell-proposal.md,docs/security/trust-boundaries.md, anddocs/proposals/index.mdwhen implementation begins so remote SSH login policy, terminal authority, and audit records stay aligned with the code.
Decomposed NIC Milestone
Move the NIC driver and TCP/IP stack out of the kernel into dedicated
userspace processes after the Telnet Shell Demo made the socket interfaces
capability-shaped. The Phase C userspace NIC driver and smoltcp network-stack
process have since landed and own the production socket path; make run-telnet and the other kernel-socket consumer smokes are retired rather
than preserved end-to-end, because the qemu-only kernel TCP listener and
socket owner were removed with that migration.
- Define first
DeviceMmio,DMAPool, andInterruptschemas (landed with the DDF capability surface). - Move virtio-net ownership into a userspace driver process holding only
DeviceMmio,Interrupt, andDMAPoolcaps (Phase C userspace NIC driver slices). - Split smoltcp into a separate userspace network-stack process that holds
the
Niccap from the driver and re-exports the Phase B socket interfaces (Phase C userspace network-stack process). - The kernel no longer depends on
smoltcp, and the userspace network-stack process re-exports the socket interfaces. Themake run-telnetend-to-end confirmation was retired instead of re-proven: the gateway demos sat on the removed kernel socket owner, and remote-shell coverage moved to the in-guest login surface and the future SSH gateway over the userspace stack.
Agent Shell / Agent Runner
The native shell’s agent mode must land before exposing the shell through a browser. The shell remains the trusted runner and session-cap holder. The model service receives prompts and returns structured tool calls, but never receives session caps, terminal caps, launcher authority, raw tokens, or secrets. Use a deterministic test model for the first proof.
Visible outcome: make run-agent-shell boots capOS in QEMU, grants
capos-shell a broker-issued LanguageModel cap plus per-tool permission map,
enters agent mode, exposes the current session bundle as typed tool
descriptors, executes one read-only tool call automatically, requires consent
or step-up for a mutating/admin-shaped call, handles user cancellation, and
records redacted audit output.
Ordered gates:
- Add the first agent-runner schema/interfaces:
LanguageModel,ModelInfo,ToolDescriptor,ToolCall,ToolResult, permission mode metadata, and bounded streaming/cancel semantics. Keep tool calls structured; do not parse model text as shell commands. - Extend
AuthorityBrokersession profiles so an operator shell can receive aLanguageModelcap and a per-tool permission map without receiving model-admin, model-catalog, or provider-token authority. - Add a deterministic in-tree
LanguageModeltest service that emits scripted tool calls for QEMU proofs. Do not block this milestone on large local model weights, remote providers, GPU, or storage. - Implement native shell agent mode: build the tool table from granted
session caps and schema metadata, stream model turns, gate each tool call
through
auto/consent/stepUp/forbidden, invoke only the capabilities held by the shell runner, and feed outcomes back into the loop. - Wire consent, step-up, cancellation, timeout, quota, and audit behavior. User interrupts beat model momentum; denied or cancelled tool calls become ordinary tool outcomes instead of hidden control flow.
- Add
make run-agent-shelland a scripted QEMU harness that proves read-only auto execution, denied forbidden/admin tool exposure, one consent or step-up prompt, cancellation, and redacted audit records. - Update
docs/proposals/llm-and-agent-proposal.md,docs/proposals/shell-proposal.md, anddocs/tasks/README.mdto record that WebShellGateway hosts this agent-capable shell/runner instead of defining a separate browser-side agent authority model.
WebShellGateway
Add the browser-hosted terminal and authentication gateway after both remote
TerminalSession proof and agent shell are in place. The gateway owns
HTTP/WebSocket or equivalent transport, TLS/origin/RP-ID validation, WebAuthn
challenge/response, terminal rendering, and session teardown. It launches the
same agent-capable native shell with the same broker-issued session profile.
Blocked by: Telnet Shell Demo for socket-backed TerminalSession, Agent
Shell / Agent Runner, passkey challenge/credential support in auth/session
services, and TLS/origin/RP-ID policy. OIDC is a follow-up path on the same
gateway, not a prerequisite for the first WebAuthn shell.
Visible outcome: make run-webshell boots capOS in QEMU with host-local
forwarding to the web gateway, a headless browser harness opens the terminal
UI with a virtual WebAuthn authenticator, authenticates, runs one shell or
agent command, logs out or closes the tab, and verifies clean
shell/process/session teardown plus a recorded transcript/proof line.
Ordered gates:
- Define the web terminal stream protocol over WebSocket or an equivalent browser transport: input, output, resize, paste, close, cancellation, flow control, session IDs, and bounded buffering.
- Add WebAuthn/passkey credential and challenge support: public-credential records, single-use bounded challenges, entropy fail-closed behavior, origin/RP-ID binding, user-presence/user-verification policy, sign-count handling, rate limiting, and redacted audit events.
- Add TLS and browser origin policy for QEMU and deployment modes. The first harness may use a local development trust path, but the gateway must have explicit Host/Origin/RP-ID checks and no production plaintext mode.
- Implement
WebShellGatewayas a terminal host service: accept browser sessions, authenticate, request the narrow shell/agent bundle fromAuthorityBroker, create or wrap a web-backedTerminalSession, spawncapos-shell, proxy terminal events, and release all session resources on logout, tab close, timeout, or shell exit. - Add
system-webshell.cueand manifest/grant wiring. The gateway gets only listen/TLS/auth/session/broker/restricted-launch grants needed for the job; the spawned shell does not receive raw network, raw auth material, model-provider tokens, or broad process-spawn authority. - Add
make run-webshellandqemu-webshell-harnesswith a headless browser virtual authenticator, transcript capture, login/command proof, logout/close proof, and assertions that failed auth and stale browser sessions do not leave a live shell. - Add optional OIDC authorization-code + PKCE login on the same gateway
after the OAuth/OIDC service exists. ID-token verification and
acr/amrmapping feedSessionManager/AuthorityBroker; raw tokens do not enter the shell or browser terminal transcript. - Update
docs/proposals/boot-to-shell-proposal.md,docs/proposals/shell-proposal.md,docs/proposals/llm-and-agent-proposal.md, and security trust-boundary docs with WebShellGateway authority, auth, terminal, audit, and teardown rules.
Network Usability And Post-smoltcp Backlog
This page decomposes the work that makes capOS networking usable after the Phase C userspace L4 stack exists. It deliberately sits beside the lower-layer Phase C track in Phase C Userspace NIC Driver Relocation and the cloud/Web UI chain in Hardware, Boot, and Storage.
The first public GCE Web UI path remains IPv4-first. Its network blockers are
Phase C userspace L4, DHCP/IPv4 configuration, ARP/default-route reachability,
private GCE proof, and the reviewed public HTTPS ingress posture. DNS,
ping, IPv6, packet tracing, and advanced transport policy improve usability
and diagnostics, but they do not block first public self-hosted Web UI unless a
later ingress policy explicitly chooses them as health or routing requirements.
Current State Boundaries
- Production non-
qemuL4 has a local Phase C 7c-ii(b) serve-from-userspace proof:cloud-prod-userspace-network-stack-smoltcp-local-proofboots the non-qemucloudboot manifest, grants an application client only a userspace-servedTcpListenAuthority, and completes one hostfwd TCP request/response through servedTcpListener/TcpSocketcaps. The qemu-only kernelsmoltcp/ virtio-net path still exists for local fixtures and transitional TCP/UDP caps; the legacy kernel socket owner is cleanup-only after the served-socket proof. - The current
Niccap is raw-frame oriented. It copies frames as inlineDatathrough manager-owned buffers and exposes no host-physical or device-usable address to userspace. - The landed
Nic.receive @1is single-frame per call: it posts one RX buffer, drains one frame (or resets the device on an empty poll), and frees the buffer – it keeps no pool of RX buffers armed between calls and has no non-resetting “no frame yet” path. Multi-frame asynchronous TCP needs a sustained, keep-armed receive, designed as thereceivePoll @4bounce-RX-pool primitive in Phase C Userspace NIC Driver Relocation and landed bycloud-prod-nic-driver-userspace-sustained-receive-pool-local-proof. That slice is the prerequisite for Phase C 7c-iii (TcpListener/TcpSocket). - The first local DHCP IPv4 configuration proof is done:
cloud-prod-network-stack-dhcp-ipv4-config-local-prooffollows the served userspace smoltcp/socket proof, acquires a DHCPv4 lease over theNiccap, installs IPv4 address/default-route state, resolves gateway and same-subnet ARP neighbors, and feeds userspace-servedNetworkManager.getConfig. Renewal/rebind/expiry lifecycle, DNS option publication, and operator-visible lease status remain follow-up work. - A POSIX DNS smoke exists:
demos/posix-dns-resolver/. It manually builds one DNS A query and sends it through the kernelUdpSocketcap to QEMU slirp DNS at10.0.2.3. It is not a system resolver service, not a typedDnsResolvercap, and not agetaddrinfo//etc/resolv.confbridge. - IPv6 is already decomposed as a separate lane in Hardware, Boot, and Storage. Do not duplicate that lane here; link to it when diagnostics or resolver work needs dual-stack behavior.
User-facing Stories
Usable networking means operators and ordinary services can answer concrete
questions without reading QEMU logs or proof tokens. Each story below maps to
the task record that owns it and is classified against the first public GCE Web
UI critical path stated above: Critical path items block the first public
self-hosted Web UI proof; Diagnostics and Completeness items improve
usability but do not block it unless a later ingress policy explicitly promotes
one (see the IPv4-first scoping in the page header and DHCP Plan below).
Two of these stories are satisfied by configuration proofs that already live in
the Current State Boundaries and DHCP Plan sections rather than by a
usability tool: an operator gets a non-fixture address, default route, and
userspace-served config status from
cloud-prod-network-stack-dhcp-ipv4-config-local-proof
(the first local DHCPv4/IPv4 config proof, critical path), and the basic
socket substrate a server binds against comes from the Phase C socket-cap and
TcpListener/TcpSocket proofs, with production manifest wiring owned by
cloud-prod-userspace-network-stack-smoltcp-local-proof (critical path).
The usability tasks below layer status, resolution, diagnostics, and server
semantics on top of those.
Operator stories
What an operator needs to observe and diagnose the running network without
holding raw NIC, DMA, or NetworkManager authority:
| Operator story | Owning task record | Web UI critical path |
|---|---|---|
| What interfaces exist, is link up, what MAC/address/prefix/default route/DNS config is active, and did it come from DHCP, static manifest, or a test fixture? | network-operator-status-tool-local-proof | Diagnostics (non-blocking) |
| Which sockets/listeners are active, which authority granted them, what peer/port is bound, and are calls blocked on accept/recv/send/backpressure? | network-operator-status-tool-local-proof over network-transport-status-cap-local-proof (done) | Diagnostics (non-blocking) |
| Does the stack publish DHCP-derived IPv4 address, default route, and gateway-neighbor state instead of a static fixture? | cloud-prod-network-stack-dhcp-ipv4-config-local-proof | Critical path |
Can a service bind a listener after boot without depending on the static QEMU 10.0.2.15 assumption? | cloud-prod-remote-session-web-ui-l4-local-proof over the done DHCP config proof and Phase C socket caps | Critical path |
| Is a DHCP lease active, and what are its renewal/rebind/expiry state and operator-visible status? | network-dhcpv4-lease-lifecycle-local-proof (done) | Completeness (non-blocking) |
Can an operator run bounded ping / route / DNS-lookup / socket-status checks? | network-ping-diagnostics-tool-local-proof (done), network-operator-status-tool-local-proof | Diagnostics (non-blocking) |
Can an operator run bounded IPv6 ping6? | network-ping6-diagnostics-tool-local-proof (over the IPv6 lane) | Diagnostics (non-blocking) |
| Can a debugging authority capture bounded per-interface packets/summaries without arbitrary NIC, DMA, or raw network-manager authority? | network-packet-trace-authority-local-proof | Diagnostics (non-blocking) |
Application stories
What an ordinary service or POSIX program needs to use the network through narrowly-scoped capabilities instead of raw socket/manager authority:
| Application story | Owning task record | Web UI critical path |
|---|---|---|
| Can a process resolve a hostname through a typed resolver capability instead of holding raw UDP socket authority? | network-system-dnsresolver-cap-local-proof (done) | Completeness (non-blocking) |
Can POSIX software call getaddrinfo and read resolver config through the adapter without owning a broader NetworkManager? | posix-getaddrinfo-system-resolver-bridge-local-proof | Completeness (non-blocking) |
| Can a long-lived server rely on readiness, cancellation, and backpressure instead of assuming every socket call eventually completes? | network-socket-readiness-poll-cancel-backpressure-local-proof (done) | Completeness (non-blocking) |
Can POSIX software wait for socket readiness through poll/select over the settled readiness model? | posix-socket-poll-select-bridge-local-proof (done) | Completeness (non-blocking) |
| Can a server set keepalive and connect/accept/recv timeouts? | network-transport-keepalive-timeout-policy-local-proof (done) | Completeness (non-blocking) |
| Can a server read connection state, backpressure depth, active keepalive/timeout, congestion controller, and interface MTU/MSS? | network-transport-status-cap-local-proof (done) | Completeness (non-blocking) |
DNS resolution is listed as Completeness rather than Critical path
because the selected public ingress can route to a backend by configured
address/load-balancer target; it becomes a deployment-policy dependency only
under the conditions in System Resolver Plan below.
DHCP Plan
DHCP belongs in the userspace network-stack process or a narrowly-authorized userspace configuration service, not in the kernel. The kernel should stage only the minimal capabilities needed to start the network stack and deliver socket/result caps. Lease parsing, renewal timers, rebind behavior, expiry, DNS/search-domain extraction, and status reporting are policy/state-machine work and should not be added to the qemu-only kernel smoltcp path.
The ordering is:
- Phase C slice 7a proves
smoltcpcan run in a userspace process over theNiccap. - Phase C 7b, 7c-i, 7c-ii(a), and 7c-iii prove the socket-cap and
TcpListener/TcpSocketsubstrate; 7c-ii(b) locally proves the production manifest through the selected serve-from-userspace path. cloud-prod-network-stack-dhcp-ipv4-config-local-proofis done. It implements the first local DHCPv4 lease/configuration proof: lease acquisition, IPv4 address, prefix/netmask, default gateway, and ARP neighbor proof.network-dhcpv4-lease-lifecycle-local-proofis done. It extends that first proof into the full DHCPv4 lease lifecycle. A deterministic in-process fixture DHCP/ARP responder drives the real userspacesmoltcpDHCPv4 client under a harness-controlled synthetic clock through initial lease acquisition, T1 unicast renewal, T2 broadcast rebind, and lease expiry; the servedNetworkManager.getConfigstatus surface reports a fail-closed zero state on expiry (never stale lease data) and resolves static-config precedence over a live DHCP lease; DNS server and search-domain options are extracted from the wire and held as resolver inputs without being exposed throughgetConfig. Proof:make run-network-dhcpv4-lease-lifecycle. The real-network initial acquisition over theNiccap stays proven bymake run-cloud-prod-network-stack-dhcp-ipv4-config.
System Resolver Plan
capOS should expose DNS through a typed resolver capability, not by making every
consumer hold NetworkManager or raw UDP authority. The first resolver should
be a stub resolver service, not a recursive resolver:
- Inputs: DHCP-provided nameserver/search-domain options from the IPv4 config path and optional static manifest resolver config.
- Authority: one narrowly-scoped UDP socket or resolver-upstream authority plus
Timer; no broad
NetworkManagerunless the slice explicitly justifies it. - Output: a typed
DnsResolvercap with bounded query names, record types, timeouts, response-size limits, negative/error mapping, and observable configuration provenance. - POSIX bridge:
getaddrinfoand a bounded/etc/resolv.confprojection call intoDnsResolver; POSIX callers should not parse raw DHCP state or own upstream sockets.
The typed resolver capability landed as
network-system-dnsresolver-cap-local-proof.
The POSIX bridge landed as
posix-getaddrinfo-system-resolver-bridge-local-proof:
libcapos-posix now implements getaddrinfo / freeaddrinfo / gai_strerror
over a granted dns_resolver endpoint (resolver status -> typed
addrinfo/EAI_*; no ambient UDP fallback), plus a read-only
/etc/resolv.conf projection derived from the resolver status (writes
fail-closed EACCES, absent without the cap). Proof: make run-posix-getaddrinfo. AAAA / sockaddr_in6, AI_* flags, and an
/etc/services table remain follow-ups (getaddrinfo fails closed on each:
EAI_FAMILY / EAI_BADFLAGS / EAI_SERVICE).
DNS does not normally block the first GCE Web UI proof because the selected public ingress path can route to a backend by configured address/load-balancer target. DNS becomes a deployment-policy dependency when capOS itself must resolve outbound names, when the public proof asserts a DNS hostname end to end, or when IPv6 ingress adds AAAA/certificate policy.
Beyond smoltcp
The near-term plan is not to replace smoltcp or hand-roll TCP algorithms.
Phase C should first move smoltcp out of the kernel, preserve the existing
socket contract, and make its behavior observable. The distinction this lane
keeps is between relocation (Phase C slices 7a-7c: run the selected
smoltcp build in userspace and preserve the socket contract) and transport
policy/status (the capOS control plane around that stack, decomposed below).
Relocation does not require any new transport mechanic; the policy/status work
starts only after the stack is observable.
What the selected smoltcp build actually exposes
smoltcp is pinned at version 0.13.0 (Cargo.lock). capOS does not build the
crate’s default feature set; it enables narrow per-proof subsets:
- The qemu-only kernel fixture (
kernel/Cargo.toml) enablesalloc,medium-ethernet,proto-ipv4,socket-tcp, andsocket-udp. - The early Phase C userspace 7a/7b demos
demos/cloud-prod-network-stack-process-smoltcp-skeleton-smokeanddemos/cloud-prod-network-stack-smoltcp-socket-caps-smokeenablealloc,medium-ethernet,proto-ipv4, andsocket-udponly. Those early demos are UDP-only and should not be read as the current full Phase C L4 status. - The later Phase C TCP proofs
demos/cloud-prod-network-stack-smoltcp-tcp-listener-roundtrip-smokeanddemos/cloud-prod-network-stack-smoltcp-tcp-socket-cap-ipc-smokeenablealloc,medium-ethernet,proto-ipv4, andsocket-tcp. The completedcloud-prod-userspace-network-stack-smoltcp-local-proofbuilds on that substrate and proves a local servedTcpListenAuthority/TcpListener/TcpSocketrequest/response through the userspace network stack. - The selected IPv4 Web UI path now has a local DHCP/IPv4 configuration proof
over smoltcp’s
socket-dhcpv4path in the Phase C userspace stack. Landed proof stops at config/status, route, and ARP neighbor evidence; the local bounded ICMPv4 Echo Reply proof is also done for diagnostics.socket-dns, the operator IPv4 ping tool, the localremote-session-web-uiL4 proof, private GCE reachability, and public ingress/TLS remain separate gates.
None of the IPv4 TCP builds cited above enables socket-tcp-reno or
socket-tcp-cubic. Those features are what compile smoltcp’s Reno and CUBIC
controllers into the
CongestionControl enum; without them the only available variant is
CongestionControl::None, which is also smoltcp’s default. capOS therefore
runs with no congestion control today as a consequence of its build
configuration, not as a reviewed policy choice. Selecting Reno (or CUBIC,
which uses f64) is a build-feature flip plus a set_congestion_control call,
not a custom algorithm.
For read-only status, smoltcp’s TCP socket already exposes the introspection
capOS would surface: connection state (state() over the TCP state machine),
local_endpoint()/remote_endpoint(), liveness predicates
(is_open/is_active/is_listening, may_send/may_recv,
can_send/can_recv), buffer sizes (send_capacity/recv_capacity), and the
current backpressure depth (send_queue/recv_queue bytes). Keepalive and idle
timeout are policy setters with matching getters
(keep_alive/set_keep_alive, timeout/set_timeout). There is no
per-socket getter for negotiated MSS, RTT, or retransmission counts in 0.13.0;
MTU is an Interface/phy::Device property, so MTU/MSS status must be sourced
from the interface and device capabilities, not from the TCP socket.
Status capOS must surface
Read-only transport status the socket/listener caps should expose, each backed by an existing smoltcp getter (or interface property) so it records selected behavior rather than asserting new mechanics:
| Status | smoltcp / interface source |
|---|---|
| Connection state | tcp::Socket::state() |
| Local / remote endpoint | local_endpoint() / remote_endpoint() |
| Send/receive backpressure depth | send_queue() / recv_queue() vs send_capacity() / recv_capacity() |
| Readiness / liveness | may_send/may_recv, can_send/can_recv, is_active/is_listening |
| Active keepalive / idle timeout | keep_alive() / timeout() |
| Active congestion controller | congestion_control() (today always None) |
| Interface MTU and configured-MTU source | Interface/phy::Device capabilities, manifest config |
| Listener backlog pressure | accepted-socket count vs configured backlog |
| Close / error / reset reason | socket close transition plus the cap/network.rs error mapping |
v0 classification
- v0 policy inputs (operator/service-settable): per-socket keepalive
interval and connect/recv/idle timeout (smoltcp
set_keep_alive/set_timeoutplus connect/accept/recv deadlines), and listener backlog bound. These map to existing smoltcp setters and to call-level deadlines. - v0 read-only status: the status table above — exposed through the socket,
listener, and
NetworkManager-side status surface without letting callers mutate stack internals. - Deferred until workload evidence: congestion-control algorithm selection, path-MTU discovery, TCP-mechanic tuning (window scaling, Nagle/quickack policy), and any stack replacement. The default is to observe and surface the selected stack’s behavior first.
Decomposed follow-ups
- Cancellation, readiness, close, and backpressure semantics are settled by
network-socket-readiness-poll-cancel-backpressure-local-proof(done); the POSIXpoll/selectbridge over that model is settled byposix-socket-poll-select-bridge-local-proof(done). The settled readiness states map to POSIX event bits in the sharedcapos-rt::pollselectcore (POLLIN/POLLOUT/POLLHUP/POLLERR/POLLNVAL, no stale readable/writable after close/release); thelibcapos-posixCpoll()/select()surface and<poll.h>/<sys/select.h>headers delegate to it and fail closed on unsupported event bits / badnfds/ closed fds. The proof is an in-process smoltcp fixture (harness=in-process-smoltcp-fixture,posix_surface=demo-local-model) plus thec-libc-surfaceC-surface checks;make run-posix-socket-poll-select. Blocking readiness (aPollablecap) is the follow-up lane, since the v0UdpSocket/Pipecaps expose no non-blocking readiness method. - Read-only transport status (the table above, including congestion-control
reporting and interface MTU/MSS reporting) is settled by
network-transport-status-cap-local-proof(done). The local proof is an in-process smoltcp fixture (harness=in-process-smoltcp-fixture,status_surface=demo-local-model); the production cap/schema wiring of the status surface is the follow-up lane. - Keepalive and connect/accept/recv timeout policy inputs are owned by
network-transport-keepalive-timeout-policy-local-proof(done). The local proof is an in-process smoltcp fixture (harness=in-process-smoltcp-fixture,policy_surface=demo-local-model); the production cap/schema wiring of these inputs is the follow-up lane. That lane should model connection-refused as its own terminal call outcome (the v0 demo’sDeadlineWaiteronly distinguishes timeout from a still-parked call, proving refused-vs-timeout distinctness at the socket layer rather than in the waiter abstraction). - Congestion-control evaluation is a deliberately deferred lane, not a
runnable task. It may only open after the read-only transport-status proof
lands and a workload produces evidence (loss/throughput/latency under a real
capOS network server) that the default
CongestionControl::Noneis inadequate. Its entry criteria are: a reproducible workload, a recorded baseline underNone, and a decision to flip thesocket-tcp-reno/socket-tcp-cubicbuild feature (configuration, still not a custom algorithm) before any hand-rolled TCP mechanic is even considered. Replacing smoltcp’s TCP mechanics remains speculative until that evidence exists.
Task Lanes
Docs/status lanes (both done 2026-06-03):
network-ux-user-story-groundingaudited current user stories, status gaps, and command/tool vocabulary, and mapped each operator/application story to its owning task record underUser-facing Stories.network-transport-policy-status-decompositiondecomposes congestion-control, timeout, keepalive, MTU/MSS, and transport status decisions against the actual stack version.
Blocked behavior/read-side lanes:
network-operator-status-tool-local-proofis done: it adds the operator-visibleip addr/ route / DNS / link / socket-state equivalent over the Phase C userspace stack. A network-stack server acquires a real IPv4 DHCP lease, snapshots link/MAC/address/prefix/ route/gateway-neighbour/DNS/search-domain/lease-state/socket state, and serves it over a read-onlynetwork_statusendpoint to a separately spawned status tool. The tool holds noNetworkManagercap, prints a bounded status table reflecting the live stack state (distinguishing available DNS from the unavailable search domain), and observes the fail-closed rejection of a forged socket-creation call. Proof:make run-network-status-tool. Promoting the demo-local status surface to a first-classNetworkStatusschema interface is deferred (it would cross the schema/generated-bindings conflict domain).network-dhcpv4-lease-lifecycle-local-proofis done: it extends the first DHCP config proof into a real lease lifecycle (renewal, rebind, expiry/fail-closed, static precedence, DNS option publication) viamake run-network-dhcpv4-lease-lifecycle.network-system-dnsresolver-cap-local-proofis done: it adds a typedDnsResolvercapability with a strict cross-process authority split. A resolver server owns the upstream-DNS authority (it runs the query over a realsmoltcpUDP socket against a configured upstream, with the upstream isolated in-process as a deterministic DNS responder under a synthetic clock), sources resolver config from a static-manifest entry plus a modelled DHCP option-6 entry with observable provenance, and serves a read-onlyDnsResolverendpoint. A separately spawned resolver tool holds only the endpoint – noNetworkManager,Nic, or UDP socket authority – so it resolves bounded A/AAAA hostnames through the cap and cannot resolve names by ambient network authority. The proof exercises a resolved A record, a resolved AAAA record (A/AAAA-capable API shape), NXDOMAIN -> not-found, a silent upstream -> typed timeout, fail-closedunavailablewith no upstream config, a status surface reporting config source/active upstreams/last error (no packet payloads or raw DHCP leases), and the fail-closed rejection of a forged raw-upstream call on the read-only endpoint. No schema, kernel, or capos-rt change: like the operator status surface, the resolver endpoint is an interface-agnostic protocol local to the demo, and promoting it to a first-classDnsResolverschema interface is deferred to avoid the schema/generated-bindings conflict domain. Proof:make run-network-system-dnsresolver.posix-getaddrinfo-system-resolver-bridge-local-proofbridges POSIXgetaddrinfo/ resolver configuration toDnsResolver.network-ping-diagnostics-tool-local-proofis done: it adds the bounded local IPv4 ping diagnostics tool over the done ICMPv4 Echo Reply lane, proving same-subnet and gateway-routed echo success, malformed-reply drop, timeout/unreachable classification, and retry/payload bounds. Proof:make run-network-ping-tool.network-ping6-diagnostics-tool-local-proofis done: it adds the bounded local IPv6 ping diagnostics tool over the existing IPv6/ICMPv6 lane without changing the IPv4-first Web UI critical path.network-socket-readiness-poll-cancel-backpressure-local-proofis done: it settles usable server semantics for readiness (accept/read/write/closed/reset/config-unavailable), parked-call cancellation, close/stale-waiter rejection, and send/receive backpressure. A single proof process drives two real userspacesmoltcpinterfaces wired by an in-process frame shuttle under a synthetic clock and asserts each case straight from real smoltcp getters (state,may_*/can_*,*_queuevs*_capacity). Proof:make run-network-socket-readiness. The POSIXpoll/selectbridge landed inposix-socket-poll-select-bridge-local-proof(done), which exposes the surface only once implemented and proven.network-transport-status-cap-local-proofis done: it surfaces read-only transport status (connection state, endpoints, send/recv backpressure depth, active keepalive/timeout, active congestion controller, interface link/IP MTU with MSS marked derived/not-exposed, listener backlog pressure, and the close/reset reason mapped onto the cap/network.rs NetworkError vocabulary) over the userspace stack, and proves the status read is strictly read-only (fingerprint unchanged, zero frames emitted). A single proof process drives two real userspacesmoltcpinterfaces wired by an in-process frame shuttle under a synthetic clock (harness=in-process-smoltcp-fixture,status_surface=demo-local-model). Proof:make run-network-transport-status; the production cap/schema wiring of the status surface is the follow-up lane.network-transport-keepalive-timeout-policy-local-proof(done) adds keepalive and connect/accept/recv timeout policy inputs over the userspace stack.network-packet-trace-authority-local-proofadds bounded per-interface packet/debug trace authority for diagnostics. The local proof (make run-network-packet-trace) feeds every transmit/receive frame of one real userspace-smoltcp DHCP bring-up path through a boundedPacketTrace: a fixed capture capacity (so the drop counter is exercised), a fixed per-packet header-only byte cap (payload_policy=header-only-no-body– at most the leading L2/L3/L4 header bytes of any frame are recorded, packet bodies are never captured), a direction filter, an expiry deadline, and capture/drop/admission counters with grant provenance (which authority enabled the trace and why). The captured trace is served over a read-only endpoint to a reader granted only a console plus that endpoint – noNic,DMAPool,DeviceMmio,Interrupt, orNetworkManager– so the diagnostic authority is strictly observe-only: forged transmit/reconfigure/open-socket calls are rejected fail-closed, and a sibling probe holding no trace cap cannot observe any packet. Payload-visibility policy: the trace exposes only bounded headers for protocol diagnosis (DHCP/ARP/IPv4-UDP classification), never application payload, and it transfers no device or socket authority – this is why the authority is diagnostic-only and is grounded in Debug, Trace, and Profiling Authority (the read-only sampler authority class, not the read/writeDebugSessionclass). Promotion to a first-classPacketTraceschema interface is deferred to avoid the schema/generated-bindings conflict domain, matching the sibling status/DNS diagnostic proofs.
POSIX Adapter Phase P1.4: Running dash
Long-form decomposition for the POSIX adapter Phase P1.4 dash port. Root task
records under docs/tasks/ select dispatchable POSIX work and link here; the
executable per-step checklist is in docs/proposals/posix-adapter-proposal.md
Task 4; the design rationale and validation smoke contract are in
docs/proposals/posix-adapter-proposal.md Phase P1.4 and Open
Questions §1 (shell candidate) + §7 (fd 0 backing). Open Question §6
(fork policy = Variant A recording shim) is already a final decision
in the proposal and does not gate P1.4.
What “Running dash” Means in v0
The validation smoke is make run-posix-shell-smoke. It boots a
focused manifest that grants:
- a
TerminalSessioncap for stdio, - a read-only bootstrap-granted
Directorycap rooted at a tiny in-rodata pseudo-fs (the resolver remainsNamespace-shaped for forward parity; the v0 manifest grants aDirectorybecause that is what Storage Phase 3 slice 2 ships as a kernelCapObject), - a
ProcessSpawnernarrowed to one allowed binary (ls-shim), - and a
Timercap.
tools/qemu-posix-shell-smoke.sh pipes the heredoc ls; echo done
into the shell’s fd 0, asserts done on the kernel log, asserts two
clean-exit log entries (shell + ls-shim), and asserts clean QEMU
exit. Stretch goal: cat foo | grep bar end-to-end against
demos/cat-shim/ and demos/grep-shim/, exercising the P1.3 Pipe
primitive through a shell pipeline.
This is intentionally narrow: no job control, no signal delivery, no
real filesystem persistence, no ulimit (a v0 chdir / cwd string with
cwd-relative resolution has since landed – see Slice 4 below). The point
is to prove that a real POSIX C program (not a capOS-native shell)
boots, parses scripts, dispatches subprocesses through
fork+execve, reads stdin, writes stdout, and exits cleanly under
QEMU.
Prerequisites Already Landed
- P1.1 libcapos C substrate (
fe5f5208,2026-05-05 13:28 UTC): Rust staticlib mirror ofcapos-rt,_startshim, fixed heap,malloc/free/calloc/realloc,console_write_line. - P1.2 UDP + DNS resolver smoke (
2026-05-05 21:21 UTC):libcapos-posixerrno TLS cell,clock_gettime/gettimeofdayoverTimer, fd-table dispatch shape,__errno_location(). - P1.3 Pipe + recording-shim fork-for-exec (
2026-05-07 09:55 UTC, fix-slice through05b528732026-05-07 21:07 UTC): kernelPipecap,ProcessSpawner.createPipe, fd-tableFdBacking::Pipe, recording-shimfork/execve/waitpid/_exit, directposix_spawn/posix_spawn_file_actions_*. The Variant A contract:execve()returns the synthetic child pid on success. - Storage Phase 3 slices 1-3 (slice 1
d06dff6bat2026-05-14 19:31 UTC, slice 2b11ec9e4at2026-05-14 22:30 UTC, slice 3804a3f41at2026-05-14 23:23 UTC): RAM-backedFile/Directory/Store/NamespaceCapObjects withKernelCapSource::file/directory/store/namespacegrant sources. These are the v0 backing for the dash smoke’s read-only in-rodata pseudo-fs. - WASI bounded env grant (
5f5028e7,2026-05-13 11:05 UTC): reference shape for a bounded text env grant oninitConfig.init(wasiEnv :Text). The dash port mirrors this for its env vector. setjmp/longjmpprecursor (libc-setjmp-longjmp,2026-05-25 21:11 UTC): the x86_64 SysVsetjmp/longjmpC-ABI primitive plusjmp_bufand a<setjmp.h>header. This was absent from the original P1.4 surface table, but dash’s exception/interpreter control flow is built onsetjmp/longjmpover a realjmp_buf(pervasive inerror.h/main.c/eval.c/parser.c/trap.c/ …), so it is a hard precursor for the dash build pipeline and shell smoke. Implemented inlibcapos/src/setjmp.rs(global_asm), exposed throughlibcapos-posix/include/capos/posix/setjmp.h, and proven in QEMU viamake run-posix-setjmp(direct call returns 0, alongjmpfrom a deep recursion resumessetjmpwith the passed value, andlongjmp(env, 0)returns 1). Nosigsetjmp/siglongjmp: dash uses only the plain primitive and the v0 signal layer has no asynchronous delivery.
The Phase 2 Open Question §1 (dash candidate) and §7 (fd 0 backing =
TerminalSession) are now promoted from working answers to final
decisions (docs/proposals/posix-adapter-proposal.md “## Open
Questions” §1/§7, Decided (P1.4 Slice 1, 2026-05-24 00:53 UTC));
that promotion was the first dispatch slice of P1.4 (Slice 1 below).
Decomposition
Slice 1: open-question closures (docs-only)
Status: closed (P1.4 Slice 1, 2026-05-24 00:53 UTC). Open Questions
§1 (dash 0.5.13.x) and §7 (fd 0 backing = TerminalSession) are now
Decided in docs/proposals/posix-adapter-proposal.md; the §1
candidate-survey cross-reference and the Phase P1.4 “Open question
closures” bullet are reconciled to match.
Two open questions in docs/proposals/posix-adapter-proposal.md must
become final decisions before any code lands:
- §1: confirm dash 0.5.13.x as the v0 candidate. Alternatives
surveyed: busybox
ash, oksh, toysh, custom Rust shell. dash wins on size, POSIX strictness, and single-purpose/bin/shposture. - §7: confirm
TerminalSessionas the canonical fd 0 / 1 / 2 backing for the v0 smoke. AnFdBacking::Terminalvariant inlibcapos-posix/src/fd.rsplusposix_inherit_stdio()adoption is the implementation shape.
Promotion = strike the “Working answer” phrasing in the proposal,
replace with “Decided (P1.4 Slice 1,
Slice 2: typed clients in capos-rt
Status: closed. The typed clients and interface-ID re-exports are available
from capos-rt; make run-posix-file now exercises them through
libcapos-posix.
TerminalSessionClient and the TERMINAL_SESSION_INTERFACE_ID
re-export already ship from capos-rt/src/client.rs and
capos-rt/src/lib.rs; no work there. The net-new wrappers, mirroring
the existing PipeClient / UdpSocketClient shape, are:
FileClient:read,write,stat,truncate,sync,closeover theFileinterface methods.DirectoryClient:open,list,mkdir,remove,subover theDirectoryinterface methods, returning typedFileClient/DirectoryClientprojections of the transferred result caps.
Add re-exports for the existing FILE_INTERFACE_ID /
DIRECTORY_INTERFACE_ID constants (already defined in
capos-config/src/lib.rs) from the pub use capos_config::{...}
block in capos-rt/src/lib.rs.
Slice 3: fd backing for File / Directory / Terminal
Status: closed. FdBacking::File, FdBacking::Directory, and
FdBacking::Terminal are present; read, write, close, lseek,
opendir, readdir, and closedir are wired for the RAM-backed root
Directory path. The stat / fstat / access / unlink metadata/remove
follow-up is closed by the posix-p1-4-file-metadata slice: stat / fstat
fill a struct stat (sys/stat.h) from File.stat, access is an
existence check (single-identity v0, mode ignored), and unlink resolves the
parent Directory and calls Directory.remove. Proven by the extended
make run-posix-file smoke. The file-resize follow-up is partially closed by
the posix-ftruncate-truncate-file-resize slice: ftruncate(fd, length) and
truncate(path, length) drive File.truncate @3 over the RAM-backed root and
are proven by the same make run-posix-file smoke (ftruncate shrink ok /
truncate by-path ok markers). The fsync(2) / fdatasync(2) C shims over
FileClient::sync_wait are implemented in libcapos-posix/src/file.rs;
writable-disk (writable_fs) truncate beyond the RAM-backed root remains open.
Extend libcapos-posix/src/fd.rs with three new FdBacking variants:
FdBacking::File { client: FileClient, pos: u64 }– the seek position lives in the fd table, not the kernelFilecap (the schema-level read/write take an explicit offset).FdBacking::Directory { client: DirectoryClient, iter: ... }– iteration state forreaddir.FdBacking::Terminal { client: TerminalSessionClient }.
Route the existing read / write / close C entry points through
these variants. Add file-path-only C entry points (open, lseek,
stat, fstat, access, unlink, opendir, readdir,
closedir) in libcapos-posix/src/file.rs and
libcapos-posix/src/directory.rs.
Slice 4: path resolver over root Directory
Status: closed for the bootstrap root Directory shape. A read-only
absolute-path resolver in libcapos-posix/src/path.rs:
- Input: an absolute UTF-8 path and a bootstrap-granted root
Directorycap. - Walk
Directory.sub()for each prefix segment; mint a leafFile/Directorycap directly withDirectory.open()/Directory.sub(). - A v0 per-process cwd string landed (
libcapos-posix/src/cwd.rs,make run-posix-cwd):getcwd/chdirplus cwd-relative resolution foropen/opendir/stat/access/unlink/mkdir.chdirvalidates the target directory through the resolver, stores the normalized absolute string, and drops the cap; cwd inheritance across spawn is still deferred. No..collapsing – escape is prevented by the kernelDirectorycap’s lack of a parent edge, not a resolver clamp. - Returns typed
File/Directoryresult caps that flow into the fd-table backing.
The future Namespace.resolve + Store.get shape remains planned for a real
filesystem service; the v0 dash smoke uses the bootstrap-granted root
Directory, no Store / content-addressed hashes.
Slice 5: stdio over TerminalSession
Status: closed. posix_inherit_stdio() adopts a bootstrap-granted
TerminalSession cap as fds 0 / 1 / 2 (FdBacking::Terminal), with the
pipe-backed inheritance path retained for posix_spawn-driven pipeline
children; proven by make run-posix-stdio-terminal-smoke.
posix_inherit_stdio() already adopts pipe-backed fds 0 / 1 / 2 from
the recording-shim execve path. Extend it to also adopt a
bootstrap-granted TerminalSession cap as fd 0 / 1 / 2 when the
manifest supplies one (the posix-pipe pipeline children stay on the
existing pipe path). The shell binary calls posix_inherit_stdio()
once from main() before reading the heredoc.
Slice 6: env vector + getenv / setenv / putenv
Mirror the WASI host adapter’s wasiEnv :Text shape:
-
- Add a bounded
posixEnv :Text(or per-key `posixEnvEntries - List(Text)
) grant oninitConfig.initinschema/capos.capnp. This is the only P1.4 schema touch; queue on the shared schema serial surface perdocs/backlog/index.mdConcurrency Notes when selected. Regenerate the checked-in capnp bindings;make generated-code-check` must pass.
- Add a bounded
- Read the grant from the bootstrap CapSet at startup; populate a
per-process env vector in
libcapos-posix/src/env.rs. - C entry points:
getenv,setenv,putenv,unsetenv. LaunchParametersremains a follow-on for non-v0 callers.
Slice 7: printf / string subset
Status: closed. The focused C library subset now ships from
libcapos-posix, and make run-posix-printf proves formatted output plus
string/mem, numeric conversion, and ctype behavior from a live capOS C
process.
A focused C library subset shipped from libcapos-posix (not a full
libc, not a musl port):
stdio.hsubset:printf,fprintf(fd 1 / fd 2 only),vprintf,vfprintf,snprintf,vsnprintf,putchar,puts,fputs,fputc. Nofopen/FILE *– those route through the fd-table surface.string.hsubset:memcpy,memmove,memset,memcmp,strlen,strcmp,strncmp,strchr,strrchr,strcpy,strncpy,strcat,strncat,strdup.stdlib.hsubset:atoi,strtol,strtoul. Process termination still uses the existing libcapos_exitpath;exit/abortstay outside this focused printf/string slice.ctype.hsubset:isspace,isdigit,isalpha,isalnum,isupper,islower,tolower,toupper.
malloc / free / calloc / realloc already ship from libcapos.
Slice 8: signal stubs
Status: closed for the v0 dash-port surface. signal() and sigaction()
validate and store handlers in a per-process table, but handlers are never
delivered. kill() fails closed with EPERM because this POSIX layer has no
target ProcessHandle authority, and raise() fails closed with ENOSYS
because self-delivery is not implemented. make run-posix-signal-time proves
the documented behavior from a live capOS C process.
Header-and-stub-only signal, kill, sigaction, plus a TLS-stored
handler table that accepts handler registration but never delivers a
signal. dash registers a SIGCHLD handler at startup; the stub
records the handler pointer and returns 0. Documented out of scope:
real SIGCHLD / SIGTSTP delivery, job control, controlling
terminals.
Slice 9: time additions
Status: closed. time(2), nanosleep, and sleep reuse the existing
Timer cap path already used by clock_gettime / gettimeofday.
make run-posix-signal-time proves monotonic-since-boot time() output,
bounded nanosleep(), and one-second sleep() from a live capOS C process.
time(2), nanosleep, sleep over the existing Timer cap;
clock_gettime / gettimeofday already landed under P1.2 Phase B.
Slice 10: identity stubs
Status: closed for the ready-task surface (getpid, getuid, getgid) at
commit 1a8a9896 (2026-05-23 06:51 UTC). getpid returns the stable
capos-rt bootstrap pid for the current process, and the recording-shim child-pid
allocator avoids colliding with the caller’s pid. getuid / getgid return
the hardcoded single-identity uid/gid 0. The geteuid / getegid alias
follow-up is closed (task posix-geteuid-getegid): both delegate to
getuid / getgid since the effective ids equal the real ids under the v0
single-identity model, declared in unistd.h, and asserted by
run-posix-identity via the printed euid=0 egid=0 fields.
Slice 11: dash vendoring + Variant A patch
Status: closed (posix-p1-4-dash-vendor, 2026-05-24 19:40 UTC). dash
v0.5.13.4 is vendored mirror-as-is under vendor/dash/ (full upstream tree,
byte-identical) with vendor/dash/VENDORED_FROM.md. The Variant A fork-exec
patch set lives under vendor/dash/patches/ as two .patch files
(0001-execve-return-synthetic-pid.patch over src/exec.c/src/exec.h;
0002-vforkexec-adopt-synthetic-pid.patch over src/jobs.c), cumulative diff
45 changed lines (< 50). Design evidence only – nothing compiles or runs at
this slice; the C-build slice (posix-p1-4-c-multifile-build) and shell smoke
(posix-p1-4-dash-shell-smoke) prove the behavior end-to-end.
- Vendor dash 0.5.13.x under
vendor/dash/at a pinned tag, mirror-as-is. Addvendor/dash/VENDORED_FROM.mdrecording the upstream URL, commit, tag, and refresh procedure (mirror the existingvendor/dns-c-wahern/VENDORED_FROM.mdshape). - Apply the Variant A per-call-site patch: at each fork-exec site,
capture
execve()’s synthetic pid return value, bail on-1, and assign back tochild. Patches live undervendor/dash/patches/with one.patchper call site; the cumulative diff against upstream is < 50 lines. - Inter-call
dup2/closebetween fork and execve already records throughlibcapos-posixand needs no per-call patching. - Carried into Slices 12-13: patch
0001de-noreturnsshellexec(), so the two no-fork exec-replace callers (src/eval.cevalcommand()EV_EXITpath andexeccmd(), each with/* NOTREACHED */) now fall through under the recording shim. A single non-interactive command (dash -c '/bin/echo hi') takes theEV_EXITpath, notvforkexec(). Slice 12/13 must disable theEV_EXITin-place-exec optimization under the recording shim (fork-exec-then-exit) or add an exec-replace-then-exit patch before the binary runs. Details invendor/dash/VENDORED_FROM.md.
Slice 12: C-build pipeline for vendored multi-file C sources — CLOSED
The existing c-build helper compiles single-file demos/*/main.c
smokes against libcapos.a + libcapos_posix.a. dash is a
multi-translation-unit C codebase (main.c, eval.c, exec.c,
expand.c, input.c, jobs.c, mail.c, memalloc.c, miscbltin.c,
mystring.c, nodes.c, options.c, output.c, parser.c,
redir.c, show.c, trap.c, var.c, plus generated tables).
Closed by posix-p1-4-c-multifile-build: the Makefile gained the reusable
capos-c-multitu-elf define (instantiated via $(eval $(call ...))) that
- accepts a list of
.cfiles, - compiles each to an object with
clang --target=x86_64-unknown-none-elf -nostdlib -static -I libcapos/include -I libcapos-posix/include, - links the objects with
libcapos_posix.a+libcapos.a, - produces a userspace ELF without dragging in an external libc.
The proof demo demos/c-multifile/ (main.c + greet.c + greet.h) builds
through the rule; greet.c uses libcapos-posix strlen/memcpy, so the link
resolves symbols from both archives. make run-c-multifile boots the two-TU
ELF and asserts the greet=/checksum= line computed in the helper TU,
proving the cross-TU call executed. The rule is reusable for future C ports
(busybox utilities, dash).
Slice 12.5: dash build pipeline — LANDED
Status: landed (2026-05-26 05:11 UTC, task
posix-p1-4-dash-build-pipeline). The sysroot/libc precursor landed first
(2026-05-25 22:23 UTC, task libc-dash-sysroot-surface).
The build pipeline lives under vendor/dash/capos/ (outside the mirror-as-is
src/): config.h (pinned autotools config) and gen-tables.sh (stages a
patched source copy under target/dash/src and runs dash’s six host
generators into target/dash/gen). The Makefile dash target funnels
dash_CFILES + the five generated tables through capos-c-multitu-elf against
libcapos_posix.a + libcapos.a in the -nostdinc sysroot include mode,
producing target/dash/dash.elf (statically linked, 0 undefined symbols,
_start from capos-rt, the two Variant A fork-exec patches compiled in). A
clean tree (rm -rf target/dash && make dash) regenerates deterministically;
the mksignames signal-name table is the one host-<signal.h>-derived table
(cosmetic on capOS v0). Runtime behavior (including the EV_EXIT residual) is
the dependent posix-p1-4-dash-shell-smoke. Config derivation +
host-table caveat: vendor/dash/VENDORED_FROM.md.
Original precursor notes (posix-p1-4-dash-build-pipeline is now ready):
The build-pipeline mechanics were validated by a -nostdinc compile/link
probe over the full vendored dash TU set (branch
posix-p1-4-dash-build-pipeline, gitignored target/dash-probe/):
- A pinned capOS
config.h(SMALL=1, JOBS=0,HAVE_*mostly undefined,_PATH_*literals,PRIdMAX "lld",USE_TEE/USE_MEMFD_CREATE0) drives the preprocessor and gates. - All six generators run deterministically and emit the tables:
mktokens(token.h/token_vars.h),mksyntax(syntax.c/syntax.h, needstoken.hon its compile include path),mknodes(nodes.c/nodes.h),mksignames(signames.c),mkbuiltinsover a preprocessedbuiltins.def(builtins.c/builtins.h), andmkinitover the 27-file TU list (init.c).mkinitandmkbuiltinstake their inputs as separate arguments — mind shell word-splitting in the Makefile recipe.
The blocker was that dash includes bare POSIX headers (<unistd.h>,
<fcntl.h>, <signal.h>, …), which resolved to the host /usr/include under
the existing flags, plus a broad missing libc surface. libc-dash-sysroot-surface
closed both:
- Sysroot.
libcapos-posix/sysroot/include/holds bare-name headers (stdio.h,unistd.h,sys/types.h,termios.h,wchar.h, …) that forward to thecapos/posix/*source of truth. Consumed with-nostdinc- four
-isystemroots (clang freestanding builtins, the sysroot, and the two capOS namespaces) via the existingcapos-c-multitu-elfrule (CAPOS_C_SYSROOT_INCLUDEin the Makefile). The focused proof ismake run-c-libc-surface(qsort / strerror / umask / strtoll / strstr /S_IS*, all through bare includes).
- four
- Surface. The inventory plus several items the original table understated:
the C/POSIX-locale multibyte layer
expand.cneeds unconditionally (mbrtowc/mbrlen/mbsrtowcs/wcschr/wctype/iswctype+ theisw*family,<wchar.h>/<wctype.h>),strpbrk,lstat,getgroups,wait3,vfork,htonl/htons/ntohl/ntohs(used bybltin/printf.c), theS_IS*file-type macros,DT_LNK/…,environ, and thesys_siglistarray dash’s ownstrsignalreads.
A -nostdinc compile of the full vendored TU set (27 hand-written + 3
bltin + 5 generated, using the probe’s generated headers) against the real
sysroot now reports 0 errors, and a symbol audit shows 0 unresolved
libc symbols once the dash objects and the two capOS archives are combined
(evidence: ~/capos-evidence/libc-dash-sysroot-surface/).
Config.h the pipeline slice must pin (these are autotools/feature flags,
not libc surface — they belong to posix-p1-4-dash-build-pipeline): the probe
set already documented (SMALL=1, JOBS=0, _PATH_*, PRIdMAX "lld",
USE_TEE/USE_MEMFD_CREATE 0) plus HAVE_ALLOCA_H 1 (so <alloca.h> is
included; the sysroot provides it as a __builtin_alloca alias), HAVE_WAIT3 1
(dash’s #else branch is a non-compiling 4-arg waitpid; with the flag it
uses the wait3 symbol the surface provides), HAVE_ISALPHA 1 (capOS
<ctype.h> declares the classifiers, so dash uses them directly instead of its
_isXXX rename shims), and the stat64/lstat64/fstat64/open64/readdir64/
dirent64/glob64* → unsuffixed #define fallbacks. Do not define
HAVE_STRSIGNAL or HAVE_SYSCONF — dash provides those itself (and consumes
sys_siglist/its noreturn sysconf).
Slice 13: ls-shim + manifest + smoke harness
Status: closed (2026-05-27 09:36 UTC) by make run-posix-shell-smoke:
a real vendored dash boots as PID 1, reads the heredoc off its fd 0
TerminalSession, creates two entries in its bootstrap RAM root Directory
(> /alpha, > /beta), opens that directory as fd 3 (exec 3< /), dispatches
/ls-shim through fork/execve, prints done, and exits; ls-shim lists the
inherited directory (alpha, beta) over the shared terminal and both
processes exit cleanly. The earlier block (2026-05-27 00:46 UTC) was the
fd-inheritance premise conflict (vanilla dash forwarded no capability to
ls-shim); it was resolved by the posix-recording-shim-full-fd-inherit +
posix-terminal-session-forwardable + posix-open-directory-fd precursors, so
assembly needed only three minimal additions: a vendor/dash/patches/ runtime
bootstrap (synthesize argv[0] + posix_inherit_stdio(); the runtime entry
passes argv=NULL and wires no POSIX stdio), a basename map in the
recording-shim spawn (the kernel matches the manifest binary name, so
/ls-shim resolves to ls-shim; a no-op for the bare-name smokes), and the
ls-shim / manifest / harness assembly. No dash EV_EXIT patch was needed: the
heredoc never makes an external command the last command, so /ls-shim takes
vforkexec(). Full historical analysis: see the completed task record
docs/tasks/done/2026-05-27/posix-p1-4-dash-shell-smoke.md.
Finding 1 (vanilla dash forwards no cap) resolved 2026-05-27 by
posix-recording-shim-full-fd-inherit (done): the recording shim now inherits
the parent’s full live fd table by default (POSIX fork+execve), so dash’s
open stdio flows to ls-shim with no dup2, and a held read-only Directory
fd inherits as the child’s cwd source. Its kernel precursor
posix-terminal-session-forwardable (done) lets the terminal forward
non-destructively (Raw), so dash keeps its own terminal. Close-on-exec is
enforced and an aliased non-destructive backing Copy-shares; proof
make run-posix-fd-inherit-default. The remaining Slice 13 items are the
secondary gaps: posix-open-directory-fd (open(dir, O_RDONLY) ->
FdBacking::Directory, done 2026-05-27, proof make run-posix-open-dir-fd;
needed only if a N</ redirection is used – a dirfd(opendir()) forward also
works), the slash-bearing /ls-shim PATH-stat workaround, and the dash
EV_EXIT in-place exec-replace residual (see Slice 11).
demos/ls-shim/main.c: open a hardcoded in-rodata directory path, iterate withopendir/readdir/closedir, print each entry name, exit cleanly. This is the only allowed spawn target in the smoke.system-posix-shell.cue: a focused-proof manifest (own CUE package, importscapos.local/cue/defaults) grantingTerminalSession, a read-onlyDirectoryover an in-rodata pseudo- fs containing exactly the entries the heredoc references, aProcessSpawnernarrowed tols-shim, and aTimer.Makefilevendor-dash,libcapos-posix-shell,manifest-posix-shell.bin,capos-posix-shell.iso, andrun-posix-shell-smoketargets.tools/qemu-posix-shell-smoke.shhost harness: pipels; echo doneheredoc into fd 0, assertdone, two clean-exit log entries (shell +ls-shim), the scheduler halt line, and QEMU exit status 1 (isa-debug-exit).
Slice 14 (stretch): cat | grep pipeline — DONE (2026-05-27)
Drives cat foo | grep bar end-to-end through dash’s pipeline parser:
demos/cat-shim/main.c: writes an in-rodata three-line corpus to stdout (only the middle line containsbar).demos/grep-shim/main.c: reads stdin line by line, writes lines containingargv[1]to stdout. The initial Slice 14 proof bakedbaras a compile-time fallback because child argv did not yet cross the recording-shimexecveboundary; Slice 20 below now seeds grep-shim through the privateposix_argvpipe, and the fallback no longer matches the corpus. The shell smoke therefore fails if/grep-shim bardoes not deliverbaras argv.system-posix-shell.cue: both shims asProcessSpawnertargets.tools/qemu-posix-shell-smoke.sh: assertsmatch bar herereaches the terminal, the two non-matching corpus lines do not, and ≥4 clean child exits (dash + ls-shim + cat-shim + grep-shim).
This proves the P1.3 Pipe primitive end-to-end through dash’s own pipeline
parser, not just the recording-shim posix_spawn_file_actions path.
Reconciliation (posix-dash-pipeline-exec-reconcile, DONE 2026-05-27). The
premise conflict the blocked attempt found – a real cat foo | grep bar
page-faulted dash after the first element because evalpipe sets EV_EXIT
on every element and the recording-shim patch set had only reconciled
vforkexec – is resolved by dash patch
0004-pipeline-evexit-recording-shim.patch plus a libcapos-posix wildcard
reap:
0004(eval.c/jobs.c/jobs.h).evalcommand()’sEV_EXITin-placeshellexec()stashes the synthetic pid (capos_exec_pid) andbreaks instead of faulting throughcase CMDBUILTIN.evaltree()suppresses itsEV_EXITexraise(EXEND)while that pid is pending.evalpipe()’s child arm callsevaltree()(not the__noreturn__evaltreenr()) so it returns under the no-separate-address-space recording fork, then adopts the pid into the pipeline job via the now-exportedforkparent(), re-suppressing interrupts to balance the child-armINTON.forkshell()setsvforkedaroundfork()/forkchild()so the recording-shim “child” does notfreejob()the pipeline job. The cumulative dash patch budget was raised past 50 lines for this; seevendor/dash/VENDORED_FROM.mdfor the decision and the remaining residuals (the single trailing external command residual is closed by Slice 16 below, and theexec foobuiltin residual by Slice 17; compound pipeline elements remain unsupported under the recording shim).libcapos-posixwildcardwaitpid(-1)/wait3. dash reaps pipeline children withwait3->waitpid(-1); the v0 surface now reaps any tracked child (blocking) and honorsWNOHANGas “no child ready”, which is allwaitforjob->dowaitneeds.
Proof: make run-posix-shell-smoke (extended with the pipeline line).
Regression-clean: make run-posix-pipe-smoke, make run-posix-execve-inherit-smoke.
Slice 15: PID-1 argv channel (posixArgs) — DONE (2026-05-30)
Closes the “capOS delivers no argv” gap for the manifest-launched binary, without a schema or kernel change:
libcapos-posix/src/args.rs:posix_args(int *argc)returns a process-lifetime, NUL-terminatedchar **built frominitConfig.init.posixArgs(aCueValuetext list), read off the grantedbootBootPackage cap. It mirrorscapos-wasm/src/payload.rs::read_wasi_argsand reuses theposixEnvblob-streaming/CueValuehelpers (Slice 6, promoted topub(crate)). Bounded by 32 entries / 4096 bytes-per-entry / 8192 bytes-total, fail-open to empty argv on any malformed or absent grant. Delivery is opt-in: the Cmain(argc, argv)trampoline inlibcapos/src/entry.rsis untouched (still0, NULL), so bare-name demos are unaffected.- dash patch
0003: whenargv == 0, pullposix_args()and use it forprocargs()when non-empty, keeping the{"sh", 0}fallback. A manifest seedingposixArgs: ["dash", "-c", "CMD"]now reachesevalstring(minusc, ...), sodash -cis invokable. - Proof:
make run-posix-args-smoke(manifest seeds["posix-args-smoke", "alpha", "beta"]; the C process printsargc=3and eachargv[i]). Regression:make run-posix-shell-smoke(theargv==0fallback path is unchanged).
Follow-up closed by Slice 20 below: cross-execve argv inheritance to
recording-shim children, needed before a spawned grep-shim could receive
argv[1].
Slice 16: trailing top-level external command waits and exits (sh -c) — DONE (2026-05-30)
Closes the largest posix-dash-pipeline-exec-reconcile runtime residual: a
single trailing top-level external command (the sh -c 'cmd' shape), unblocked
by the Slice 15 argv channel.
- Premise correction. The blocked task assumed
0004’scapos_exec_pidguard left the trailing child orphaned-but-spawned. In fact dash’sEV_EXIToptimization execs in place without forking, and the recording shim only spawns from anexecve()inside afork()-opened record window — so an unforked top-level in-placeshellexec()returnsENOSYS(process.rs::execvewithrecording.active == false) and dash exits 126, never spawning the child. The fix is therefore to fork, not to wait on a pid that was never produced. - dash patch
0005-evexit-trailing-extcmd-wait.patch(eval.c): acapos_pipe_armcounter, raised byevalpipe()around its child-armevaltree()call, gates the in-placeEV_EXITshellexec()to pipeline arms only (capos_pipe_arm > 0, whereforkshell()already opened the window). A top-levelEV_EXITcommand (capos_pipe_arm == 0) takes the forkingvforkexec()path, whose existingwaitforjob()blocks for the child and returns its status;evaltree()thenexraise(EXEND)s and exits with it. Substantive change: four lines; the rest are explanatory comments. system-posix-shell.cue+tools/qemu-posix-shell-smoke.sh: the proof moves off the fd-0 heredoc onto a manifestposixArgsdash -cscript (granted thebootBootPackage cap). The script keeps the directory-setup, pipeline-parser (/cat-shim foo | /grep-shim bar), and successful-listing (/ls-shim 3< /) proofs, then ends with a trailing/ls-shimthat lacks fd 3 and exits 31. dash (the last process to exit, after waiting for that child) exits 31 — a value it has no other code path to produce, so it is a non-tautological wait-and-exit discriminator (an orphan-and-continue would have exited 0).- Proof:
make run-posix-shell-smoke. Regression-clean:make run-posix-args-smoke,make run-posix-pipe-smoke,make run-posix-execve-inherit-smoke.
Remaining residual at the time: the execcmd() / exec foo command
(process-image replace) path, closed by Slice 17 below. The later
cross-execve argv inheritance gap is closed by Slice 20 below.
Slice 17: exec foo command builtin forks, waits, and exits (execcmd) — DONE (2026-05-31)
Closes the last posix-dash-pipeline-exec-reconcile in-place-shellexec()
residual: the exec builtin’s command form (exec foo command), which must
replace the shell with foo and exit with foo’s status.
- Why
0005did not cover it.execcmd()(eval.c) runs as aCMDBUILTIN(EXECCMD) throughevalbltin(), which is dispatched fromevalcommand()’scase CMDBUILTIN— bypassing thedefault:-caseEV_EXITfork gate0005added. So0005’scapos_pipe_armgate never sees theexecbuiltin; its in-placeshellexec()->execve()has nofork()-opened record window and returnsENOSYS, andexeccmd()ignores it andreturn 0s, continuing the script. - dash patch
0006-execcmd-fork-replace-wait.patch(eval.c): theargc > 1form ofexeccmd()now forks viavforkexec(NULL, argv + 1, pathval(), 0)(which opens the record window),waitforjob()s for the replacement child, setssavestatusto its status, andexraise(EXEXIT)s — the exact shell-exit channel theexitbuiltin (exitcmd) uses, soEXITRESETcopiessavestatusintoexitstatusand the shell exits with the replacement command’s status.n == NULLis safe (forkchild()early-returns undervforked;forkparent()readsnonly whenjobctlis set, never in a non-interactivedash -c). The no-command form (exec 3< /,argc == 1) keeps itsreturn 0andpopredir()redirection-permanence path unchanged. Substantive change: five lines; the rest are explanatory comments. system-posix-shell.cue+tools/qemu-posix-shell-smoke.sh: the proof replaces the trailing bare/ls-shimwithexec /ls-shim(no fd 3, exits 31) followed by a poison tail> /gamma; /ls-shim 3< /. Correct exec-replace exits dash with 31 and the poison tail never runs (its[ls-shim] listed 3 entries/entry: gammamarkers are absent); a buggy ignore-and-continue would instead create/gamma, list 3 entries, and exit 0. The directory-setup, pipeline-parser, and successful-listing proofs are kept intact.- Proof:
make run-posix-shell-smoke. Regression-clean:make run-posix-args-smoke,make run-posix-pipe-smoke,make run-posix-execve-inherit-smoke.
Remaining residual closed by Slice 20 below: cross-execve argv inheritance.
Slice 18: read VAR builtin reads a line off fd 0 TerminalSession — DONE (2026-05-31)
Closes the one interactive-stdin path every prior P1.4 smoke skipped: dash’s
read builtin (miscbltin.c readcmd()) consuming a line off its fd 0
TerminalSession cooked-mode line discipline and binding it to a shell
variable. run-posix-shell-smoke drives a dash -c script and feeds no stdin
(its harness only sleeps); the fd-0 -> TerminalSession.readLine read path was
fully wired but never exercised through dash’s own read.
- No dash patch and no libcapos-posix change needed.
readcmd()reads via the bufferedpgetc()->preadbuffer()->preadfd()->read(0, buf, BUFSIZ)path.input_init()keys the buffering mode offtcgetattr(0), which libcapos-posix synthesizes as canonical (c_lflag & ICANON), sostdin_bufferable()is true and dash takes the plainread()branch (not the Linuxtee()/splicehistory path, which would otherwise return a non-EINVALerror and be misread as EOF). libcapos-posixread()overFdBacking::Terminalreturns exactly one line plus a synthesized\nper call — the canonical-tty contract dash expects — so the two sequentialreads each consume one line. system-posix-read-builtin.cue: a focused manifest grantingterminal(TerminalSession),timer, andboot(forposixArgs), with thedash -cscriptprintf 'rb-ready\n'; read NAME; printf 'got=[%s]\n' "$NAME"; read -r RAW; printf 'raw=[%s]\n' "$RAW".printf %s(notecho) echoes the bound values so the asserted bytes bypass echo’s backslash-escape interpretation.tools/qemu-posix-read-builtin-smoke.sh+run-posix-read-builtin: thedrivestep handshakes rather than blind-sleeps. The kernel line discipline has no inter-read input buffer (it consumes UART bytes only while areadLineis pending) and the UART carries no EOF, so a line fed before userspace is draining is lost and the blockedreadhangs to the QEMU timeout.drivetails the terminal-UART log: feed line 1 after therb-readybanner, feed line 2 after thegot=[echo (so the two distinct lines never collide in the single shared UART FIFO), then hold stdin open until theraw=[echo. A byte arriving just before itsreadLineis posted is still caught byhandle_read_line()’s synchronous FIFO drain, so the banner/echo gates are a sufficient ordering guarantee.- Proof:
make run-posix-read-builtin— observedgot=[hello world]and the byte-preservedraw=[raw\back\slash]on the terminal UART, dash exit code 0, scheduler halt. Non-tautological: the echoed values are the harness-fed fd-0 bytes, which the script has no other source for. Regression-clean:make run-posix-shell-smoke(theexec/pipeline/$?paths are untouched; it still feeds no stdin).
Remaining residual closed by Slice 20 below: cross-execve argv inheritance.
Slice 19: test/[ file-test builtin stats the root Directory — DONE (2026-05-31)
Closes the last unexercised reachable-cap dash builtin path: test -e/-f/-d/-r FILE (and [ ... ]), the single most common shell file-predicate, reaching
libcapos-posix stat/lstat over the bootstrap root Directory and
discriminating file vs directory vs absent. Every prior P1.4 smoke exercises
stdio, pipelines, exec-replace, argv, or interactive read, but none drives the
test/[ builtin against the filesystem.
- No dash patch and no libcapos-posix change needed. dash’s
src/bltin/test.cfilstat()callsstat64(nm, &s)/lstat64(nm, &s)(capos/config.hmapsstat64->stat,lstat64->lstat) and switches onS_ISREG/S_ISDIR/FILEXIST;testcmd()registers bothtestand[(src/builtins.def.in).HAVE_FACCESSATis unset incapos/config.h, so-r/-w/-xroute throughfilstat()->test_access()(a dash-internal check on the stat result, no extra libc call); under capOS’s single-identity euid=0test_access(R_OK)short-circuits to true on any successful stat (before thest_moderead), sor=yesproves-rreached a real stat of/alpha, distinct from the absent-path miss, not the mode-bit comparison itself. libcapos-posixstat()(src/file.rs) resolves the path against the bootstrap rootDirectory, fillingS_IFREG|0644for files (write_file_stat) orS_IFDIR|0755for directories (write_dir_stat, incl. the root viais_root_path);lstat()andaccess()are landed too, and theS_IS*macros ship fromlibc-dash-sysroot-surface. system-posix-test-builtin.cue: a focused manifest grantingterminal(TerminalSession),timer,boot(posixArgs), androot(source: {kernel: "directory"}). Noprocess_spawner—test/[are in-process builtins, no fork/exec; the> /alpharedirect is dash’s own open(O_CREAT) over the root Directory (already proven bysystem-posix-shell.cue). Thedash -cscript creates/alpha, then runs the six predicates withprintfmarkers, using the&& ... || ...form for-d /alphaand-e /nopeso the negative-branch markers prove real discrimination.tools/qemu-posix-test-builtin-smoke.sh+run-posix-test-builtin: a blind boot+sleep harness (dash feeds no fd 0, so no stdin handshake — same shape asrun-posix-shell-smoke). Theassertchecks all six markers, the absence of the true-branchd=alpha-dir(a blanket-truetestwould emit it), the clean exit, and the scheduler halt.- Proof:
make run-posix-test-builtin—e=yes,f=yes,d=alpha-notdir,root=dir,absent=yes,r=yeson the terminal UART, dash exit 0, scheduler halt. Non-tautological: each marker gates on a distinct stat result the script has no other source for; thed=alpha-notdir/absent=yeselse-branches and the absentd=alpha-dirprove file/dir/absent discrimination, not a blanket-true builtin. Regression-clean:make run-posix-shell-smoke.
Remaining residual closed by Slice 20 below: cross-execve argv inheritance.
Slice 20: recording-shim execve argv inheritance — DONE (2026-06-07)
Closes the remaining recording-shim child-argv gap without changing the
generated ProcessSpawner.spawn(name, binaryName, grants) surface:
libcapos-posix/src/process.rs:execve(path, argv, envp)snapshots the C argv vector before consuming the fork-recording window, rejects over-budget or malformed vectors before fd-action replay, writes a bounded binary argv record into a private kernelPipe, and grants only the read end to the child asposix_argv. The existing full-fd-table inheritance path is unchanged:stdio_<N>grants still carry inherited fd backings, and directposix_spawn()continues to accept but ignore argv/envp until a broader LaunchParameters design lands.libcapos-posix/src/args.rs:posix_args()first looks for theposix_argvpipe grant and decodes it into the same process-lifetimechar **store used by manifestposixArgs; when the grant is absent it falls back to the manifestbootBootPackage path. The recording-shim payload is capped by the existing 4 KiB Pipe transport, so it is narrower than the manifest 8 KiB-totalposixArgschannel but still uses the same 32-entry / bounded C-string shape.demos/posix-execve-inherit-*+tools/qemu-posix-execve-inherit-smoke.sh: the focused smoke now proves both sides: an over-budget argv vector is rejected withE2BIGbefore the recordeddup2mutates the parent fd table, and the successful child prints inheritedargv[0..2]before listing the inherited Directory entries.demos/grep-shim/main.c+run-posix-shell-smoke: grep-shim now callsposix_args()and usesargv[1]as the filter pattern. Its fallback does not match the corpus, so the existingcat foo | grep barshell proof now depends on/grep-shim barcrossing the recording-shimexecveboundary.
Proofs: cargo build --features qemu, make run-posix-execve-inherit-smoke,
make run-posix-shell-smoke.
Conflict Surface Coordination
P1.4 does not touch kernel/src/cap/, kernel/src/sched.rs, or any
device-driver foundation file. The schema half is limited to the
optional posixEnv bounded text grant on initConfig.init (Slice 6);
queue on the shared schema serial surface per docs/backlog/index.md
Concurrency Notes when that slice dispatches. Every other slice is
parallel-safe with the current selected milestone and DDF follow-up kernel
surfaces because it avoids the kernel-core device-driver files.
Out of Scope for P1.4
- Job control, real signal delivery, controlling terminals.
ulimit.- A userspace
Store/Namespaceservice over a real backing store – that remains the next Phase 3 item in the storage proposal and is not required for the v0 dash smoke. - Real filesystem persistence (block device, virtio-blk, FAT).
- A POSIX terminal line discipline owned by
libcapos-posix– cooked-mode line discipline still lives kernel-side until networking proposal Phase C. - Hosted C++. Tracked separately in
docs/proposals/userspace-binaries-proposal.md.
Success Criteria
make run-posix-shell-smokeexits cleanly under QEMU. A real dash runs a manifestdash -cscript that drives the directory listing, thecat | greppipeline, and anexec /ls-shimwhose status (31) dash replaces-and-waits with (the poison tail after it never runs); the harness asserts the listing, the pipeline filter, dash’s exec-replace-and-wait status, the absent poison-tail markers, the clean-exit children, and the scheduler halt line.- The vendored dash source under
vendor/dash/is mirror-as-is at a pinned tag with aVENDORED_FROM.mdand apatches/directory whose cumulative diff vs upstream is < 50 lines. libcapos-posixexposes the file / dir / stdio / env / printf / string / signal / time / identity surface listed above; the surface ships from headers underlibcapos-posix/include/capos/posix/with no dependency on an external libc.make workflow-check,make fmt-check,make generated-code-check,cargo test-config,cargo test-lib,cargo build-demos-capos,make capos-rt-check,make run-smoke,make run-c-hello,make run-posix-dns-smoke, andmake run-posix-pipe-smokeall remain green.- The proposal stamps the phase closeout with merge SHA and a minute-precision timestamp.
Go VirtualMemory Contract
Design slice for the review finding “Go-style VirtualMemory
reserve/commit/decommit semantics are missing.” This file does not change the
selected milestone; it records the contract that the Go/runtime allocator
implementation must satisfy.
Implementation status as of 2026-04-26 18:51 EEST: the kernel, schema,
generated bindings, capos-config, capos-rt, host tests, and QEMU proof
coverage implement this contract. The closure summary and verification gates
are recorded in the done task records and commit history.
Design Context
- Current manual pages:
Memory Management for the implemented
VirtualMemory/MemoryObjectbaseline and Userspace Runtime for the runtime client surface that allocator code uses. - Owning proposal: Go Runtime, especially the memory management syscall-equivalent section.
- Related policy proposals: OOM Handling and Swap for memory pressure outcomes and Resource Accounting and Quotas for separate virtual-reservation and physical-commit ledgers.
- Validation reference: Verification Workflow for the evidence expected before closing the review finding.
- Research grounding: LLVM Target for Go runtime OS hooks and Zircon for VMO/VMAR precedent.
Grounding
Project docs and code read for this slice:
docs/tasks/README.mddocs/roadmap.mddocs/proposals/go-runtime-proposal.mddocs/architecture/memory.mddocs/architecture/userspace-runtime.mddocs/proposals/oom-and-swap-proposal.mdschema/capos.capnpcapos-config/src/lib.rscapos-rt/src/client.rskernel/src/cap/virtual_memory.rskernel/src/mem/paging.rs
Relevant research files:
docs/research/llvm-target.mddocs/research/zircon.md
docs/research/llvm-target.md records that the Go runtime path depends on
mapping sysAlloc, sysReserve, sysMap, and sysUnused/madvise-like
behavior onto VirtualMemory. docs/research/zircon.md is relevant prior art
because Zircon separates virtual address regions from memory objects and names
commit/decommit as range operations on VMOs; capOS should keep the same
separation of virtual reservation authority from physical backing.
Current Gap
The current schema exposes only:
interface VirtualMemory {
map @0 (hint :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
unmap @1 (addr :UInt64, size :UInt64) -> ();
protect @2 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
}
The current implementation allocates zeroed physical frames during map,
records ownership per committed anonymous page, charges the caller’s
frame_grant_pages ledger immediately, rejects non-readable protection, and
frees frames during unmap. This is a useful baseline, but it is not the
contract Go expects:
sysReserveneeds address-space reservation without physical frames.sysMap/sysUsedneed explicit physical commit inside a prior reservation.sysUnusedneeds decommit that releases frames while preserving the virtual reservation.sysFreeneeds unmap-style reservation release so returned arenas do not leak virtual quota or address-space ranges.- Stack and arena guard pages need
PROT_NONEsemantics that reliably fault or fail validation without implying the reservation is gone. - Virtual reservation pressure and physical commit pressure need separate quotas and separate auditability.
Contract
The future schema should preserve existing method ids and add explicit reservation operations:
interface VirtualMemory {
map @0 (hint :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
unmap @1 (addr :UInt64, size :UInt64) -> ();
protect @2 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
reserve @3 (hint :UInt64, size :UInt64) -> (addr :UInt64);
commit @4 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
decommit @5 (addr :UInt64, size :UInt64) -> ();
}
Protection constants become:
#![allow(unused)]
fn main() {
pub const VM_PROT_NONE: u32 = 0x0;
pub const VM_PROT_READ: u32 = 0x1;
pub const VM_PROT_WRITE: u32 = 0x2;
pub const VM_PROT_EXEC: u32 = 0x4;
}
VM_PROT_NONE is the only valid zero-bit protection value. Unknown bits are
rejected. Any non-NONE protection must include VM_PROT_READ; write-only or
execute-only user mappings are rejected rather than silently upgraded, because
x86_64 cannot represent a present user page that lacks read access. Writable
and executable mappings remain rejected. VM_PROT_NONE must be represented by
ledger state plus a non-present user PTE, not by relying on hardware “no read”
permission.
map remains as a compatibility/convenience operation equivalent to
reserve(hint, size) followed by commit(addr, size, prot), with atomic
rollback if the commit or result serialization fails. Existing runtime clients
can keep using it until the Go allocator switches to the explicit reserve path.
Semantics
All sizes are rounded up to 4 KiB pages after rejecting zero-size ranges and
overflow. All non-zero addresses must be page-aligned, and the entire rounded
range [addr, addr + size) must fit at or below USER_ADDR_LIMIT without
overflow. Ranges overlapping the capability ring or CapSet page remain invalid
for reserve, map, commit, decommit, unmap, and protect.
reserve(hint, size):
- Reserves a contiguous virtual range in the caller’s address space.
- Allocates no physical frames and installs no user-accessible PTEs.
- Charges only the virtual reservation ledger.
- With
hint == 0, chooses a free range in the user address space. - With
hint != 0, acts as fixed no-replace placement: overlap with any live reservation, committed page, object mapping, ring page, or CapSet page fails. - Returns the base address of the reservation.
commit(addr, size, prot):
- Requires the whole range to lie inside existing anonymous reservations owned
by the same address-space-bound
VirtualMemorycap. - Requires every page in the range to be currently uncommitted.
- Allocates zeroed physical frames, charges the physical commit ledger, and records the committed state per page.
- Installs present user PTEs for readable/writable/executable protections.
- For
VM_PROT_NONE, allocates and charges frames but leaves user PTEs non-present while retaining the frames in the reservation ledger. - This is for committed inaccessible memory whose contents must survive a later protection restore. Pure stack or arena guard pages should stay reserved but uncommitted so they consume virtual quota without consuming physical commit budget.
- Is all-or-nothing: allocation, page-table updates, ledger charge, and TLB completion reservation must either all become visible or all roll back.
decommit(addr, size):
- Requires the whole range to lie inside existing anonymous reservations owned by the same cap.
- Allows committed and already-uncommitted pages in the range.
- Removes any present PTEs, releases frames for committed pages, drops physical commit charges, and preserves the virtual reservation.
- Leaves every page in the range in the uncommitted reserved state.
- Must perform the same local flush and remote shootdown discipline as unmap and protect before a frame can return to the allocator.
protect(addr, size, prot):
- Requires the whole range to be committed anonymous pages owned by the same cap. It does not commit uncommitted reserved pages.
- May set
VM_PROT_NONE; the kernel keeps the committed frames charged and associated with the pages, removes present user PTEs, and denies user access until a laterprotectrestores readable permissions ordecommitreleases the frames. - Preserves existing zeroed/data contents when moving between
VM_PROT_NONEand accessible protections. - Keeps W^X enforcement and rejects unknown bits.
unmap(addr, size):
- Releases the reservation for the whole range.
- Frees committed frames and physical commit charges for committed pages.
- Releases virtual reservation charges for every page.
- Fails if the range is not wholly covered by anonymous reservations owned by the same cap.
Page faults and validation:
- Access to an unreserved page is an ordinary unmapped access.
- Access to a reserved uncommitted page is a reservation fault. The initial Go contract should fail closed; demand commit on fault is a later policy choice, not implicit behavior in this slice.
- Access to a committed
VM_PROT_NONEpage is a protection fault and must not release the reservation or physical frame. - A pure guard page is a reserved uncommitted page, not a committed
VM_PROT_NONEpage, unless the runtime deliberately needs hidden retained contents. - Kernel user-buffer validation and copy helpers must treat reserved
uncommitted pages and committed
VM_PROT_NONEpages as inaccessible.
Ledgers
The implementation needs two ledgers of record:
- Virtual reservation pages: charged by
reserve, released byunmap, and unchanged bycommit,decommit, orprotect. Compatibilitymapcharges this ledger because it creates an implicit reservation. - Physical commit pages: charged by
commitormap, released bydecommitorunmap, and unchanged byprotect.
The current ResourceLedger::frame_grant_pages can continue to represent
physical commit pressure if the implementation gives anonymous committed pages,
held MemoryObject caps, and borrowed object mappings one shared physical-page
budget. Virtual reservations need a separate process-owned quota; do not hide
virtual reservation pages in the physical frame ledger.
Address-space ownership tracking must become reservation-based instead of a
flat list of committed anonymous pages. The reservation ledger must be sparse:
Go-scale reservations can be terabytes, so reserve must not allocate one
metadata entry per reserved page. A minimal host-testable model should track
non-overlapping reservation intervals and sparse committed state inside those
intervals, such as committed subranges or a committed-page map keyed only by
pages that currently hold frames.
ReservedCommitted { frame, prot }
MemoryObject borrowed mappings stay outside anonymous reservations for this
slice. Any future design that allows object mappings inside sub-reservations
must explicitly define ownership and teardown interaction.
Implementation Gates
- Add
VM_PROT_NONE,reserve,commit, anddecommitto schema, generated bindings,capos-config, andcapos-rtclients while preserving current method ids. - Replace committed-page-only anonymous ownership tracking with a sparse
reservation ledger that can represent large uncommitted intervals plus
committed accessible and committed
VM_PROT_NONEpages without allocating per-page metadata for every reserved page. - Add a virtual reservation quota separate from the physical frame-grant ledger and make quota errors distinguish virtual exhaustion from physical commit exhaustion.
- Rework
VirtualMemoryCapmap/unmap/protect around the reservation ledger, including rollback paths, TLB shootdown completion, and process-exit cleanup. - Keep ring and CapSet virtual pages reserved outside caller control.
- Update
capos-rtso allocator paths can use caller-owned scratch buffers for reserve, commit, decommit, protect, and unmap without allocating during heap growth. - Add host tests for overlap rejection, fixed no-replace hints, partial
decommit, recommit zero-fill,
VM_PROT_NONEprotect/restore, quota accounting, rollback, and process teardown. - Add QEMU proof coverage before closing the review finding:
reserve-without-frame-commit, commit and write, protect to
VM_PROT_NONE, restore and preserve contents, decommit and recommit zero-fill, unmap reservation release, virtual-quota exhaustion, and physical-commit quota release after decommit.
Non-Goals
- Demand paging on first access.
- Swap or overcommit policy.
- File-backed mappings.
- Copy-on-write snapshots.
- Hierarchical VMAR/sub-address-space capabilities.
- Sharing anonymous reservations across processes.
Those can build on the reservation ledger later, but the Go allocator contract must not depend on them.
Memory Authority Model Backlog
This backlog turns
Memory Authority Model into
reviewable work. It does not replace the selected milestone in
docs/tasks/state.toml. Use it when a task touches memory authority,
VirtualMemory, MemoryObject, SharedBuffer, pins, DMA, swap, OOM, or
page-table mutation semantics.
Grounding
Project files read while creating this backlog:
docs/architecture/memory.mddocs/backlog/go-virtual-memory-contract.mddocs/proposals/oom-and-swap-proposal.mddocs/proposals/resource-accounting-proposal.mddocs/dma-isolation-design.mddocs/architecture/park.mddocs/architecture/scheduling.mddocs/architecture/userspace-runtime.mddocs/proposals/storage-and-naming-proposal.mddocs/proposals/go-runtime-proposal.mddocs/security/verification-workflow.mddocs/research/capability-systems-survey.mdREVIEW.mddocs/tasks/README.md
Relevant research grounding:
docs/research/zircon.mddocs/research/genode.mddocs/research/sel4.mddocs/research/eros-capros-coyotos.mddocs/research/llvm-target.md
Validation Expectations
- For docs-only slices, run a documentation build or the narrowest available link/check command; QEMU is not required unless behavior changes.
- For implementation slices, add host tests, Kani, QEMU, or targeted instrumentation according to the proof table in the proposal.
- Behavior changes should record concrete design grounding and verification evidence in the changed proposal, backlog, review note, or workplan entry.
Slice A: Memory-State Inventory
Goal: make current memory transitions auditable before changing behavior.
- Inventory anonymous VM operations in
kernel/src/cap/virtual_memory.rsandkernel/src/mem/paging.rs: reserve, commit, protect, decommit, unmap, address-space drop, and rollback. - Inventory
MemoryObjectoperations inkernel/src/cap/frame_alloc.rs: allocation, result-cap publication, map, unmap, protect, cap release, borrowed mapping teardown, and result serialization rollback. - Inventory page-table mutation and TLB shootdown paths in
kernel/src/mem/paging.rs,kernel/src/arch/, and scheduler residency tracking. - Inventory user-buffer validation/copy/read paths and classify which ones already hold the address-space stability guarantee.
- Inventory ParkSpace cleanup interactions with
VirtualMemory.unmap,VirtualMemory.decommit,MemoryObject.unmap, process exit, and future shared waiters. - Record a compact state-transition table in
docs/architecture/memory.mdor a follow-up design note.
Exit criteria:
- The inventory names every current state transition, authority object, ledger, lock, and cleanup path relevant to user memory.
- Any missing proof becomes a concrete backlog item rather than a vague TODO.
Slice B: Host-Testable VM Ownership Model
Goal: move the parts of memory ownership that are pure logic into stronger host-test coverage where practical.
- Decide whether sparse anonymous reservation interval logic should live in
capos-libor stay kernel-local with mirrored tests. - Add tests for fixed no-replace hints, overlap rejection, middle reservation split, tail split, adjacent split behavior, and full-range release.
- Add tests for committed-page bookkeeping under partial decommit,
VM_PROT_NONEprotect/restore, and recommit zero-fill assumptions. - Add tests for borrowed mapping provenance: anonymous reservations and object-backed mappings must not overlap, and object-specific unmap must reject a different backing object.
- Add ledger tests that virtual reservation, physical commit, held object backing, and borrowed mapping charges release exactly once on success, error, rollback, and process exit.
Exit criteria:
- Pure memory ownership rules are tested without QEMU when they do not need hardware page tables.
- Any remaining kernel-only rule is documented with the reason it cannot be moved into host-testable logic.
Slice C: Shared Mapping Identity and Pins
Goal: unblock future shared park words and real SharedBuffer APIs without
using raw virtual addresses as authority.
- Define the mapping identity record for
MemoryObject-backed user pages: object id, object generation or backing epoch, page offset, mapping generation, address-space id, and address-space generation. - Decide whether shared waiters need explicit object pins, mapping pins, or a validation/use critical section around key derivation and wait registration.
- Define how object pins are charged, released, and revoked, and which ledger owns the pin count or pinned page count.
- Extend ParkSpace design only after shared key derivation can prove object identity and stale mappings cannot wake new owners.
- Define service-owned
SharedBuffermetadata for producer/consumer rings, notification, bounds, and role-specific permissions before file/network APIs consume it.
Exit criteria:
- Shared wait/wake and service-owned shared buffers have an object-identity rule that survives unmap, remap, transfer, release, and reuse.
- Reviewers can reject any future shared-memory API that relies only on a raw user virtual address.
Slice D: TLB and Frame-Reuse Proof
Goal: make stale CPU observers part of the proof, not a local implementation assumption.
- Identify all paths that remove or weaken PTEs and later free or reuse frames.
- Add targeted counters or QEMU diagnostics showing local flush and remote generation completion before frame return on an address space resident on multiple CPUs.
- Exercise
VirtualMemory.decommit,VirtualMemory.unmap,VirtualMemory.protect,MemoryObject.unmap, process exit, and failed rollback under SMP where possible. - Record which paths only need local flush because the address space cannot be resident remotely.
- Cover huge-page (1 GiB / 2 MiB) frame teardown when huge mappings are
eventually introduced. Today the
Drop for AddressSpacewalk inkernel/src/mem/paging.rs(huge-page branches at lines 450 and 462) skipsHUGE_PAGEPTEs with aTODOpass-through, so once huge pages are mapped into a user address space the backing 1 GiB / 2 MiB frames would leak on process exit. The work is blocked until huge-page support is added but must be filed against any branch that introduces huge user mappings.
Exit criteria:
- A branch that changes page-table mutation can cite a proof that frames are not reused while stale TLB entries can still access them.
Slice E: OOM Boundary Normalization
Goal: make memory failures distinguish validation, quota, global pressure, and fatal execution failure.
- Audit
VirtualMemory,MemoryObject,FrameAllocator, andProcessSpawnerallocation failures for inconsistentfailedvsoverloadedbehavior. - Define typed result or exception mapping for virtual quota exhaustion, physical commit exhaustion, global frame pressure, and result-cap publication failure.
- Add hostile exhaustion tests for each allocation boundary that can be reached by an untrusted process.
- Add process-exit status design for future OOM page-fault termination.
Exit criteria:
- Capability calls return predictable typed memory failures, and execution faults have an explicit lifecycle path rather than generic panic text.
Slice F: DMA and Swap Preconditions
Goal: keep later device and swap work blocked on the memory model pieces they actually require.
- Before userspace DMA drivers, implement or prove device-owner states, generation-checked handles, stale interrupt/completion handling, resident unswappable DMA pages, and scrub-before-reuse.
- Before swap, define page eligibility bits, slot metadata, encrypted and authenticated page storage, per-boot keying, and faulting-process termination on restore failure.
- Keep
MemoryObject, shared IPC pages, ring/CapSet pages, secret pages, and DMA pages out of phase-1 swap unless a later proposal explicitly changes the model and adds proofs.
Exit criteria:
- DMA and swap implementation branches have explicit prerequisite checklists and cannot merge by relying on generic frame ownership alone.
Session-Bound Invocation Context
Selected milestone backlog for replacing caller-selected endpoint identity without continuing the Service Object Identity Migration.
The detailed design lives in Session-Bound Invocation Context.
Design Target
The final model has one live session context per process:
Process.session_contextis immutable after spawn.- Endpoint calls deliver privacy-preserving caller session metadata to the server. Subject details are not disclosed unless the caller explicitly asks for disclosure through the service call and a broker/service disclosure scope allows the requested fields.
- Broker-granted capabilities decide which service roots/facets a process may invoke.
- Services key user-facing state by caller session plus service-local records.
- Request payload fields are data and cannot select authority.
- Cross-session raw transfer is governed by cap transfer scope:
same_session,cross_session_shareable, orservice_regrant_only. If a cap crosses sessions, the receiver session supplies the future invocation subject context.
The existing service-object routing proof remains historical coverage for receiver-cookie spoofing, lifecycle, and transfer behavior. It is not the application authority model.
Gate 1: Process Session Invariant
Visible proof: a focused QEMU session/process smoke shows a spawned shell and child process have exactly one immutable session context, inherited by default, while an attempt to inject or use a second independent invocation subject fails.
Implementation scope:
- Add process-owned session context metadata with explicit system/service session support.
- Make
ProcessSpawnerselect the child session context through inherit or a trusted broker/session-manager path. - Prevent ordinary processes from holding or using multiple independent
UserSessionvalues as ambient invocation subjects. - Keep
SessionContextinternal to process/session mechanics. Do not expose principal, profile, account, role, tenant, auth-factor, external-claim, or display fields through endpoint defaults or proof-only shortcuts. - Add host tests for spawn/session validation and QEMU proof output for child inheritance.
- Hostile proof cases must show that copied
UserSessioncaps, payload data, shell strings, and manifest grant data cannot install a second process session or select another child session outside the trusted broker/session path. - Define the fail-closed freshness rule used by later endpoint work: normal endpoint calls from dead, revoked, or stale workload sessions fail except explicit recovery, logout, or renewal caps.
- Preserve existing anonymous/operator shell behavior while making guest shell behavior explicitly manifest-gated and narrow.
Verification gate:
make fmt-checkcargo test-config- relevant host tests for session metadata
- focused QEMU process/session proof
- one existing login or shell proof touched by the session path
Status 2026-04-28 17:01 UTC: the kernel now gives each process an immutable
SessionContext, ProcessSpawner inherits the caller context by default, and
trusted broker/session paths can mint launchers fixed to a validated child
context. make run-session-context proves a copied UserSession cap cannot
relabel the child invocation context and that a broker profile mismatch fails
closed. It also proves an expired guest session cannot refresh a broker shell
bundle. Endpoint-delivered caller-session metadata, payload spoofing, and
field-granular disclosure remain Gate 2 work.
Gate 2: Endpoint Caller Session Metadata And Disclosure
Visible proof: an endpoint server receives only an opaque service-scoped caller
session reference by default, rejects payload attempts to spoof user,
session, role, or participant identity, and receives bounded subject
details only when the caller explicitly requests disclosure and a
broker/service disclosure scope permits the requested fields.
Implementation scope:
- Extend endpoint delivery metadata with an opaque service-scoped caller session reference and minimal freshness/liveness information.
- Add an explicit disclosure mechanism for bounded subject fields, such as a
per-call disclosure flag or a
SessionDisclosurecap, and require a matching broker/service disclosure scope before fields are delivered. - Decide the first freshness enforcement point needed to close the open session expiry review finding.
- Keep endpoint receiver metadata internal and non-authority-bearing.
- Add hostile endpoint tests proving request bytes cannot override caller session context or force subject disclosure.
- Add transfer-scope tests proving a
same_sessioncap cannot cross into another session, while across_session_shareablecap invokes under the receiver’s session context after transfer. - Default transfer scope is fail-closed for cross-session movement:
user/session-local caps use
same_sessionorservice_regrant_only, whilecross_session_shareablemust be explicitly chosen by the service or broker. - Add expiry/revocation cases before shared-service migration: broker refuses fresh bundles for stale sessions, stale normal endpoint invocations fail or report the documented freshness failure, and service-scoped session refs cannot be replayed as authority.
Verification gate:
make fmt-checkcargo test-libcargo test-config- focused endpoint/session QEMU proof
make run-spawn
Status 2026-04-28 17:43 UTC: commit 687511a implements the first Gate 2
slice. Endpoint delivery includes only a service-scoped opaque caller-session
reference, epoch, and live/stale flags by default; it does not expose principal,
profile, account, role, tenant, auth-factor, external-claim, display-name, or
source-network fields. Normal endpoint calls from stale process sessions fail
closed before transfer preparation or enqueue. make run-session-context
proves a live child endpoint call carries nonzero opaque metadata despite
spoofed user, session, and role payload labels, then proves the same
child cannot invoke the endpoint after its session expires.
Status 2026-04-28 18:38 UTC: commit f0cb74b implements Gate 2 transfer-scope
enforcement. Cap holds now distinguish same_session,
cross_session_shareable, and service_regrant_only transfer policy.
Endpoint IPC, endpoint returns, and spawn grants reject cross-session movement
unless the scope permits it; fixed-session broker/launcher paths can regrant
service_regrant_only caps. make run-session-context proves same-session
spawn denial, raw IPC denial for a service-regrant-only UserSession, and
receiver-session invocation after an allowed endpoint-cap transfer. Remaining
Gate 2 work at that checkpoint was the explicit field-granular disclosure
mechanism.
Status 2026-04-28 19:33 UTC: commit 0f92d77 completes the Gate 2 explicit
disclosure mechanism. CALL SQEs carry a field-granular disclosure request
mask, capability holds carry service/broker disclosure scope, and endpoint
delivery exposes only the requested-and-allowed subject fields. The focused
QEMU proof covers all three privacy cases: request without scope exposes no
fields, scope without request exposes no fields, and request plus matching
scope exposes only allowed fields while narrowing broader requests. Gate 3 is
the chat session-keyed migration.
Gate 3: Chat Session-Keyed Migration
Visible proof: make run-chat shows chat membership keyed by an opaque
service-scoped caller session reference and broker-granted chat capability,
with no user-facing badge or receiver selector. Payload identity spoofing,
unauthorized subject disclosure, and unauthorized cross-session participant id
reuse fail closed.
Implementation scope:
- Replace legacy chat receiver-selected member identity with session-keyed records.
- Treat
ChatRootpossession plus caller session context as sufficient for join, subject to broker/profile policy. - Keep global principal/account metadata private by default. If chat needs display name or guest/operator class, obtain it through explicit disclosure with a matching disclosure scope. If it only needs narrower behavior, use a broker-granted chat facet that encodes policy without revealing subject fields.
- Add a narrower moderator facet if moderator behavior is needed; do not use payload roles or generic rights bits.
- If chat supports multiple participant records per session, make returned participant ids server data scoped to the caller session, not transferable authority.
- Decide chat cap transfer scope explicitly. Plain
ChatRootmay be same-session or broker-shareable; participant-like state must not raw-transfer across sessions unless chat defines a share/regrant method. If chat accepts a share, future calls use the receiver session as the invocation subject. - Update shell examples and chat docs.
Verification gate:
make fmt-checkcargo test-configcargo test-libmake run-chat- hostile chat spoofing QEMU coverage
Status 2026-04-28 20:06 UTC: commit dc7ece4 implements the Gate 3 chat
session-keyed migration. chat-server now serves with endpoint caller
metadata, derives an opaque live caller-session key, and uses that key for
member records, channel membership, sends, leaves, and polls. Calls without a
live session key fail closed. system-chat.cue no longer assigns static chat
badges to the shell or bot, and make run-chat proves normal chat runs through
operator-session chat-client processes while the attempted delegated endpoint
relabel remains rejected. The handle join field is request data only, not
membership authority. After review, chat-visible sender labels are also
service-assigned member-N values, so request handles do not drive displayed
sender identity.
Gate 4: Shared-Service And Legacy Cleanup
Visible proof: normal shared-service demos no longer expose caller-selected service-visible identity, and service-object identity planning is retired from the active path.
Implementation scope:
- Applied the session-keyed model to shared service state and terminal/stdio
bridges that previously depended on legacy receiver metadata as identity.
Aurelian ordinary player state is keyed by live endpoint caller-session
metadata, and the focused adventure manifest grants NPC/chat authority
through service or manifest capabilities without caller-chosen selectors.
- Stdio bridges bind parent-side servicing to opaque live endpoint caller-session metadata and reject a bridge that later changes caller session, without asking the child to disclose global subject fields.
- Remove normal shell and manifest syntax that lets a caller select a badge or receiver selector.
- Keep low-level receiver metadata only as internal endpoint transport state or hostile-test fixture.
- Update
docs/capability-model.md,docs/architecture/ipc-endpoints.md,docs/security/trust-boundaries.md, demos, and status pages.
Verification gate:
make fmt-checkcargo test-libcargo test-configmake run-smokemake run-spawnmake run-chatmake run-adventuremake docs
Status 2026-04-28 20:48 UTC: the guest-bundle cleanup slice narrows one Gate 4
identity/policy leak without touching adventure content. SessionManager.guest
now requires an explicit manifest guest seed, AuthorityBroker.shellBundle
returns no default guest service endpoints, and guest launchers use a
resource-profile launcherProfile instead of the full manifest binary list.
The default guest profile has an empty launcher; the session-context proof uses
a dedicated one-binary guest profile for session-context-child; and the
default smoke proof covers manifest-without-guest-seed denial.
Status 2026-04-28 21:36 UTC: the session-expiry review finding is closed for
current shell/broker authority. Endpoint CALLs already required live caller
sessions. Retained broker-issued non-endpoint bundle caps now expire at their
bound session boundary: RestrictedLauncher rejects spawn/list calls after the
minted session expires, and broker-issued SystemInfo caps are session-bound
wrappers. Raw process-spawner, capability-manager, and process-handle control
caps opt into live caller-session dispatch for any path that still exposes
them. make run-local-users proves an expired operator shell cannot keep
launcher authority through an already-issued bundle, and make run-session-context proves the narrow guest proof launcher also fails closed
after expiry.
Status 2026-05-01 08:47 UTC: default password-authenticated local operator sessions no longer use fixed wall-clock expiry. The expiry enforcement proof still exists through manifests that set a non-default operator lifetime, and guest/anonymous/focused proof sessions remain short-lived.
Status 2026-04-28 22:02 UTC: the normal shell parser now rejects explicit
client @... badge N grants and preserves delegated client endpoint identity
when badge syntax is omitted. Default MOTD and adventure docs use omitted-badge
launches, while hostile selector fixtures remain in low-level smoke coverage.
Status 2026-04-29 05:59 UTC: the focused chat manifest now routes the same
kernel singleton chat_endpoint through init to the resident chat server that
the broker facets into operator shell bundles. The focused chat shell no longer
receives the resident chat-server export directly from system-chat.cue; the
normal shell path uses the broker-issued operator bundle chat endpoint, while
the resident bot keeps its manifest service grant.
Status 2026-04-29 06:17 UTC (the socket-backed SocketTerminalSession and
TcpSocket.intoTerminalSession were later retired with the kernel socket
owner, 2026-06-10; the UART-backed gate remains): terminal output is now
behind the same live caller
session dispatch gate as terminal input. Both UART-backed TerminalSession
and socket-backed SocketTerminalSession require a live caller session for
write, writeLine, and readLine, so stale shell sessions cannot keep a
terminal bridge useful through write-only calls. TcpSocket.intoTerminalSession
continues to return a move-only terminal cap, but the result hold is explicitly
cross-session shareable because the Telnet gateway converts an accepted socket
in its service session and then grants the terminal to the broker-minted shell
session.
Status 2026-04-29 09:00 UTC: shell-serviced stdio bridge waits now bind to
opaque live endpoint caller-session metadata during the active child wait and
reject mismatched callers without asking the child to disclose global subject
fields. Normal StdIO.close exits cleanly, rejected calls drain transferred
caps before returning, and make run-session-context covers a transfer-bearing
cross-session rejection. demos/service-common no longer exposes a
badge-serving helper or badge field on EndpointCaller; new shared endpoint
loop code uses EndpointUserData, with the old badge-named user-data alias
kept only as a source-compatible alias after checked-in adventure code moved
onto caller-session metadata.
Status 2026-04-29 09:44 UTC: the non-adventure endpoint caller-session
reference is widened to 128 bits while keeping scoped_ref as the low 64-bit
compatibility half and adding scoped_ref_hi as the high half. Endpoint
delivery fills both halves from independently domain-separated, nonzero hashes,
keeps epoch separate, and non-adventure service/session-context/stdout bridge
guards now require and compare both halves. This remains proof-grade opaque
reference derivation; a true keyed secret, scope-key rotation, and rotation
lifetime policy are still deferred.
Status 2026-04-29 10:20 UTC: endpoint caller-session references now use an
entropy-backed boot secret and HMAC-SHA256 over a non-reused endpoint
service-scope id plus the kernel session id. scoped_ref remains the low ABI
field, but it is no longer value-compatible with the old unkeyed low-half hash;
scoped_ref_hi is the high ABI field of the same keyed opaque reference.
epoch stays a separate field and is also domain-separated under the boot key
so stale/freshness audit correlation rotates with boot-key and endpoint-scope
changes. References rotate on reboot and endpoint object replacement. Stable
service-audit identity across service upgrades remains future work.
Status 2026-04-29 11:00 UTC: the session-context QEMU proof now calls two distinct endpoint service scopes from one child process/session before expiry and asserts their opaque caller-session reference tuples differ while both remain live. This covers endpoint-object replacement/scope changes at the demo-proof level; stable service-audit identity across upgrades remains future work.
Status 2026-04-29 20:33 UTC: the session-bound proposal and shared-service backlog distinguished landed Aurelian ordinary player-state migration from the then-remaining adventure NPC/service-authority cleanup. The server keys ordinary player records by live endpoint caller-session metadata.
Status 2026-04-29 21:40 UTC: Gate 4 implementation and verification are closed
for mainline. make fmt-check, cargo test-lib, cargo test-config,
make run-smoke, make run-spawn, make run-chat, make run-adventure,
focused make docs, and git diff --check passed after commit faeff80
hardened the docs PDF render path for automated builds. The
focused adventure manifest check rejects legacy badge: selectors, and
make run-adventure covers selector-free Adventure/chat service grants plus
the resident scenario test. The follow-up paper/status alignment records this
as landed C1 evidence in docs/paper/evidence-gaps.md,
docs/paper/plan.md, and papers/schema-as-abi/main.typ.
Follow-Up: Session Lifecycle, Logout, And Renewal
The completed milestone closed the stale-session authority hole for current shell and endpoint paths, but it did not make short fixed wall-clock expiry a complete interactive session UX. Follow-up work belongs with identity, runtime-network-shell, and local-user management rather than reopening this completed milestone.
Target:
- Keep one immutable
SessionContextper process. - Add a trusted mutable session liveness cell keyed by session id/epoch with
states
live,logged_out,revoked,expired, andrecovery_only. - Move liveness checks from timestamp-only immutable metadata toward session-manager state that can be logged out, administratively revoked, expired, or renewed without relabeling a running process.
- Implement
UserSession.logoutand make owner-shell exit / gateway disconnect close the sessions they own. - Add a narrow
SessionManager.renewor broker refresh path that is allowed only for explicit renewal/recovery methods after expiry. - Make renewal mint fresh grant leases or wrapper caps when policy needs a new decision. Renewal must not make stale ordinary grants fresh by accident.
- Preserve explicit revocation as stronger than renewal, except for separately audited recovery policy.
- Treat password-authenticated local operator shells as logout/connection/ process-tree/admin-revoke driven by default, with idle lock, renewal prompt, or configured hard maximum as policy choices. Guest, anonymous, remote, federated, and elevated grants can remain short-lived.
Status 2026-05-02 08:43 UTC: the remote-session lifecycle slices add the
kernel live/logged_out liveness cell for SessionManager-minted sessions,
wire UserSession.logout through the remote CapSet gateway, and reject
already-admitted endpoint returns after caller logout/session death. Broker and
restricted-launcher reconstruction now resolves the existing kernel liveness
cell by minted session id and fails closed when it is missing or logged out.
Endpoint RETURN rechecks the caller session after target CQ-space checks and
before copying result bytes or installing result caps; stale callers receive an
invoke-failed completion when possible, the in-flight call is canceled rather
than restored, and prepared result-cap move sources roll back. This closes
explicit remote logout, connection-close propagation, and already-admitted
endpoint result delivery after session death for the current kernel endpoint
path. A 2026-05-11 follow-up also makes clean local owner-shell exit call the
held UserSession.logout() path before process exit, with the shell smoke
asserting the scheduler-observable hook. The full lifecycle target still needs
renewal, administrator revocation, live remote proxy object cleanup, and
complete audit reason separation.
Verification gates before this is closed:
- host tests for liveness cell state transitions and renewal denial after revoke;
- QEMU proof that
exit/terminal close on an owner shell logs out the session and prevents future broker bundle refresh. The logout propagation half is complete for clean shell exit; broker refresh refusal remains covered by the existing logged-out liveness checks and should be re-proven when renewal or replacement shell UX lands; - QEMU proof that pre-expiry renewal keeps the shell usable while old ordinary grant epochs do not silently refresh;
- QEMU proof that post-expiry normal calls fail while explicit renew/logout paths remain available;
- QEMU proof that result-cap delivery after session expiry does not install fresh caps into a stale caller, including a move-source rollback case;
- audit output distinguishing expiry, explicit logout, renewal, administrator revoke, process-exit cleanup, and stale-use denial.
Deferred Work
- Remote capability transport and network transparency.
- Durable account store and external identity binding persistence.
- Full quota service and scheduling-context donation policy.
- Mutable session liveness cells, explicit logout/close propagation, and renewal/recovery paths for usable long-running shells.
- Explicit cross-session sharing UX and audit workflow.
- Stable service-audit identity for endpoint caller-session references across intentional service replacement or upgrade.
- Delegated-subject / act-on-behalf-of context. See
docs/proposals/delegated-subject-context-proposal.md.
Service Object Identity Migration
ARCHIVED — superseded by Session-Bound Invocation Context and the active backlog Session-Bound Invocation Context (2026-04-28). This file is retained as historical context only; do not select work from it. The Big Chunk 2 subject/proof-root and shared-service migration here are NOT on the active mainline path.
Status: superseded on 2026-04-28 14:35 UTC by
Session-Bound Invocation Context
and the active selected backlog
Session-Bound Invocation Context.
The Big Chunk 1 synthetic routing/lifecycle proof remains useful historical
coverage, but Big Chunk 2 subject/proof root opening and shared-service
service-object migration should not proceed on the active mainline path.
Historical plan for replacing caller-selected service-visible identity with kernel-routed service object capabilities and userspace-verified subject capabilities.
This backlog intentionally uses large implementation chunks. Each chunk should land as a coherent reviewed branch with one focused end-to-end QEMU proof plus the affected host tests, rather than splitting the transition into dozens of small branches that each require full verification.
Design Target
The final model has two separate authorities:
- Subject/proof capabilities: issued by trusted userspace services such as
SessionManager, service-principal issuers, workload-identity issuers, anonymous/guest issuers, orAuthorityBroker. - Service object capabilities: minted by a trusted service root/factory after it validates subject/proof authority and policy context.
The kernel enforces generic capability mechanics only:
- live generation-tagged cap-table entries;
- endpoint/object routing;
- receiver immutability across copy, move, IPC transfer, and spawn;
- trusted mint authority;
- revocation/lifetime checks;
- generic queue, byte, cap-count, and scheduling bounds.
Userspace services enforce policy:
- trusted issuer selection;
- subject facts, roles, sessions, guests, service accounts, and workloads;
- external subject admission and local/pseudonymous principal mapping;
- audit context;
- quota bucket selection;
- application object records and facets;
- whether a subject must stay live for a given object.
Request fields remain data. They must never select service authority.
External Subject Alignment
This migration must preserve the identity model already described in
docs/proposals/user-identity-and-policy-proposal.md,
docs/proposals/oidc-and-oauth2-proposal.md, and
docs/backlog/local-users-management.md.
External subjects enter capOS only through an admission pipeline:
- An external verifier validates the provider assertion, such as an OIDC ID token, passkey assertion, certificate chain, cloud workload token, or remote gateway-authenticated claim.
- Admission normalizes the external key as provider kind, issuer, tenant, and
subject, then either maps it through
ExternalIdentityBindingto an existing local principal or admits it as an explicitly configured pseudonymous/guest/service principal. SessionManager, a service-principal issuer, or a workload-identity issuer mints the local subject/proof capability. Imported provider groups, roles, tenant IDs,acr,amr, device posture, and token age are ABAC inputs for this mint decision, not downstream object authority.- A service root validates the local subject/proof capability and policy context before minting a service object capability.
Consequences:
- Service objects store verified local subject facts and audit context, not raw external tokens or provider-specific claim bags.
- A provider claim can influence the object minted at open time only through trusted admission, broker, or verifier capability paths.
- A stale, disabled, or unbound external subject must fail before a service object is minted.
- Remote gateways translate connection authentication into local subject/proof caps; ordinary application services should not authorize directly from network connection identity.
- The same object-capability migration should work for local password/passkey sessions, OIDC users, cloud workload identities, service accounts, anonymous/guest sessions, and future remote cap transports.
Network Transparency Alignment
The first implementation is local to one kernel and one endpoint object graph, but it must not block future network-transparent capability transport.
Design constraints for this migration:
- Do not serialize kernel receiver selectors, cap-table handles, endpoint object ids, generation values, or server cookies as portable object names.
- Treat service object capabilities as live references. A future remote bridge should export/import them through connection-local tables, not through global URLs or raw selector strings.
- Preserve Cap’n Proto-style disconnect behavior: if the local endpoint, server, or remote connection dies, imported references become broken and calls fail explicitly rather than silently rebinding to a new server.
- Keep persistent restore separate from live object routing. If a service needs a durable object reference, restore should go through a capability-bearing persistence/naming service that can authorize and mint a fresh live object.
- Keep subject admission separate from transport identity. A remote bridge may authenticate a TLS/OIDC/certificate/session channel, but application services should still receive local subject/proof caps and service object caps.
- Keep object equality out of the first implementation. If future remote transport needs equality, expose it through a deliberate service or transport protocol rather than assuming global kernel object identity is comparable across hosts.
The local receiver-cookie model should therefore be an implementation detail
behind local ServiceRef capabilities. The portable concept is the authority to
call a typed object reference, not the routing selector used by one kernel.
Non-Negotiable Invariants
- Only trusted mint paths create a new service object identity.
- Ordinary copy, move, IPC transfer, and spawn preserve the same service object.
- A child that receives an object capability acts through that same object, unless a service method explicitly mints a new delegated object.
- Endpoint routing is derived from the invoked capability, not from request bytes, shell text, manifest user input, process id, user name, role name, or numeric labels.
- The kernel never interprets users, accounts, roles, tenants, service accounts, rooms, NPCs, moderators, file owners, or workload names.
- Server cookies used for object dispatch must be generation-safe and must not be raw pointers in the first implementation.
- Move transfer remains transactional in capOS: a failed delivery or canceled receive rolls back reserved source authority rather than silently dropping it before adoption.
- Application rights should be represented by typed interfaces or narrower object capabilities, not by generic permission bitmasks.
Big Chunk 1: Core Service-Object Routing And Lifecycle
Visible proof: a synthetic service in QEMU mints two distinct object capabilities, routes calls by kernel-delivered receiver cookie, transfers one object through IPC and process spawn, and proves copied/moved/spawned handles still reach the same object record. The proof also injects forged identity and selector-like bytes into request payloads and shows they do not affect routing.
Implementation scope:
- Make service-object terminology explicit in cap-table and endpoint code while preserving compatibility with current hold metadata.
- Introduce or formalize endpoint-scoped receiver records with generation-safe server cookies.
- Add a trusted mint interface/path owned by endpoint owner or explicit mint authority.
- Deliver receiver cookie plus interface/method/payload/cap grants to the server.
- Preserve receiver identity across copy, move, IPC transfer, and spawn.
- Add lifecycle behavior for receiver close/revoke and stale-generation reuse sufficient for the synthetic proof. Broader release/exit cleanup remains in Big Chunk 4.
- Add a synthetic service-object demo, manifest, shell/host harness, and hostile checks.
Verification gate:
make fmt-checkcargo test-libcargo test-config- focused QEMU proof target for the synthetic service-object routing demo
make run-spawnor another focused spawn/transfer proof that exercises the modified grant path
Review notes:
- 2026-04-28:
workplan/service-object-routing-coreadded the first focused service-object routing proof, but does not close this whole chunk. The branch introducedCapGrantMode.serviceObjectas the explicit spawn-grant spelling for endpoint-scoped service object facets, keptclientEndpointas compatibility spelling, added receiver-cookie preservation host checks, and addedmake run-service-object-routingfor a synthetic two-object QEMU proof with payload spoofing, copy and move service-object IPC transfer, and nested spawn delegation. The proof also rejects service-object minting through the legacy ProcessSpawner endpoint-result facet path, keeping that compatibility exception scoped toclientEndpoint. At that checkpoint, generation-safe server cookie representation beyond fixed demo constants and explicit receiver lifecycle/close/revoke coverage still remained. - 2026-04-28 14:10 UTC: commit
a4655f0completed Big Chunk 1. The focused demo now encodes service receiver cookies as receiver-index plus generation, stores service-side object records, proves close and revoke rejection for later calls, and queues a stale alpha call before reusing the alpha record slot so the stale generation is rejected instead of dispatching to the reused record. - Review must inspect
capos-lib/src/cap_table.rs,kernel/src/cap/endpoint.rs,kernel/src/cap/ring.rs,kernel/src/cap/transfer.rs,capos-config/src/ring.rs,capos-rt/src/ring.rs,capos-rt/src/client.rs, and the new demo. - Do not migrate chat, adventure, or stdio in this chunk. The synthetic proof should isolate the kernel/runtime semantics first.
Big Chunk 2: Subject/Proof Authority And Service Root Opening
Visible proof: a root service accepts a trusted local subject/proof capability, validates it through a verifier or broker capability, mints a service object, and rejects fake same-shape subject objects, expired/stale proofs, wrong audience, and payload identity spoofing. A spawned child that receives only the object cap can use the object but cannot open sibling objects.
Implementation scope:
- Add minimal schema/runtime surface for a local subject/proof verifier. Keep it local and bounded; do not require full remote cryptographic identity yet.
- Model the verifier result in the same shape used by external admission: local or pseudonymous principal id, principal kind, auth strength, policy profile, resource profile, audit context, and optional claim-derived ABAC attributes.
- Bind subject/proof data to audience, purpose, and freshness enough for the proof.
- Add service-root open semantics over the core service-object mint path.
- Store verified subject, audit context, quota placeholder, policy mode, and optional liveness link in service-owned object records.
- Add hostile checks for fake subject providers and request-field identity spoofing.
- Add explicit delegation behavior: raw object-cap transfer preserves the same object; explicit service delegation mints a new object only through a service method.
Verification gate:
make fmt-checkcargo test-config- relevant host tests for subject/proof encode/decode and validation
- focused QEMU proof target for subject/proof service-root open
- one existing session proof such as
make run-loginormake run-ssh-public-key-authif touched by the subject path
Review notes:
- Avoid broad cryptographic protocol work in this chunk. The target is local issuer-verifiable subject/proof authority, not production remote federation.
- Keep application role policy out of the kernel and out of generic rights bits.
- Do not bypass
ExternalIdentityBindingor admission policy when adding external-subject tests. If a fixture models OIDC, passkey, cloud, or certificate input, it must first resolve to a local subject/proof cap before any service object opens.
Big Chunk 3: Shared-Service Migration
Visible proof: existing shared-service demos run without caller-selected service-visible identity. Chat, stdio/terminal child bridges, and adventure receive service object capabilities directly or open them through root/factory interfaces. Existing shell workflows still work, but children cannot choose or rewrite the object identity they receive.
Implementation scope:
- Convert
Chatinto root/object interfaces such asChatRootandChatParticipant, with subject binding at the root boundary. - Convert stdio or terminal child bridges that depend on endpoint-client identity into service object caps or a narrowed terminal/session object.
- Convert adventure player/NPC authority to service objects, including room speech over migrated chat object caps.
- Update shell launch examples and spawn grant parsing so ordinary grants name existing capabilities only.
- Preserve compatibility only where a focused legacy smoke still needs it, and mark it as transitional.
- Add hostile smokes proving request-field identity spoofing, child relabeling, and unauthorized sibling minting fail.
Verification gate:
make fmt-checkcargo test-configcargo test-libmake run-chatmake run-adventure- focused stdio/terminal or shell proof touched by the migration
- one hostile service-object delegation QEMU proof
Review notes:
- This is deliberately a large branch. Avoid stopping after only chat unless the branch becomes too risky to review coherently.
- If adventure or stdio exposes an implementation blocker, record it as a task
under
docs/tasks/with concrete remediation before merging partial migration.
Big Chunk 4: Legacy Compatibility Retirement And Naming Cleanup
Visible proof: no normal shell, manifest, shared-service demo, or docs path exposes caller-selected service-visible identity. Internal names match the implemented model where the field is only an endpoint-scoped receiver selector.
Implementation scope:
- Rename internal fields, docs, and diagnostics from legacy identity language to receiver-selector terminology where behavior has migrated.
- Remove compatibility grant syntax and manifest fields that can no longer be used by supported smokes.
- Remove the default MOTD adventure launch commands that still expose explicit legacy receiver selectors, or replace them with service-object-safe commands after the shared-service migration.
- Tighten validation so service object authority cannot be constructed from user input.
- Add release/exit cleanup coverage for service object caps with queued calls, in-flight returns, server-owned object records, and receiver revocation.
- Update
docs/capability-model.md,docs/architecture/ipc-endpoints.md,docs/security/trust-boundaries.md,docs/proposals/service-object-capabilities-proposal.md,docs/proposals/user-identity-and-policy-proposal.md,docs/status.md, and relevant backlog files, including notes about future network-transparent import/export and persistent restore boundaries.
Verification gate:
make fmt-checkcargo test-libcargo test-configcargo test-ring-loomif ring metadata changesmake run-smokemake run-spawnmake run-chatmake run-adventuremake docs- generated-code check if schema or generated bindings changed
Review notes:
- This branch should close the compatibility migration or explicitly preserve only low-level hostile-test fixtures.
- Do not leave user-facing syntax or docs that imply clients may choose service object identity.
Deferred Work
- Remote capability transport and network-transparent object references.
- Production cryptographic subject proof protocols.
- Persistent restore of service objects across server restart.
- Full quota service and scheduling-context donation policy.
- Cross-host federation and external identity mapping.
Stage 6 Capability Semantics Backlog
Detailed decompositions for Stage 6 follow-up work. docs/tasks/README.md links here
but should not inline these subtasks.
Notification Objects
Implement a lightweight signal/wait primitive for interrupts and event delivery without full endpoint message overhead.
- Define schema/ABI and wait semantics.
- Add kernel object plus ring operations or methods.
- Add QEMU smoke for signal, wait, timeout, and revoke/drop cases.
Promise Pipelining
Implement promised-answer targeting for CALL SQEs after transfer/result-cap insertion is stable.
- Define promised-answer IDs, dependency encoding, and failure rules.
Existing design decision:
pipeline_depis the process-local promised-answer ID allocated by the runtime, andpipeline_fieldis a zero-based sidebandCapTransferResultrecord ordinal in that answer’s completion. It is not a Cap’n Proto schema field or payload path. Unsupported mappings fail closed, with concrete transport error codes left to the implementation slice before the kernel acceptsCAP_SQE_PIPELINE. - Resolve dependency chains in the kernel without userspace round-trips.
- Add runtime placeholders and an IPC pipeline smoke. The smoke must prove
pipeline_depis the promised-answer ID,pipeline_fieldresolves the selected sideband result-cap ordinal, and mismatched result payload bytes do not affect kernel dependency resolution.
CapabilityManager
Add management-only introspection and grant helpers after transfer/release semantics are stable.
- Define list/grant schema and authority boundaries.
- Implement read-only cap table introspection.
- Add grant smoke and hostile checks for non-manager callers.
Session-Bound Invocation Context
Replace caller-selected endpoint identity with session-bound invocation context
as described in
docs/proposals/session-bound-invocation-context-proposal.md.
The selected 2026-04-28 migration plan lives in
docs/backlog/session-bound-invocation-context.md.
Current status: Gate 0 delegated-client relabeling containment, the
transitional representation substrate, the synthetic service-object
routing/lifecycle proof, Gate 1 process-session invariant, Gate 2
privacy-preserving endpoint caller-session metadata, and Gate 3 chat
session-keyed migration have landed.
Existing code still has a badge-named u64 field in several transport
structs, but the active design treats that field as legacy receiver metadata,
not as service capability. Commit a4655f0 at 2026-04-28 14:10 UTC
completed the historical service-object routing proof with generation-checked
receiver cookies, service-side object records, close/revoke rejection,
stale-cookie rejection after record reuse, receiver-cookie routing despite
spoofed request bytes, copy/move IPC transfer, and nested spawn delegation.
Gate 4 in docs/backlog/session-bound-invocation-context.md is implemented
and verified on mainline: shared-service legacy cleanup has moved normal chat,
adventure, and terminal/stdio paths off caller-selected receiver metadata. Do
not continue the superseded subject/proof root-opening path from
docs/backlog/service-object-identity-migration.md unless the selected
milestone changes again.
Paper prerequisite. Gate 2 endpoint caller-session metadata, Gate 3 chat
session-keyed migration, and Gate 4 shared-service cleanup have landed. The
paper/status closeout for whitepaper claim C1 (“schema-typed methods replace
parallel rights”) remains peer-owned: docs/paper/evidence-gaps.md,
docs/paper/plan.md, and the matching #todo block in
papers/schema-as-abi/main.typ still need to reflect the landed evidence.
Gate 0: delegated-client relabeling containment
This is the first Telnet Shell Demo blocker. It must land before shell launch can be exposed through any network-backed terminal.
- Add hostile coverage proving an ordinary shell or delegated endpoint
client cannot re-label a client endpoint by choosing a different
identity in a spawn grant. Cover explicit
badge N, the legacy badge-zero encoding that old omitted syntax used to produce, and current omitted shell syntax preserving the delegated source identity. Worker B checkpoint: normal shell help and smoke-help assertions no longer advertisebadge N. Worker C checkpoint: init spawn hardening now mints a nonzero delegated client facet into a child init process and asserts that explicit-badge and badge-0 relabel spawn attempts fail. - Change
ProcessSpawnersoClientEndpointgrants from delegated client facets preserve the source identity and reject attempts to set a different value. Endpoint owners and trusted parent endpoint result caps remain the only transitional paths that may mint a new client identity. - Remove arbitrary
badge Nfrom normalcapos-shellhelp and smoke-help launch examples; keep legacy manifest/debug syntax only where the kernel enforcement still rejects delegated-client relabeling. The default MOTD adventure launch commands now omit explicit legacy selectors; Gate 4 indocs/backlog/session-bound-invocation-context.mdstill owns retiring remaining manifest-level selector compatibility after session-bound chat and adventure migration. - Document the containment in
docs/architecture/ipc-endpoints.mdand trust-boundary docs before exposing shell launch through Telnet.
Historical Gate 1: service object representation
- Define the transitional kernel/runtime representation for existing
endpoint-backed service facets: target endpoint, interface id, and
legacy receiver metadata.
2026-04-25 18:31 UTC checkpoint: the first representation slice reuses
CapHold { object_id, interface_id, badge }as endpoint object, service interface id, and endpoint-scoped receiver selector for existing endpoint-backed service objects. Dispatch and spawn now preserve the held metadata for ordinary delegation; explicit trusted minting remains open. - Complete the transitional representation replacement with explicit
generation-safe receiver records and lifecycle coverage for the
synthetic proof. Big Chunk 1 now covers trusted service-object minting,
receiver-cookie dispatch, receiver-preserving copy/move IPC transfer and
spawn, request-byte spoofing checks, generation-safe server cookies, and
close/revoke/stale-generation rejection.
2026-04-28 14:10 UTC checkpoint: commit
a4655f0added generation-checked receiver cookies, service-side object records, close/revoke rejection, and stale-cookie rejection after record reuse. - Add the minimum trusted mint path needed for the synthetic service-object
proof: endpoint owner or explicit mint authority creates the initial
service object cap; ordinary clients only copy or move it.
2026-04-28 checkpoint:
CapGrantMode.serviceObjectlets endpoint owners mint copy-transferable endpoint-scoped service object facets for child processes while delegated service object caps cannot relabel the held interface or receiver cookie. The legacy ProcessSpawner endpoint-result facet exception remains scoped toclientEndpointand is rejected forserviceObject. - Scope receiver selectors to the target endpoint and keep them out of shell syntax, manifest user fields, and service policy labels.
- Preserve the current held receiver metadata across copy and move transfer. Ordinary transfer must not mint a sibling object.
- Prove receiver identity preservation across copy, move, IPC transfer, and
spawn in the synthetic service-object QEMU proof.
2026-04-28 checkpoint:
make run-service-object-routingexercises copy-transfer and move-transfer of service object caps through IPC, nested spawn delegation, and hostile payloads that try to name the other receiver. - Enforce that client-held service object caps cannot use endpoint receive/return authority unless a separate server-facing interface grants that authority.
- Deliver endpoint metadata so servers can dispatch current object-shaped
calls without treating it as caller-selected identity.
2026-04-25 18:45 UTC checkpoint: trusted manifest/init minting now uses
explicit
CapabilityAsspawn grants to request a service interface from endpoint exports, validation rejects the same override for non-endpoint exports, andsystem-spawn.cueproves a non-Endpoint service interface plus selector reaches the server receive metadata. - Rename or wrap server delivery surfaces around receiver-selector/server- cookie terminology once the behavior is receiver-selector-only.
Gate 2: process session invariant
- Add process-owned immutable session context with explicit system/service session support.
- Make child spawn inherit the parent’s session by default and require trusted broker/session-manager authority for different child sessions.
- Add host and QEMU coverage proving ordinary processes cannot inject or use a second independent session subject.
Gate 3: endpoint caller session metadata
- Deliver opaque service-scoped caller-session references and freshness results to endpoint servers.
- Add an explicit subject-disclosure path so global principal/profile details are not revealed to services by default.
- Add hostile coverage proving request bytes cannot spoof session identity or force disclosure.
Gate 4: shared-service demo migration
- Convert chat identity from legacy receiver selectors to broker-granted chat roots/facets plus service-scoped caller-session references.
- Finish adventure NPC/service-authority cleanup and any remaining stdio/terminal child bridge paths that depend on caller-selected endpoint identity. Aurelian ordinary player state is already keyed by live endpoint caller-session metadata.
- Retire normal user-facing badge/receiver-selector syntax after chat, adventure, stdio, and endpoint smoke paths no longer depend on it.
Scheduling Context And Resource Donation
Convert the roadmap’s priority/budget donation and session-quota ideas into a measured design before adding new scheduler policy.
- Record current direct-switch IPC timing and priority-inversion risks.
- Define scheduling-context donation metadata.
- Define resource donation parameters for session-creating caps.
Init ELF Embedding
Done 2026-05-25 23:26 UTC. The init ELF ships inside the kernel binary via
include_bytes!, not as a manifest entry or separate Limine module.
kernel/build.rs reads the prebuilt init/ artifact (CAPOS_INIT_ELF, with a
conventional-path fallback) and emits a kernel::boot::INIT_ELF: &[u8] static;
kernel bootstrap parses it through the existing capos_lib::elf loader. Init
stays a standalone crate with its own linker script and code model. Embedding is
byte packaging, not linker merging.
Landed as a hybrid keyed on the reserved selector rather than an
always-embedded init: initConfig.init.binary is a generic “which binary is
PID 1” selector, and most boots run a non-init binary as PID 1 (run-smoke’s
shell, ~70 focused test-as-PID-1 manifests). So embedding applies only when
init.binary == capos_config::RESERVED_INIT_BINARY_NAME ("init"): then PID 1
loads from INIT_ELF with no binaries resolution, and manifest validation
(capos-config/mkmanifest) rejects any binaries entry named "init". Any
other selector still resolves PID 1 from SystemManifest.binaries exactly as
before. The real-init manifests (system.cue via the shared _baseBinaries
plus the focused init.binary == "init" manifests) drop their init binaries
entry; run-smoke and the test-as-PID-1 manifests are unchanged.
Because the embedded image is the canonical init, child spawns that reference
the init binary by name (e.g. system-spawn.cue’s spawn-hardening fixtures)
keep working: run_init injects the embedded bytes into the ProcessSpawner
binary set under the reserved name when init is embedded (the BootPackage cap
serves only the serialized manifest bytes), so the spawnable set matches the
pre-embedding state without init appearing in the serialized manifest.
Proof: make run-init-embedding (minimal system-init-embedding.cue: PID 1
from INIT_ELF, no reserved binaries entry) and make run-smoke (PID 1 =
shell, unchanged). cargo test-mkmanifest / cargo test-config cover the
reserved-name rejection and the init-ref skip.
Reference: docs/proposals/service-architecture-proposal.md section
Init Binary Embedding.
Remote Session CapSet Client Backlog
Detailed decomposition for the remote host app path described in
Remote Session CapSet Clients.
docs/tasks/README.md should point here when selecting implementation slices; it should
not inline the details.
Visible Outcome
make run-remote-session-capset-interop boots capOS in QEMU, starts a
loopback-scoped remote session gateway, runs a regular host-side Rust client
on the host, authenticates or exercises an explicitly configured
guest/anonymous denial path, obtains a RemoteSession, lists a broker-issued
RemoteCapSet, gets typed capabilities by name/interface ID, calls at least
two granted capabilities, proves missing/wrong-interface denials, logs out or
disconnects, and observes stale proxy calls fail closed.
The first harness can be a small CLI because it is easy to script. The product shape should also support a native desktop GUI, a Tauri app whose Rust backend holds the remote CapSet, or a webapp whose trusted server/gateway holds the remote CapSet and exposes only UI frames, command descriptors, or bounded tool requests to browser JavaScript. The UI path can be bidirectional: the host UI may grant a narrow UI-surface capability back to capOS-side services or agents so they can propose task-specific panes, command palettes, visualizations, theme hints, and layout changes without receiving arbitrary host UI authority.
The ordinary operator run story is: start capOS with make run, note the
printed remote CapSet: tcp 127.0.0.1 <port> -> guest :2327 line, then start
one of the host clients against that endpoint. make run injects the host
USER as the default operator account name on the capOS side; the CLI
may take --user (or CAPOS_REMOTE_SESSION_USER) as an explicit operator
override, but the web bridge keeps the login username field empty by
default to avoid leaking host identity hints into the page before
authentication. The current repo-local commands are:
make run
cargo run --manifest-path tools/remote-session-client/Cargo.toml \
--target x86_64-unknown-linux-gnu \
--bin remote-session-client -- --host 127.0.0.1 --port <printed-port>
CAPOS_REMOTE_SESSION_PORT=<printed-port> make remote-session-ui
The CLI also accepts --launch-adventure for the default-manifest proof that
starts the Adventure service graph through serviceLaunch and requires a
running status. --adventure-status follows a successful Adventure launch
with bounded Adventure.status, Adventure.look, and Adventure.inventory
calls through the session-bound worker; --adventure-go <direction> adds the
first mutable typed DTO call by invoking bounded Adventure.go(direction) and
checking the returned text/room response. The same CLI path now accepts bounded
--adventure-take <item>, --adventure-use <item>, and
--adventure-drop <item> controls for simple item interactions. The focused
positive proof is
make run-remote-session-adventure-interop; the existing
make run-remote-session-capset-interop fixture remains a launch-denial proof
shared with the browser UI smoke path.
The CLI and trusted local web bridge are development tools in this repo. The
repo-local Tauri path reuses the same Rust backend boundary by loading the
loopback remote-session-ui surface in a desktop webview:
CAPOS_REMOTE_SESSION_PORT=<printed-port> make remote-session-tauri
By default that target first runs a policy preflight over the reviewed
check/dev scaffold, then checks Tauri CLI and Linux build prerequisites,
reports dependency/scaffold status, and runs a deterministic wrapper
cargo check when the host has those prerequisites. Set
CAPOS_REMOTE_SESSION_TAURI_MODE=dev to launch cargo tauri dev. Missing
host Tauri packages fail with explicit diagnostics and point operators back to
make remote-session-ui; the Tauri wrapper is not a different authority
model. CAPOS_REMOTE_SESSION_TAURI_MODE=policy tools/remote-session-tauri.sh runs only the scaffold guardrail and does not
need Tauri system packages or a desktop session. package and automation
modes are intentionally blocked until distributable packaging and desktop
automation receive reviewed designs.
The first visible proof keeps QEMU host forwarding and a development
transport. The current implementation uses length-prefixed schema-framed
Cap’n Proto DTOs for remote login, session summary, CapSet list/get, calls,
denials, and logout. Standard capnp-rpc framing and live object proxies
remain the transport direction, but the first proxy slice is now explicitly
dual-stack: host-backend-only capnp-rpc proxy objects over the existing DTO
gateway first, then guest-wire replacement after the capOS userspace runtime
decision.
Implementation Status
Implemented and active slices:
-
The capnp-rpc transport DTO surface is pinned in
schema/capos.capnpahead of the transport rewrite:RemoteAuthStart,RemoteAuthStep,RemoteServiceGrantRequirement,RemoteServiceExport,RemoteServiceProfile, plus theRemoteSessionGateway,RemoteAuthFlow,RemoteSession,RemoteCapSet,RemoteServiceCatalog, andRemoteServiceRunnerinterfaces. Round-trip coverage lives incapos-config/tests/remote_capnp_rpc_dto_roundtrip.rs. -
Runtime placement decision:
capnp-rpcv0.25 isstd-only and needs a futures executor, whiledemos/remote-session-capset-gateway/src/main.rsis a#![no_std]#![no_main]gateway with a synchronous accept/recv/handle/send loop. Therefore the first proxy implementation is host-backend-only. The trusted Linux Rust backend may host a localcapnp-rpcfacade/proxy layer for chat or Adventure and translate those calls into the existingRemoteGatewayRequest/RemoteGatewayResponseDTO transport. The gateway, schema, generated bindings, kernel services, browser API, and browser view models stay unchanged for that slice. This is a temporary dual-stack period: the backend proves proxy semantics and exception mapping over the DTO wire, but it must not be documented as live standardcapnp-rpcsupport inside the capOS guest. The gateway wire replacement remains gated on a reviewed capOS userspace async runtime or a reviewed sync-friendly Cap’n Proto RPC adapter. The completed task filedocs/tasks/done/2026/remote-session-host-backend-capnp-rpc-facade.mdrecords the implementation metadata and validation for the host-backend slice. -
Host-backend
capnp-rpcfacade forChatlanded2026-05-13 08:29 UTC.tools/remote-session-client/src/rpc_facade.rscreates a localcapnp-rpcChatclient/server object in the trusted Linux backend and translatesjoin,leave,send,who, andpollcalls into the existing synchronousRemoteGatewayRequest/RemoteGatewayResponseDTO operations. The CLI client and trusted web bridge now route chat calls through the same backend-only facade. Browser JavaScript still receives only view models, typed results, typed denial envelopes, and redacted transcript rows; it does not receive raw capOS caps, local cap ids, endpoint owner handles, result-cap slots, process handles, or proxy table positions. Denials remain DTO/domain results at the browser boundary, and transport disconnects keep the existing reconnect-required mapping. This proves backend proxy semantics over the DTO transport, not live standardcapnp-rpcsupport inside capOS or on the guest gateway wire. -
The capOS SDK transitional
RemoteTransportnow uses the same trusted host DTO backend as a host-sidestdtransport for shared typed clients. The first proof maps a forwardedsystem_infocap obtained throughCapSetGetto a synthetic host-side cap id and drivesSystemInfoClient::motd_waitthrough the currentsystemMotdDTO. This is still backend proxying over the length-prefixed DTO gateway, not live guest-wirecapnp-rpc. -
make runstarts the remote-session CapSet gateway in the default manifest and forwards guest port2327to a host-local loopback port. The helper prefers127.0.0.1:2327but selects a free fallback when another QEMU run or developer process already occupies the port, unless the port is explicitly configured. -
make run-remote-session-capset-interopboots a focused manifest, runs a Linux Rust host client, authenticates as the configured operator by default, lists the broker-shaped remote CapSet, callssession,system_info, and the first endpoint-backedchatservice through a per-session worker proxy, proves wrong-interface/unknown/stale denials, and records a redacted transcript. -
make run-remote-session-adventure-interopuses a focused manifest with the Adventure server, companion NPC binaries, andremote-session-adventure-workerembedded. The operator client launches the Adventure graph, callsadventureStatus,adventureLook,adventureInventory, and the first mutableadventureGo(direction)DTO plus boundedadventureTake(item),adventureUse(item), andadventureDrop(item)controls, proves stale failure after logout, and preserves the same transcript authority-leak checks. -
RemoteAuthMethodadvertises password and anonymous as enabled methods plus disabled public-key, OIDC, and passkey/WebAuthn entries so the protocol and client are not password-shaped. -
The capOS gateway uses manifest-scoped
TcpListenAuthorityon guest port2327, plusSessionManagerandAuthorityBroker. It does not receive rawNetworkManager,TcpListener, orTcpSocketauthority, and the manifest does not grant service endpoint caps directly to the gateway. The gateway asks the broker for a narrower remote-client bundle, exposes broker-held service endpoints such asadventureandchatas remote CapSet descriptors, and starts the firstchatendpoint proxy through a session-bound worker when the client callschatSend. Adventurestatus,look,inventory, boundedgo(direction), and boundedtake/drop/useitem calls now have matching service-specific worker/client slices after the Adventure graph is launched. Other mutable Adventure methods and Paperclips direct methods still wait for their service-specific worker/client slices. Login source metadata is derived by the gateway from the accepted socket and a gateway-generated connection event id rather than from client-supplied fields. -
The host Rust client crate is UI-neutral and can back a CLI, native GUI, Tauri backend, or trusted web gateway.
-
A first trusted local web bridge now exists as
remote-session-ui. It serves a loopback-only browser UI whose Rust backend holds the TCP connection and remote session state.make run-remote-session-capset-uiboots the focused gateway-only fixture, drives every visible button in the browser UI, and captures a screenshot plus a redacted transcript. The current web UI uses a dedicated full-window sign-in view with compact endpoint/auth controls and no full persistent technical header. Login includes a visible username field that is empty by default – the bridge does not pre-fill fromCAPOS_REMOTE_SESSION_USER, hostUSER, or any other host-side identity hint, because pre-filling would leak operator/account hints to anything observing the page before authentication. The browser sends only username/password for password login;operatorand other resource-profile names are not user-typed system details. For the current legacy DTO protocol, the trusted Rust backend maps an omitted password profile to the default operator profile before calling the gateway; gateway-side profile policy/picker support remains future work for manifests with multiple user-meaningful choices. Authenticated users land in a Services-first SPA workspace with Services, CapSet, Diagnostics, Transcript, and Session views rather than seeing every technical panel at once. The UI smoke tracks visible buttons across login and workspace states and fails when any visible button is not exercised. -
The current UI slice makes Services the task-oriented SPA action hub for the default-manifest service surface. It should use the catalog and launcher view models to show runnable profiles, required grants, launch status, denials, and generic/simple service panels without moving capOS authority into browser JavaScript.
-
A read-only DTO service catalog now advertises currently available remote DTO services (
session,system_info) plus backend-held endpoint services such aschatandadventurewhen the broker returns them for the authenticated profile. A companion launcher catalog describes service-runner profiles, required grants, and exported service descriptors. Adventure is the active default-manifest launch profile; Paperclips remains a future profile until its authoritative server path is available to the default remote session. The catalogs are browser-safe view models only: no rawProcessSpawner, process handle, endpoint owner, local cap id, or result-cap slot is exposed. -
The launch DTO/probe slice is complete. It exposes the remote-safe
serviceLaunchrequest/status path for cataloged profiles. The request carries only a profile id plus explicit grant names; the status reports support state, accepted grant names, a message, and exported or planned service descriptors. The completed probe contract does not call spawn, create/own endpoint receivers, return process handles, or attach new service caps to the remote CapSet. -
The current Adventure
serviceLaunchslice implements the actual restricted backend launch for the defaultmake runmanifest. The trusted backend/gateway startsadventure-serverplus simple NPC companion processes through an approved service-runner profile and attaches or retains backend-held descriptors/caps for the Adventure/chat-facing services. Browser JavaScript still receives only view models, launch status, service descriptors, denial diagnostics, and typed results. Real direct Chat.send now runs through the first per-session worker/proxy proof; Adventurestatus,look,inventory, boundedgo(direction), and boundedtake/drop/useitem actions use the same pattern after launch, while richer Adventure controls remain later client layers over the same backend-held capability boundary. -
The launch-denial proof is implemented for the currently exposed remote gateway paths. Focused CLI and UI QEMU harnesses drive operator missing-grant, wrong-interface, and disallowed-binary
serviceLaunchdenials; the CLI QEMU harness also drives stale-session and anonymous/no-runnerserviceLaunchdenials. Smoke checks require explicit error codes/messages, backend teardown, no Adventure server or companion process spawn in the denial-only fixture, and no raw process-handle, endpoint-owner, local-cap, result-cap, capability-manager, process-spawner, terminal-authority, or network-authority markers in browser-visible envelopes, UI reports, or redacted transcripts. The separaterun-remote-session-adventure-interopfixture embeds the Adventure binaries, requires the Adventure process graph to spawn, and verifies directAdventure.status,Adventure.look,Adventure.inventory, mutableAdventure.go(direction), and bounded itemtake/use/dropresponses through the worker. Guest admission shipped on2026-05-08 03:59 UTCasRemoteAuthMode::Guestplus theRemoteGatewayRequest.guestLogin @24union arm; the gateway routes it throughstart_guest_sessionand the sharedvalidate_guest_admissionlib-level helper, which refuses any attempt to acquire a non-guest profile (e.g.operator,anonymous) via the guest method and any session whose minted principal is notGuest. The QEMU interop harness now exercises a guest happy-path proof and a guest-profile-mismatch denial; theRemoteErrorCode::DisabledAuthMethodpath is covered through the bridge host-test layer (a manifest with no guest seed makes the kernelSessionManager.guest()return failure, which the gateway maps to that code). -
Rust-level backend/account-store denial coverage now proves inactive accounts (
disabled,locked, andrecovery-only), unknown principals, and missing or retired resource profiles cannot produce remote-client bundle plans. Focused SessionManager account-selection coverage records that unknown, inactive, non-operator, or no-console-password account paths do not become password-login candidates suitable for later broker use. The live CLI QEMU gateway proof now drives failed password proof, unknown account, wrong password requested profile, and anonymous profile mismatch cases; each denied client completes asauth-deniedwith no session start, CapSet list/get, session info, or service-launch activity. Denied re-login clears prior per-connection gateway state plus cached host-client and web-bridge session view state instead of leaving stale authority usable after denial. -
Kernel-backed remote logout is implemented for the DTO gateway. Each
SessionManager-mintedUserSessionregisters a kernel-private liveness cell keyed by the minted session id. Reconstructed broker and launcherSessionContextvalues resolve that existing cell and fail closed if it is absent or logged out; they do not create fresh live state fromSessionInfobytes. Explicit remote logout callsUserSession.logout, and connection teardown logs out the owned live remote session before dropping the backend session cap.UserSession.info, session-boundSystemInfo, endpoint call admission, and normal service-cap dispatch go stale after logout;UserSession.auditContextremains available for audit attribution. Endpoint returns now recheck the caller session at the return commit point: if the caller logged out, expired, or otherwise went stale after admission, the kernel rolls back prepared result-cap move sources, cancels the in-flight call instead of restoring it, posts an invoke-failed caller completion when the caller CQ can accept it, and rejects the server RETURN without copying result bytes, application-exception payloads, result-cap records, or returned caps into the stale caller. -
Gateway idle-disconnect bug fixed (operator-reported regression on the trusted web bridge). Symptom: after some time of using
make remote-session-uiagainstmake run, the next routine action – often a periodic or user-drivensessionInforefresh – failed withgatewayDisconnectedcarrying the message “remote gateway closed the connection during sessionInfo; retry login to reconnect”, forcing the operator to log in again. Root cause was gateway-side: the per-frame TCP recv on the accepted remote-session socket used a 5-second timeout (WAIT_NS = 5_000_000_000) insiderecv_exact/recv_frame. Routine inter-request idleness on the bridge – which is reactive, not driven by a background poller – exceeded the 5 s budget, the gateway treated the timeout as a fatal recv failure, exited the per-connection loop, ranclose_remote_session_state(issuingUserSession.logoutand the “remote session stale” / “connection teardown” audit lines) and dropped the TCP connection, then accepted the next host TCP attempt fresh. The bridge’s next request hit the closed socket and surfaced the disconnect throughgateway_io_error. Fix: useRECV_FRAME_WAIT_NS = CAP_ENTER_WAIT_FOREVERfor the per-frame recv loop. The kernel-side TCP recv waiter still resolves on data arrival, on clean peer FIN as a 0-byte completion (treated as graceful peer teardown), and on transport-level errors (treated as fatal recv failure); only the spurious 5-second idle timeout is removed. Regression test:recv_frame_wait_is_forever_to_survive_idle_remote_clientsindemos/remote-session-capset-gateway/src/lib.rspins the policy constant. Aconst _: () = assert!(...)in the gateway main keeps the lib constant and the runtimeCAP_ENTER_WAIT_FOREVERsentinel in lockstep so the value cannot drift back to a finite timeout. The short-lived smoke harnesses (make run-remote-session-capset-interop,make run-remote-session-capset-ui) finish well within the previous 5 s budget and so did not catch this – the bug only fires under realistic interactive operator pacing. Future work: when SSH Shell Gateway lands, audit the equivalent recv-loop policy on that path before borrowing the shape from this gateway. -
Partial-frame DoS proof closed
2026-05-07 08:37 UTC. The forever-wait fix above survives quiet remote peers but, taken alone, also lets a peer that sends a frame header and then stalls (or dribbles a few bytes per minute) keep the gateway accept loop pinned on a single connection. The gateway recv now uses a two-phase wait policy: byte 1 of an idle frame waits forever (RECV_FRAME_WAIT_NS = CAP_ENTER_WAIT_FOREVER) with up toTCP_RETRY_ATTEMPTS = 1024EAGAIN retries, while bytes 2..N of an already-started frame use the boundedWAIT_NS = 5_000_000_000(5 s) wait with no EAGAIN retry, and the per-frame recv-call count is capped atMAX_FRAME_COMPLETION_RECVS = 64, bounding a slow-dribble peer at roughly 5 minutes per frame before the gateway closes the connection. Proven byrun_partial_frame_probeintools/qemu-remote-session-capset-harness.sh, which opens a TCP connection, sends a 4-byte header declaring an 8192-byte payload followed by only 4096 payload bytes, and observes the gateway closing the connection within 20 seconds; the QEMU smoke (tools/qemu-remote-session-capset-smoke.sh) asserts the proof lineremote-session partial-frame proof: started payload closed after bounded wait.
Default Run And Game Server Story
The default operator manifest is system.cue, layered on
cue/defaults/defaults.cue. Today it boot-launches standalone init; init
starts chat-server, remote-session-capset-gateway, remote-session-web-ui,
and the foreground shell. The default binary catalog embeds Adventure server,
Adventure NPC, Adventure client, and the terminal Paperclips binary. Adventure
is not boot-started automatically, but the current remote-session slice makes
the default-manifest serviceLaunch path start adventure-server plus simple
NPC companions through a restricted backend service-runner profile and attach
or retain backend-held Adventure/chat-facing service descriptors/caps.
Paperclips launch remains future. The default remote-session gateway receives
only console, scoped TCP listen authority for guest port 2327,
SessionManager, AuthorityBroker, and narrowly approved backend launch
authority; it does not expose raw ProcessSpawner, raw network-manager/socket
authority, endpoint owner caps, process handles, local cap ids, or result-cap
slots. The remote-session-web-ui service receives scoped TCP listen authority
for guest port 8080, SessionManager, AuthorityBroker, console, and the
read-only system manual cap.
make run forwards guest port 8080 to a loopback host port and prints
remote self-served UI: tcp 127.0.0.1 <port> -> guest :8080 so the operator
can open the self-served UI in a browser directly from the default operator run.
Current game-server proofs live in focused manifests:
make run-adventureusessystem-adventure.cue, which startschat-server,adventure-server, Adventure NPC companion processes, anadventure-scenario-test, and the shell. The Adventure server exports theadventureendpoint, consumes a client facet ofchat, owns room/player state, and keys player access by the live caller-session reference.make run-paperclipsusessystem-paperclips.cue, which startspaperclips-serverandpaperclips-proof-serverservices exportingPaperclipsGameendpoints, then launches the terminalpaperclipsclient with explicitStdIO, game endpoint, timer, and optionalproof_acceleratorgrants. The server owns generated content, game state, timer cadence, command descriptors, status snapshots, project entries, unlock checks, and game-rule mutation.
The remote UI direction is therefore not “open a terminal and type the MOTD
commands.” The completed DTO/probe slice can describe and probe runnable
game-server profiles without side effects. The current Adventure implementation
gate is the real restricted service-runner/catalog surface for the default
manifest: it starts the approved Adventure server graph and attaches or retains
the capabilities those processes export or receive to the backend-held remote
CapSet. The service-panel UI can expose this as launch state, descriptors,
denials, and generic/simple surfaces. Chat now has the first worker-backed
method proof; Adventure status, look, inventory, bounded go(direction),
and bounded take/drop/use item calls have a service-specific per-session
worker/client context after launch. Paperclips stays future until the
server-owned Paperclips profile is available to the default remote session.
Host UI UX Direction
For the high-level synthesis of UI scope, invariants, and architecture,
read docs/proposals/remote-session-capset-client-proposal.md ->
“UI Scope And Architecture”. This section keeps the operator-story
guidance for day-to-day UX work.
The host UI should optimize for the ordinary operator stories instead of mirroring protocol objects one-for-one:
- Connect and sign in: start with a dedicated OS-like authentication
view. The username field is visible and empty by default – the
web bridge does not pre-fill from
CAPOS_REMOTE_SESSION_USER, hostUSER, or any other host-side identity hint, because pre-filling leaks operator/account hints to anything observing the page before authentication. The CLI may take--useras an explicit operator override; the web UI does not. Endpoint/auth method controls remain available but secondary; retryable login/transport errors stay in the login view without losing the configured endpoint. Resource-profile names such asoperatorare not requested from the user during password login; they are filled only by the trusted Rust backend for the current legacy DTO. A gateway-side policy choice or post-auth profile picker should appear only when multiple manifest-published profiles are meaningful to the user. - Auth method advertising: the gateway forwards the auth methods the system supports, narrowed only by explicit manifest policy. Disabled methods stay listed and clearly marked (so the protocol is not password-shaped); the gateway does not silently hide methods the system supports.
- Understand session health: after login, keep the active profile, principal, expiry, recent result, and logout in a Session view so common service work does not start on a protocol summary.
- Use granted services: make Services the action hub for runnable profiles and remote-proxyable service descriptors. It should show availability, required grants, denial reasons, launch status, and generic command/status forms. When a descriptor is not directly callable yet, the panel should say so instead of implying method success. Service-specific rich clients (real Chat panel, Adventure rich client, Paperclips client, future agent-shell services) layer on top of the same backend-held caps.
- Terminal panels are allowed when granted: the CapSet UI is not
defined as a terminal emulator and works without one, but when
the broker grants a
TerminalSessioncap (for native shell, POSIX shell, or any StdIO-based service expecting a terminal on the other side), the UI may host a terminal panel for that cap. Terminal bytes flow through a backend-heldTerminalSession; the browser renders frames it receives, never opens a raw shell or holds aProcessSpawner. - Agent-shell-exposed capabilities are first-class: the CapSet UI
does not contain the LLM loop, model client, or tool-execution
runner, but agent-shell-exposed services (e.g. “send message to
running agent”, “approve queued action”, “audio stream to/from
agent”) are services the broker can bundle, exposed through the
same per-session worker / typed view-model pattern as Chat or
Adventure. Whether some of those agent surfaces should themselves
be layered on
Chatrather than distinct caps is the cross-cutting refinement task tracked indocs/tasks/. - Inspect capabilities: keep CapSet as an explicit inspection view for users who need names, interface IDs, policies, and descriptor selection.
- Diagnose calls: isolate low-level probes, stale-session proofs, MOTD, and raw result JSON in Diagnostics so common service use is not buried under transport details. The session-summary diff control belongs in Session/Diagnostics, not in the main Services flow.
- Audit and export: keep transcript review/export in its own view, with redaction status visible and raw authority material absent.
Modernization should build on that navigation shape: no full persistent technical header on the login view, a compact authenticated app shell, clear loading and denial states, empty states with next actions, searchable service/capability lists, command forms generated from typed descriptors, side panels for details, keyboard-friendly controls, responsive layouts, and service-specific rich clients layered over the same backend-held capabilities. Adventure and Paperclips should eventually have rich client views, but the minimum viable UI must still expose their available server capabilities through simple generic forms first.
Service-Runner And Catalog Path
Staged path:
- The first reader-facing service catalog is implemented in the DTO gateway and UI. It lists available DTO calls plus service-runner profiles and exported capability descriptors for the current session.
- The remote-safe launch DTO/probe contract for those profiles is complete. The request names a catalog profile and explicit grants; the probe/status result reports support state, accepted grant names, a message, and planned exported descriptors. This slice is intentionally side-effect-free: it does not start a process, allocate endpoint owners, return process handles, or attach caps.
- The current Adventure slice implements a restricted service-runner surface
behind the broker for the default
make runmanifest. It may use local spawn authority internally, but the remote session receives only catalog descriptors, launch requests, launch status, and returned remote capability descriptors. RawProcessSpawner, process owner handles, endpoint owner caps, local cap IDs, result-cap slots, and process handles stay inside capOS or the trusted backend. - The CLI and
remote-session-uibackend can call the runner and attach or retain the returned backend-held descriptors/caps. Browser JavaScript receives view models, launch forms, progress, denials, command/status descriptors, and call results for methods that are actually callable through the current DTO path; it does not receive raw capOS capability objects. - Start with simple generic panels. Adventure now exposes launch plus
status/look/inventory, bounded mutable
go(direction), and simple boundedtake/drop/useitem controls over the backend-heldAdventureendpoint and chat-facing descriptors. The first direct chat call and these Adventure controls run through session-bound worker proxies; broader Adventure verbs and Paperclips calls still need service-specific worker/client layers before richer clients sit on top of the same backend-held CapSet. Paperclips can exposePaperclipsGame.commands,status,projects, andcommandonce the server profile is available to the default remote session. - Keep hardening the repo-local Tauri wrapper. The current
make remote-session-tauricommand policy-checks, dependency-checks, or launches a scaffolded desktop wrapper over the same Rust/backend authority boundary as the web bridge and uses the printedmake runremote CapSet port. The policy check fails closed if bundling, window URLs, default capabilities, app-specific invoke handlers, Tauri commands, ortauri-plugin-*usage drift from the reviewed check/dev scaffold. Distributable packaging and desktop automation remain future polish.
Remaining major gaps:
- Continue expanding the first host UI beyond the current
session,system_info, and worker-backedchatproof while still reusing the Rust backend boundary and DTO gateway. A later Tauri package can wrap the same backend when the goal is a distributable desktop app. - The first richer service client is a session-summary diff. The pure Rust
helper lives in
tools/remote-session-client/src/session_diff.rsand compares two snapshots of the remote session view (CapSet plusSessionInfoSummary) intoCapSetDiff/SessionSummaryFieldDiffrecords keyed on(name, interface_id)and visible session fields. The trusted web bridge stores the raw snapshots backend-side and exposes/api/call/session-diff-refresh, which returns a redactedSessionSummaryDiffVm. The browser renders the diff in a dedicated “Last refresh diff” pane on the Session view, with the newsession-diff-refreshbutton exercised twice by the focused UI smoke (first call captures a baseline withhasBaseline=false; the second call reports the diff against the previous snapshot withhasBaseline=true). Backend host tests cover the baseline + no-change path and an added-cap + expiry-change path. - Make the remote UI capable of discovering and presenting the full
remote-proxyable functionality granted to the authenticated session in the
default
make runmanifest. The first pass may use generic/simple panels for demo services such as chat, Adventure, and Paperclips, but users should not have to switch tools merely because a capability is part of their default remote session bundle. Rich game-specific clients are a later UI layer on top of the same backend-held CapSet, not a reason to narrow the first UI to onlysessionandsystem_info. - Extend the implemented Adventure service-runner slice beyond the first
mutable control. The current host backend can start the allowed
default-manifest Adventure server graph through the restricted launch path,
discover the resulting descriptor in the backend-held remote CapSet, and
call
Adventure.status,Adventure.look,Adventure.inventory, boundedAdventure.go(direction), and boundedAdventure.take/Adventure.drop/Adventure.usethrough a per-session worker. Next work is broader Adventure command coverage and richer game-specific clients on top of that same worker-held boundary. - Keep Paperclips launch future until the authoritative Paperclips server profile is available to the default remote session. The UI may show Paperclips as planned/not remote-proxyable rather than claiming launch support.
- Replace the DTO transport with standard
capnp-rpcframing and live typed remote proxy objects. - Expand auth adapters beyond password and anonymous.
- Use the generalized per-session worker lifecycle manager for future
endpoint-backed services. Chat
sendand Adventurestatus/look/inventory/go(direction)/take/drop/usenow share worker spawn validation, logout/close teardown, graceful shutdown, forced termination fallback, and release flushing; broader Adventure controls, Paperclips worker/client protocol, and live-proxy lifecycle hardening remain future work. - Gateway response writes now fail closed per connection: a send-side host
disconnect or invalid send byte count breaks the connection loop, then drops
backend-held session state and terminates any session-started Adventure
processes instead of aborting the gateway process. Direct Chat.send is no
longer called from the gateway process; it runs through the first
session-bound worker proxy. Adventure
status,look,inventory, boundedgo(direction), and boundedtake/drop/useitem methods now receive the same treatment; broader Adventure methods remain later. - Add resource limits, TLS/mTLS, renewal, revocation, and UI-composition surfaces.
Design Constraints
- Do not serialize local capOS cap IDs, cap-table slots, endpoint receiver selectors, endpoint generations, result-cap indexes, server cookies, or global session identifiers as portable authority.
- Do not treat password auth as the only remote path. The schema and docs must leave room for public key, OIDC, passkey/WebAuthn, mTLS, guest/anonymous, and service/workload admission.
- Keep the session-bound invocation invariant. Remote post-auth calls run under the remote session’s capOS worker context or an equivalent reviewed context.
- Keep default remote bundles narrower than operator shell bundles.
- Keep browser JavaScript and model providers away from raw capOS caps. Browser and agent paths use gateway-side tool/cap proxies.
- Keep the first CapSet UI distinct from WebShell. It can inspect and call currently implemented remote session capabilities without launching a shell, terminal emulator, shell-runner policy engine, or model agent.
- Treat raw
ProcessSpawnerand browser-held capOS capabilities as explicit non-goals for the remote UI path. A service-runner may hold launch authority inside capOS, but browser and webview code see only catalog entries, launch forms, service descriptors, view models, and typed results. - Service launch from the remote UI must go through a restricted,
session-bound launcher or broker service-runner profile. The browser must not
receive raw process handles, local cap ids, endpoint owner handles, or a raw
ProcessSpawner; it receives only view models, launch plans, service descriptors, and typed call forms/results. - Keep UI composition declarative and bounded. A capOS service may propose layout/theme/view updates only through an explicit UI capability; it cannot inject arbitrary JavaScript/CSS, spoof trusted chrome, or persist UI state without a settings/profile cap.
- Keep listener and transport authority scoped; no raw
NetworkManageror broadProcessSpawnerin the long-term gateway. - Preserve the error split: transport/CQE errors, capability infrastructure exceptions, and domain result unions remain distinct.
Related Proposal Updates
The planning update that introduced this backlog aligned these documents:
remote-session-capset-client-proposal.md: owning design.shell-proposal.md: remote clients are peer clients of broker-issued bundles, not shell transports.boot-to-shell-proposal.md: web/remote login feeds the same session manager and broker, and must support non-password admission.ssh-shell-proposal.md: SSH remains a terminal transport, while public-key auth records can also feed non-shell remote clients through a domain-separated protocol.user-identity-and-policy-proposal.md: broker bundles need a remote-client profile shape in addition to shell bundles.browser-capability-proposal.md,llm-and-agent-proposal.md, andinteractive-command-surface-proposal.md: UI composition, browser/agent front ends, and typed command surfaces remain capability-mediated rather than raw browser or shell authority.roadmap.mdanddocs/tasks/README.md: the old chat-only interop item is reframed as remote session CapSet interop without changing the selected threading milestone.
Grounding Files
Relevant design and research grounding:
docs/proposals/session-bound-invocation-context-proposal.mddocs/proposals/user-identity-and-policy-proposal.mddocs/proposals/boot-to-shell-proposal.mddocs/proposals/shell-proposal.mddocs/proposals/ssh-shell-proposal.mddocs/proposals/certificates-and-tls-proposal.mddocs/proposals/oidc-and-oauth2-proposal.mddocs/proposals/capos-service-proposal.mddocs/proposals/interactive-command-surface-proposal.mddocs/proposals/browser-capability-proposal.mddocs/proposals/llm-and-agent-proposal.mddocs/research/cloudflare-capnproto-workers.mddocs/research/spritely-captp-ocapn.md
Ordered Gates
Gate 0: Rename The Target
- Rename the planning target from chat interop to remote session CapSet interop while preserving the existing chat proof as a historical transport slice.
- Add docs that say the remote client is a regular host app and does not
use
capos-rt, the capOS ring page, or the local CapSet page. - Keep the existing
make run-capnp-chat-interoptarget until a successor proof exists; do not remove useful evidence.
Gate 1: Host Rust Cap’n Proto RPC Client
- Add a host-built Rust client crate or tool using generated schema
bindings. The first slice uses length-prefixed schema-framed Cap’n Proto
DTOs; standard
capnp-rpcremains open. - Keep the client library UI-neutral so it can back a CLI harness, a native GUI, or a Tauri backend without changing the capOS protocol.
- Connect through QEMU host forwarding to the capOS gateway.
- Verify schema version/interface ID mismatches fail with explicit diagnostics.
- Add a host-side transcript that records successful connect, bootstrap, session info, CapSet list, calls, denials, and logout.
Gate 1A: First Host UI Client
- Build a thin Tauri or trusted-local-web UI over
tools/remote-session-client, without changing the capOS gateway protocol. Prefer Tauri when the goal is a distributable desktop app whose Rust backend can hold the remote session; prefer a local web bridge when browser iteration speed matters more than app packaging. - Document and support the repo-local operator paths:
make runfor capOS/QEMU,cargo run --manifest-path tools/remote-session-client/Cargo.toml --target x86_64-unknown-linux-gnu --bin remote-session-client -- --host 127.0.0.1 --port <printed-port>for the CLI, andCAPOS_REMOTE_SESSION_PORT=<printed-port> make remote-session-uifor the trusted local web bridge. The Makefile target wraps the sameremote-session-uiRust backend and defaults tohttp://127.0.0.1:3337/. The Tauri wrapper layers over the same backend, not a separate authority model. - Add a bounded repo-local Tauri wrapper command:
CAPOS_REMOTE_SESSION_PORT=<printed-port> make remote-session-tauri. It checks Tauri CLI and Linux build prerequisites, includingxdoandopensslpkg-config modules, reports dependency/scaffold status, and either runs a deterministic wrapper check or launchescargo tauri devwhen requested. Missing prerequisites fail with explicit diagnostics and point operators back tomake remote-session-ui. - Add the actual repo-local Tauri wrapper over the existing backend. The
wrapper shares the same
tools/remote-session-clientbackend boundary by loading the loopbackremote-session-uisurface; webview code receives view models and user events, not replayable capOS handles. Distributable package bundling remains disabled until the sidecar/backend lifecycle is reviewed. - Add a policy-only Tauri wrapper preflight:
CAPOS_REMOTE_SESSION_TAURI_MODE=policy tools/remote-session-tauri.sh. The guardrail proves the current wrapper remains check/dev only:bundle.active=false, the TauridevUrland singlemainwindow URL stay pinned tohttp://127.0.0.1:3337, default permissions stay exactly["core:default"], and app-specificinvoke_handler,generate_handler,#[tauri::command], andtauri-plugin-*drift is rejected. This does not prove distributable packaging or desktop automation. - Keep capOS authority in the backend. Browser/webview JavaScript receives session summaries, auth-method descriptors, CapSet entries, capability call forms, transcript rows, and denial diagnostics, but no replayable capOS handles.
- Implement the first UI views for endpoint configuration, auth-method
inventory, password/anonymous login, session summary, CapSet list/get,
sessionInfo,systemMotd, denied-chat probe, logout, stale-call proof, and redacted transcript export. The first web bridge now uses a dedicated full-window sign-in view and authenticated SPA navigation so the common workflow is not a single technical page. - Implement selectable remote UI themes based on the committed concept
assets in
tools/remote-session-client/ui/assets/: a space login theme usingbg-space.2k.webpanddesign-mockup-space-login.webp, a mountain login theme usingbg-mountain.2k.webpanddesign-mockup-mountain-login.webp, a light login theme usingdesign-mockup-light-login.webp, and a hacker terminal theme usingdesign-mockup-operator-console.webp. The hacker theme should use a black/deep-teal background, phosphor-green monospace typography, thin terminal-grid borders, subdued binary side texture, bracketed primary action text, and a footer status line such as “Secure connection established” with a lock indicator, without keeping a persistent global header above the login or workspace views. Treat the mockups as visual references, not runtime screenshots. The implementation should expose a bounded theme selector in the trusted local web UI, persist the selected theme locally, keep browser JavaScript limited to UI state and backend view models, preserve the existing authenticated SPA workflow, and prove contrast, focus, small-screen layout, and screenshot coverage for every theme. The trusted web UI now serves only the committed theme assets by fixed name, stores theme choice in browser-local UI state, drives the selector in both login and workspace modes, and captures desktop plus mobile screenshots for both login and workspace views of each theme in the focused UI smoke. The login view is styled as a focused OS-style sign-in surface without a persistent header; endpoint configuration, auth method inventory, anonymous login, and theme choice remain accessible as compact secondary controls. - Ensure the UI discovers every granted remote CapSet entry in the default
make runoperator session and offers at least a generic/simple surface for each remote-proxyable service exposed by that bundle. Call forms are only for methods the current DTO/proxy path can actually invoke. The first endpoint-backed chat call is now callable through the session-bound worker proxy, and Adventurestatus,look,inventory, boundedgo(direction), and boundedtake/drop/useitem actions are callable after the Adventure service graph is launched. Game surfaces can start with a simple chat send/probe form, a generic Adventure panel when the service is callable remotely, and Paperclips status/command panels when its server capabilities are exposed. Rich game clients remain a later layer over those same capability bindings. The gateway now lists broker-held endpoint descriptors fromservice_endpoints, so operator sessions includesession,system_info,adventure, andchat; the focused QEMU proof asserts those CapSet entries and the web UI exposes them through CapSet and Services surfaces. - Add a task-oriented “Services” view for default-manifest operator
sessions: list broker/launcher-advertised runnable services, show which
grants are required, start allowed game server processes through a
remote-safe restricted launcher/service-runner API, and attach or retain
the returned exported descriptors/caps in the backend-held remote CapSet.
The first Adventure flow should be able to start
adventure-serverplus required NPC/server companion processes with their manifest-shaped grants, then show the resulting Adventure/chat descriptors through generic/simple panels. Chat method success now runs under the authenticated session through the first per-session worker proxy; direct Adventurestatus,look, andinventorynow have matching service-specific worker/client paths, and bounded mutablego(direction)plustake/drop/useitem paths use the same worker. Broader Adventure method success remains later work. The Paperclips flow may stay simple until the authoritative Paperclips server backlog lands, but the UI direction is server-owned game state and remotely callable game capabilities, not terminal text scraping. The web bridge refreshes CapSet, service catalog, and launcher catalog view models after a successfulserviceLaunchso the SPA reflects post-launch descriptors immediately; the focused UI fixture still treats missing Adventure binaries as an explicit planned/denied state. - Add browser/UI automation for the chosen client: start a gateway-only
QEMU fixture, such as
run-remote-session-capset-interop-vmwith explicit hostfwd/pid/log handling or a new focused UI fixture target, then drive login, CapSet inspection, capability calls, denials, logout, and transcript redaction, and capture screenshots or traces for review. Do not drive the UI againstmake run-remote-session-capset-interopbecause that wrapper starts the scripted CLI client and shuts QEMU down. - Keep WebShell-specific work out of this gate. No terminal emulator, shell process delegation, shell-runner policy, agent tool execution, or UI-composition cap is required for the first CapSet UI.
Gate 1B: Self-Served capOS Web UI
Gate 1A is host-served bridge work: make remote-session-ui serves the
browser UI from the trusted host Rust backend while capOS exposes the remote
CapSet gateway over QEMU host forwarding. Gate 1B adds the first self-served
capOS web UI proof: a capOS-side service serves the browser UI entry point and
same-origin backend path itself.
Task records:
remote-session-self-served-web-ui-designselected the capOS-side hosting boundary, listener authority, asset source, session/admission path, asset integrity/update story, and browser-safe view model boundary.remote-session-self-served-web-uiimplemented the first self-served proof with a focused immutable UI shell and browser automation against the capOS-served origin.remote-session-self-served-web-ui-default-runintegrated the self-served path into ordinarymake run. The default manifest now auto-startsremote-session-web-uiandmake runprintsremote self-served UI: tcp 127.0.0.1 <port> -> guest :8080. Completed 2026-05-14 09:07 UTC.remote-session-self-served-full-ui-bundlereplaces the immutable proof shell with the reviewed fixed-name boot-resource UI bundle. The capOS service now serves/,/app.js,/styles.css,/feature-flags.js,/themes/retro.css, the icon/background/logo assets,/ui-config.js, and/bundle/manifest.jsonfrom the capOS-owned origin with explicit content types, no directory traversal, and a build-time digest pinned indemos/remote-session-web-ui/ui-bundle.digest. The focused proof verifies every served asset byte-for-byte against the manifest and then drives the operator workspace views, logout, stale failure, transcript redaction, and system-manual view models.cloud-prod-remote-session-web-ui-l4-local-proofconsumed the landed Phase C userspace L4 and DHCP/IPv4 config proofs. It provesremote-session-web-uithrough the non-qemucloudboot socket path locally with the full fixed-name UI bundle, password login, backend-heldSystemInfo, logout/stale failure, manual viewer, and browser-boundary checks. Completed 2026-06-09 01:49 UTC (ff769a5c) as local QEMU/cloudboot evidence only; it does not claim private GCE reachability, public ingress, TLS, or production browser readiness.cloud-prod-network-stack-web-ui-slow-client-boundshardened the userspace network-stack server that backs the L4 Web UI listener (Review C medium: a single-writer accept loop and fatal recv/accept/send budgets let one idle or held-open unauthenticated client crash the network stack or block every other connection). The server now keeps a bounded multi-socket listen backlog, hands out only data-ready connections (idle held-open ones are left for the reaper), reaps idle/half-closed backlog connections after a short idle window, and treats every budget expiry as non-fatal (abandon the offending connection and re-arm instead of exiting).make run-cloud-prod-remote-session-web-ui-l4adds a slow-client bound proof in two phases: several idle held-open clients that send no request bytes (kept out of the serving path by the reaper) and one partial-request (Slowloris) client that sends incomplete headers then stalls (served, then abandoned when the recv budget expires). In both phases a concurrent/healthzkeeps completing and the server survives, and the kernel log shows the backlog config, idle reaping, and the recv-budget abandon. Serving is still serial, so a data-ready partial-request client adds a bounded head-of-line delay (one recv budget) to the next connection rather than blocking it indefinitely; that bound is the accepted limit for this research demo. This is the server-side prerequisite forremote-session-web-ui-connection-bounds, which layers per-connection deadlines in theremote-session-web-uiRPC client on top.remote-session-web-ui-connection-boundscompleted the client side of that boundary (Review C medium). Theremote-session-web-uiservice replaced its retry-count spin budgets with per-connection wall-clock deadlines on the monotonic clock: a request-read deadline (6 s, anchored at accept, covering request line, headers, and body together) and a response-send deadline (30 s, anchored conservatively at request dispatch, before routing), neither of which resets on byte progress, so total accept-loop occupancy per connection is bounded regardless of client pacing. Deadline expiry abandons only the offending connection fail-closed with an explicit console evidence line. This closes the case the server-side per-call budgets cannot see: a drip-feed client that delivers one header byte at a time keeps every server recv budget fresh while never completing the request.make run-cloud-prod-remote-session-web-ui-l4adds a third slow-client phase driving exactly that drip-feed client and asserts the web-ui abandons it at the read deadline and/healthzstill completes afterwards, alongside the existing held-open-vs-concurrent-/healthzand Slowloris phases. Connection admission limits (the bounded listen backlog and idle reaping, which cap all pre-login connections) remain server-owned in the network-stack listener layer.remote-session-web-ui-session-hardeningclosed Review C high (predictablecapos_remote_sessiontokens and missing browser-session enforcement). Theremote-session-web-uiservice now mints an unpredictable, opaque server-side session id (one-way SHA-256 over the kernel-CSPRNG backend session id, base64url, never the accept counter) and a domain-separated per-session double-submit CSRF token; rotates both on login, re-login, and logout (clearing the browser cookies and failing closed on a replayed rotated-out id); enforces idle and absolute lifetime bounds before request dispatch; validatesHost(DNS-rebinding) andOriginand requires theX-CSRF-Tokendouble-submit cookie/header on state-changing requests; and marks the session cookieSecurewhenX-Forwarded-Proto: httpsreports HTTPS ingress (the plaintext loopback proof stays explicitly non-Secure). This aligns the in-capOS server with the committed operator-bundle and host-bridge CSRF contract (tools/remote-session-client/{ui/app.js,src/web_security.rs}).make run-cloud-prod-remote-session-web-ui-l4extends the self-served proof with stale-token, CSRF (missing/mismatch), Origin (missing/cross-site), Host, and idle/absolute expiry denial paths plus a login/re-login rotation check, all failing closed before any backend-held capability call. Local QEMU/cloudboot evidence only; it does not claim private GCE reachability, public ingress, or TLS.- The public-ingress browser hardening set is done on the same
make run-cloud-prod-remote-session-web-ui-l4gate (all local QEMU/cloudboot evidence, no public exposure): in-guest login peer-gate and failure-backoff hardening, the single public-origin policy (one manifest-grantedpublic_origin.<host>marker fixes the only accepted public origin on the trusted forwarded-scheme HTTPS path), the IAP-aware SameSite cookie policy (Strict by default, Lax only under the manifest IAP marker with a cross-site GET provenance gate), the JSON content-type guard (typed 415 on every state-changing/api/*POST before backend dispatch), the security response headers and strict CSP (uniform header set plus a no-unsafe-inline CSP proved violation-free in a real Chromium), the GFE-range-pinned forwarded-scheme trust (X-Forwarded-Protoauthoritative only from130.211.0.0/22/35.191.0.0/16, implementing the firewall-bounded forwarded-scheme trust rule below), and the public/healthzhealth-check contract (bounded anonymous JSON body, no session state, Host-allowlist exempt for by-IP provider health checkers). - Two browser-boundary local proofs remain dispatchable task records under
docs/tasks/, not landed: the public-deployment loopback gate (reject loopbackHost/Origin/Refereracceptance and loopback-shaped source hints under the configured public-origin load-balancer posture while preserving the local QEMU loopback proof) and the consolidated browser-visible forbidden-marker matrix proof across success, denial, health, manual, and error response classes, including hostile browser-supplied authority fields. Both extendmake run-cloud-prod-remote-session-web-ui-l4locally and do not authorize private GCE reachability or public exposure. cloud-gce-legacy-virtio-webui-serving-local-proofclosed the legacy-virtio serving gap locally (2026-06-11): a persistent kernel-brokered legacy virtio 0.9 runtime backs the typedNiccap, andmake run-cloud-gce-legacy-virtio-webui-servingproves a host HTTP peer fetching the byte-verified UI bundle underdisable-modern=on. Local serving evidence for the GCE NIC shape only, not live GCE reachability.- The no-spend provider-harness gates are done as recording-stub fixture
evidence — provider CLIs resolve only to the stubs, with no real provider
invocation or mutation on any path: the private-proof harness
--preflight-onlymode, the private and public proof-evidence validators, the public ingress resource plan gate, the journal-driven teardown engine, and the provider-command allowlist gate. They bound the future private/public runs’ evidence, resource graph, teardown, and provider-command surfaces; they are not reachability, exposure, or spend authorization. A matching public-harness no-spend preflight task is dispatchable future work, not landed. cloud-gce-private-self-hosted-webui-prooffollows the local Web UI L4 proof and proves private GCE reachability over the live NIC without public IP or public firewall exposure. It remains on hold on missing firewall IAM against GCE default-deny ingress and on per-run billable authorization; the legacy-virtio serving gap is closed locally.cloud-gce-public-webui-ingress-tls-policy-designselected the public ingress, TLS/certificate, firewall, browser-session, and teardown policy before exposure work starts (see “Selected public ingress and TLS policy” below).cloud-gce-public-self-hosted-webui-ingress-tlsis blocked on the private proof and on explicit public-exposure approval. With the policy design closed, it is the first public operator-access step, builds against the selected provider-terminated-HTTPS policy, and does not permit raw public HTTP as the closeout proof. The local plan/teardown/evidence/allowlist gates above bound this future run without authorizing it.
IPv6 is a separate network-stack capability lane, not a Gate 1B blocker for the
first public Web UI proof. The IPv4 path above still owns the first useful GCE
Web UI closeout; the IPv6 scope decision
cloud-prod-ipv6-architecture-status-grounding
is done and the lane is tracked in
Hardware, Boot, and Storage.
The broader network usability lane is
Network Usability and Post-smoltcp:
DNS resolver, POSIX getaddrinfo, ping/ping6, packet tracing, socket
readiness, and transport policy are follow-on usability work. They do not block
Gate 1B or the first IPv4 public Web UI proof unless a later ingress policy
explicitly promotes one; the local DHCP/IPv4 configuration gate is done and now
feeds the Web UI L4 and private GCE proof gates.
Selected public ingress and TLS policy:
- The first public exposure of
remote-session-web-uion GCE terminates HTTPS at a GCP external Application Load Balancer (Google front end, provider-managed certificate). capOS serves only plain HTTP/1.1 on its UI backend port; the operator browser reaches the UI exclusively through the load balancer’s HTTPS origin, and capOS never holds the TLS private key. - This is the bootstrap shape chosen because capOS does not yet have TLS
termination and private-key custody. The Phase-1 certificate verifier has
landed, but
TlsServerConfig, key custody, and the userspace L4TcpSocketrelocation have not landed. The ACME/Let’s Encrypt path is now decomposed in Certificates / TLS as a capability-native successor: minimalPrivateKey/KeyVault/KeySourcecustody, TLS client/server support, RFC 8555 account/order, scopedhttp-01,CertificateStore.watchrenewal, and then a separate public GCE direct-termination proof with explicit public-ingress and CA authorization. That successor does not replace the provider-managed first public proof. - Raw public HTTP is rejected as closeout evidence; any port-80 listener is a 301 redirect to HTTPS at the load balancer and never reaches capOS.
- Browser session rules add a single public HTTPS origin, firewall-bounded trust
of the load balancer’s forwarded-scheme header,
Secure/HttpOnly/SameSitesession cookies, HSTS, anti-CSRF tokens with an origin check, bounded session/idle lifetime, and server-side logout — over the unchanged Gate 1B view-model boundary. - Firewall ingress to the UI backend port is restricted to Google
load-balancer/health-check ranges (
130.211.0.0/22,35.191.0.0/16) and, if IAP fronts the door, the IAP range (35.235.240.0/20); never0.0.0.0/0. - The full firewall, certificate-custody, evidence, and teardown policy lives in the “Public Web UI Ingress Policy” section of Cloud Deployment, and the TLS-termination/key-custody decision in the “Bootstrap TLS for the First Public GCE Web UI” section of Certificates and TLS.
Selected design:
- Add a capOS userspace service named
remote-session-web-uifor the first proof. It is a sibling ofremote-session-capset-gateway, not a replacement for the gateway and not the hostremote-session-uibridge running inside capOS. The service owns the web listener, static assets, authenticated web sessions, remote-session backend state, per-session worker proxies, and browser-facing view-model projection. - Static assets live as a checked-in, fixed-name UI bundle embedded in the
capOS boot package and served by
remote-session-web-ui. The service serves only fixed files,/bundle/manifest.json, and same-origin JSON API routes; it does not expose a general filesystem, asset directory traversal, host path, or development hot-reload surface. The full-bundle proof isremote-session-self-served-full-ui-bundle. - The first listener is HTTP/1.1 on a manifest-scoped
TcpListenAuthorityfor a dedicated UI port, for example guest port8080under QEMU host forwarding. The service serves staticGETassets and same-origin JSON API routes. WebSocket, server-sent events, and streaming terminal/media paths are later extensions that require separate per-route authority and resource bounds; the first self-served proof does not need them. - Manifest grants authorize the listener and backend work: scoped
TcpListenAuthorityfor the UI port,SessionManager,AuthorityBroker, a named immutable UI asset bundle, and only the same narrow remote-client service-runner/backend-launch authority already allowed for the remote session path. The service does not receive rawNetworkManager, rawTcpListenerfactories, broad storage roots, rawProcessSpawner, shell launcher authority, endpoint owner caps, or arbitrary endpoint creation authority. remote-session-web-uiis the trusted backend and holds the remote session CapSet/proxy state server-side. Browser JavaScript receives only browser-safe view models, launch forms, user-event commands, typed results, denial diagnostics, and redacted transcript rows. It never receives raw capOS caps, rawProcessSpawner, process handles, endpoint owner authority, local cap IDs, result-cap slots, session-global identifiers, remote CapSet handles, host usernames, host environment variables, host paths, or QEMU-forwarding identity hints.- Authentication remains gateway/session-manager shaped. The browser sends
credentials or guest/anonymous intent to the capOS-served JSON endpoint; the
service derives connection/source metadata from its accepted socket and its
own event id, asks
SessionManagerfor aUserSession, asksAuthorityBrokerfor the remote-client bundle, and projects only the disclosed session and service fields into browser-safe view models. The browser cannot choose a principal, profile, worker session context, or backend cap holder by replaying a request field. - Cloudboot-local authority inventory for the completed
cloud-prod-remote-session-web-ui-l4-local-proof: the non-qemuproof manifest grantsremote-session-web-uionlyconsole, a scoped UITcpListenAuthorityfor guest port8080served by the Phase C userspace network-stack path,SessionManager,AuthorityBroker, the read-onlymanualcap, thetimercap used by the HTTP/backend loop, and the fixed-name boot-resource UI bundle. It does not satisfy the UI listener from a kerneltcp_listen_authoritysource in the non-qemucloudboot path, and does not grant rawNetworkManager,TcpListener/TcpSocketfactories, broad storage roots, rawProcessSpawner, shell launcher authority, endpoint-owner caps, arbitrary endpoint creation authority, host filesystem paths, or provider/cloud mutation authority. Backend launch/service-runner authority remains available only through the same broker-approved remote-client bundle policy described above. - The local cloudboot proof should assert the same browser boundary as the
self-served QEMU proof while proving the different listener substrate:
browser-visible envelopes, DOM state, diagnostics, transcripts, and JSON
responses must not contain raw capOS caps, raw process authority,
endpoint-owner authority, local cap ids, result-cap slots,
NetworkManager,TcpListenAuthority,TcpListener,TcpSocket, host usernames, host environment variables, host paths, QEMU-forwarding identity hints, provider resource identifiers, public IPs, firewall rules, or TLS key material. Login/source metadata must come from the accepted socket plus a service-generated event id; browser requests cannot supply the trusted principal, profile, source address, worker-session context, or backend cap holder. - Expected local cloudboot proof markers are the existing service-side lines
that show the narrow service capset, scoped listener, fixed-name bundle,
backend-held login/session, backend-held
SystemInfocall, browser-safe workspace view models, redacted transcript, backend-held manual view-model projection, and stale-call failure, followed by exactly onecloudboot-evidence: remote-session-web-ui-l4 <token>marker after all forbidden-authority and browser-visible marker checks pass. That marker is local QEMU/cloudboot evidence only; it does not prove private GCE reachability, public ingress, HTTPS/TLS custody, firewall policy, or browser production readiness. - Proof marker triage:
| Missing or failed marker class | Likely failed invariant | Owning lane | Blocks local Web UI L4 proof? |
|---|---|---|---|
Narrow service capset, scoped UI listener, or trusted listener/source metadata is absent, or the listener is satisfied by the non-cloudboot qemu kernel socket path | remote-session-web-ui is not bound to the manifest-scoped TcpListenAuthority served by the Phase C userspace network-stack path | Listener substrate | Yes. The local L4 proof cannot close without the non-qemu cloudboot listener source. |
Fixed-name bundle, byte-for-byte asset, content-type, /ui-config.js, or /bundle/manifest.json marker is absent or mismatched | The capOS-served origin is not serving the reviewed immutable boot-resource UI bundle | Fixed-bundle serving | Yes. A health-only service marker is not a self-served Web UI proof. |
Backend-held login/session, SystemInfo, manual view-model, or workspace view-model marker is absent, or a browser request supplies trusted principal/source/backend holder fields | The service is not deriving authority from server-side session state and broker-approved backend caps | Authenticated backend call | Yes. The proof must exercise at least one backend-held cap path after login. |
| Logout/stale-call failure marker is absent, stale requests keep dispatching, or result-cap/session table identifiers leak into client-visible state | Backend session teardown does not fail closed before later public or provider promotion | Stale/logout failure | Yes. The first local L4 proof needs the stale-call denial; later session-hardening work may add stricter lifetime controls. |
| Browser-visible envelopes, DOM, diagnostics, transcripts, or JSON contain raw caps, cap ids, process/socket/network authority, host identity, provider resource ids, public IPs, firewall rules, or TLS material | The browser-safe view-model boundary leaked trusted authority or out-of-scope provider/exposure state | Browser-visible forbidden marker leak | Yes for local-service leaks. Provider, public-ingress, and TLS material also route to their later proof lanes before promotion. |
All service-side markers pass but the final cloudboot-evidence: remote-session-web-ui-l4 <token> marker is missing, duplicated, or emitted before forbidden-authority checks finish | The harness has not produced a single closeout marker tied to the completed local cloudboot proof | Evidence-class boundary | Yes. The local proof is incomplete without exactly one final local L4 marker. |
| Private GCE probe, public HTTPS, DNS, certificate, firewall, load-balancer, or operator-exposure markers are absent | The run did not attempt a later evidence class, or correctly kept provider/public exposure out of the local proof | Evidence-class boundary | No. Those belong to cloud-gce-private-self-hosted-webui-proof or the on-hold public ingress/TLS task, not the local L4 closeout. |
- The first implementation gate was
remote-session-self-served-web-ui: boot a focused manifest, load the UI from the capOS-owned HTTP endpoint, log in, exercise at least one granted capability call through the service-held backend state, prove logout/stale failure remains closed, and run browser automation against that capOS-served origin. That pre-Phase-C target used the qemu-only kerneltcp_listen_authoritysocket owner and is no longer current selected- milestone evidence after the kernel L4 owner was retired. The replacement gate ismake run-cloud-prod-remote-session-web-ui-l4, owned bycloud-prod-remote-session-web-ui-l4-local-proof. - Validation targets:
make run-cloud-prod-remote-session-web-ui-l4clearly distinguishes the self-served origin from the host development bridge and asserts forbidden browser-visible markers are absent. The currentmake remote-session-uibridge remains a development tool, andmake run-remote-session-capset-uikeeps its existing host-bridge smoke coverage while the self-served path evolves. Ordinarymake runremains a remote CapSet forwarding path, not a self-served UI proof, unless the default-run integration task closes with reviewed manifest, forwarding, and operator-instruction changes. - Rollback path: remove the self-served focused manifest/target and stop
granting
remote-session-web-uiits UITcpListenAuthorityand asset bundle, while leaving the host-servedmake remote-session-uipath and the remote-session CapSet gateway unchanged. Because the static assets are boot-package resources and the listener is manifest-granted, rollback is a manifest/build-target selection change rather than a downgrade of the gateway authority model.
Acceptance for the implementation gate:
- The browser retrieves UI assets or the UI backend entry point from a
capOS-owned service path, not from the host
remote-session-uidevelopment bridge. - Browser JavaScript receives browser-safe view models and user-event
commands only; raw caps, raw
ProcessSpawner, endpoint owner authority, result-cap slots, and host-local identity hints stay out of browser-visible state. - The proof uses browser automation against the self-served path and exercises login plus at least one granted capability call.
Gate 2: Gateway Bootstrap And Auth Method Inventory
- Add
RemoteSessionGateway.authMethodsand a policy-shaped method list. - Support explicit denial for disabled methods so the harness can prove password-only assumptions are not baked into the protocol.
- Record gateway-derived source metadata, method kind, requested profile, and protocol binding in audit-shaped output.
- Keep first-remote-client setup disabled unless a manifest explicitly grants a local setup authority path.
Gate 3: First Auth Adapter
- Choose one bounded first adapter for the proof. Acceptable first choices
are public-key fixture auth, password via existing
SessionManager.loginunder explicit policy, or guest/anonymous admission under a narrow profile. Do not design the schema as password-only. - Map the accepted proof into
SessionManagerand mint a realUserSession. - Add Rust-level backend/account-store proof coverage that disabled,
locked, and recovery-only accounts, unknown principals, and missing or
retired resource profiles cannot yield remote-client bundle plans, and
that SessionManager password-account selection rejects unknown or
inactive account records before a
UserSessioncan be minted. - Prove failed proof, wrong requested profile, and unknown principal in the live host/QEMU remote-gateway path before the broker returns a CapSet. The proof also covers anonymous profile mismatch and asserts denied re-login clears previous per-connection/session view state.
Gate 4: Broker Remote Bundle
- Add an
AuthorityBrokerpath for remote-client bundles, or a temporary clearly named wrapper around the existing shell bundle that does not imply terminal authority. - Bundle at least
sessionandsystemInfo; add one demo service cap such aschatorpaperclipsfor behavior proof. - Add a remote-client bundle shape that preserves the useful
default-operator service surface without becoming an operator shell
bundle. It should include a restricted launcher/service-runner descriptor
for allowed service binaries, broker-held or remote-proxyable service
endpoints such as
chatandadventure, and enough metadata for the UI to construct launch plans for server processes. It must not grant a raw shell launcher, terminal authority, rawProcessSpawner, raw network factories, or endpoint owner authority to browser code. - Ensure anonymous/guest/default remote bundles do not receive operator shell launcher or broad service endpoints unless policy explicitly grants them.
- Add wrong-name and wrong-interface tests for
RemoteCapSet.get.
Gate 4A: Remote Service Catalog, Launch DTO, Adventure Launch, And Game Server Caps
- Define a remote service catalog DTO or capnp-rpc object. It should list policy-approved service profiles, runnable binaries, companion processes, required grant names/interfaces/transfer modes, exported capability descriptors, attach/start/stop policy, and whether each grant is backend-held, service-owned, or a client facet. The current DTO catalog describes available DTO services plus Adventure/Paperclips launch profiles. Adventure start/attach is the current restricted-runner slice; Paperclips attach/start/stop policy remains future runner work.
- Define the restricted service-runner launch request/status/probe DTO
shape: submit a catalog profile plus explicit named grants, then return
side-effect-free support state, accepted grant names, a message, and
planned remote descriptors for exported or broker-held capabilities.
This slice intentionally does not start processes, create endpoint
owners, attach returned caps, or expose raw
ProcessSpawner, process owner handles, endpoint owner caps, local cap IDs, result-cap slots, or browser-held capOS caps. - Implement the actual restricted service-runner behind the
serviceLaunchcontract for Adventure in the defaultmake runmanifest. The service runner may use local spawn authority internally, but the remote/browser-facing contract must still expose only launch request/status DTOs and remote capability descriptors, never raw spawn authority or local handles. - Implement the first game-server flow for Adventure. The
backend should use the remote session’s restricted launcher/service-runner
to start
adventure-serverand simple NPC companion processes with the remote-safe endpoint grant shape: the Adventure endpoint owner and Chat client facet are passed to child processes, while the gateway’s system Console cap is not regranted across the operator-session boundary. The backend then attaches or retains backend-held Adventure and chat-facing service descriptors/caps. Chat now uses a per-session worker endpoint proxy forChat.send; Adventurestatus,look,inventory,go(direction), and boundedtake/drop/useitem calls use the same pattern after launch. Broader Adventure endpoint calls and rich client controls remain later. - Implement the Paperclips direction as soon as the server-owned Paperclips server profile is available in the remote catalog: start or attach to the authoritative Paperclips server, read structured status/project/command descriptors, and submit commands through server-owned capabilities. Until then, the UI may show Paperclips as “terminal-only/not remote-proxyable yet” rather than scraping terminal text.
- Prove launch denials are explicit: disallowed binaries, missing required
grants, wrong-interface grants, stale sessions, and anonymous/guest
profiles without service-runner authority all fail before any process is
started or any returned cap is exposed. The live remote-gateway proof
covers stale sessions and anonymous/no-runner sessions in the CLI QEMU
path; guest admission now has a dedicated
RemoteGatewayRequest.guestLoginarm, and guest sessions go through the broker/account-store remote-client bundle policy with the same no-runner constraint. - Prove process handles and endpoint owner caps stay backend-local or are withheld entirely from the browser. Browser-visible state is limited to launch status, service descriptors, command/status view models, denial diagnostics, and redacted transcript rows. CLI and UI smoke checks reject raw authority markers in transcripts, reports, and API envelopes.
- Add a focused guest remote-gateway login proof once the wire protocol and
gateway expose a concrete guest auth adapter, then repeat the same
no-runner
serviceLaunchdenial assertions for guest sessions. Landed2026-05-08 03:59 UTC. The QEMU interop harness ships aguest admission happy proof(manifest seeds a guest principal, gateway accepts therequestedProfile = "guest"request) and anguest launch-denial proof(successfully admitted guest sessions repeat the service-launch denial matrix; in the Adventure interop manifest this proves the guest bundle still lacks service-runner authority even when the operator path can launch) plus anauth denial guest profile mismatch proof(gateway refusesrequestedProfile = "operator"through the guest method with the redacted"guest login denied"message). The bridge host-tests additionally pin theRemoteErrorCode::DisabledAuthMethoddenial that fires when the manifest has no guest seed.
Gate 5: Per-Session Worker And Proxy Lifetime
- Host the first post-auth endpoint-backed remote cap,
Chat.send, in a per-session worker/proxy context instead of calling it from the gateway process. - Associate the first chat proxied calls with the live remote session context; the focused QEMU proof shows the spawned chat worker running with the operator session context.
- Drop/release the chat worker holds when logout is called, the connection closes, or the worker exits; teardown now asks the worker to shut down through its control endpoint and falls back to termination only if that path fails.
- Generalize the worker/proxy lifecycle infrastructure for the currently
supported endpoint-backed calls. Chat
sendand Adventurestatus/look/inventorynow share worker spawn validation, exactly-one parent control endpoint validation, graceful shutdown, forced termination fallback, logout/close teardown, and release flushing. - Add the first richer Adventure worker/client protocol slice on top of the
shared lifecycle manager: read-only
Adventure.lookandAdventure.inventorynow share the same per-session Adventure worker asAdventure.status. - Add the first service-specific mutable Adventure worker/client protocol
slice: bounded
Adventure.go(direction)now runs through the same per-session Adventure worker and returns bounded movement text plus room state. - Add the first item-oriented Adventure worker/client protocol slice:
bounded
Adventure.take(item),Adventure.drop(item), andAdventure.use(item)run through the same per-session Adventure worker, validate transcript-safe item tokens, and return bounded text or room state to the CLI and web bridge. - Add service-specific worker/client protocol slices for broader mutable Adventure calls and future Paperclips service calls on top of the shared lifecycle manager.
- Treat send-side disconnects while replying as connection close, then release gateway-held state through the existing per-connection teardown path instead of failing the whole gateway process.
- Prove stale proxy calls after logout/disconnect fail closed.
Host-client/backend coverage now includes pre-session bootstrap reset and
zero-byte read-timeout retry, repeated DTO calls, repeated post-logout
stale-call probes, authenticated gateway close during a call, and oversized
gateway response frames. The scripted CLI retries authMethods connection
resets before login so QEMU host-forwarding races do not look like real session
loss. The trusted web backend also retries a pre-session authMethods
bootstrap disconnect or no-byte read timeout before any auth inventory or
session state exists, clears backend-held session state for
disconnect/oversized response failures, and returns user-facing
gatewayDisconnected / reconnectRequired guidance without exposing raw frame
errors to browser JavaScript. Kernel deferred TCP recv waiters now fail
closed with an error CQE on terminal runtime/transport errors instead of
dropping the pending call without completion; WouldBlock still requeues, and
socket close still returns zero-byte EOF. The gateway now uses a connection
frame-read wait instead of the short service-call wait, so an idle TCP remote
session remains open past the former five-second read window and tears down
only when the peer closes or the transport actually fails.
Gate 6: Capability Calls Beyond Chat
- Call at least two granted capabilities through generated host bindings.
The current proof covers
UserSession.info/session,SystemInfo.motd/system_info, the first worker-backedChat.send, and the worker-backed Adventure methods,Adventure.status,Adventure.look,Adventure.inventory, mutableAdventure.go(direction), and bounded item controlsAdventure.take/Adventure.drop/Adventure.use. Broader Adventure controls andPaperclipsGame.statuswait for later service-specific proxy/client gates. - Prove a service-specific domain denial remains a schema result rather
than a transport failure. The focused chat proof asks the per-session
worker to call
Chat.sendwithout first joining the proof channel and requireschatSent(false)in the CLI/UI API smokes, notRemoteErroror a gateway disconnect. - Prove target service sees session-bound caller metadata rather than a
caller-selected identity field. The remote-client chat facet now grants
only the existing bounded disclosure fields to the per-session worker,
the worker explicitly requests those fields on
Chat.join/Chat.send, andchat-serverlogs a target-service proof only after it sees a live opaque caller-session reference with operator principal class, password auth strength, and operator profile class. Browser/client-visible DTOs still do not expose raw scoped refs, local cap handles, or process handles.
Gate 7: Transport Security And Non-Password Auth Expansion
- Add capOS-terminated TLS server config once certificate/TLS primitives exist. Until then, the first public Web UI ingress terminates HTTPS at the provider load balancer (see “Selected public ingress and TLS policy” under Gate 1B); this checklist item is the capability-native successor, not the first public proof.
- Add mTLS client identity admission when certificate policy and account bindings exist.
- Add public-key auth with protocol-domain-separated challenge bytes.
- Add OIDC device-code and browser-assisted PKCE flows when OAuth/OIDC token capabilities exist.
- Add passkey/WebAuthn through the web gateway path when authenticator primitives exist.
- Add service/workload credential admission for non-human automation.
Gate 8: Renewal, Revocation, And Resource Bounds
- Wire kernel-backed
UserSession.logoutand gateway/connection close propagation for the current DTO remote-session gateway. - Reject already-admitted endpoint returns after caller logout/session death before result bytes, exception payloads, or result caps are installed in the stale caller.
- Extend logout/revocation cleanup to live remote proxy objects once standard RPC framing lands.
- Add renewal only through a narrow session-manager/broker path that does not revive stale ordinary grants by accident.
- Add resource limits for connections, remote refs, in-flight calls,
queued promises, result sizes, and per-session CPU/memory/network
accounting. Initial four classes landed
2026-05-03 16:21 UTC: transcript ring (6d855c01), backend cap-holders + catalog mirrors (5ec0e456), outstanding worker calls per session (0f82528c), and gateway concurrent logins per principal (99955d59). Bound choices and the exhaustion-as-typed-denial contract are documented in the proposal’s “Resource and revocation bounds” section. Per-session CPU/memory/network accounting and remote-ref limits remain future work tied to thecapnp-rpcrewrite. - Add explicit
CapException/RPC exception tests for the currently representable Gate 8 failure classes: transport breakage, worker/proxy failure, stale sessions after logout, and oversized messages. Host coverage now checks that the backend-onlycapnp-rpcChat facade maps DTO transport breakage tocapnp::ErrorKind::Disconnected, maps DTO denials and unexpected worker/proxy responses toFailedCapException-like errors, and does not expose raw proxy positions, local cap ids, result-cap labels, session ids, or socket hints in exception text. The trusted web bridge coverage now drives worker-targetedChat.senddisconnect, oversized worker response, and post-logout stale-session paths; each fails closed asgatewayDisconnectedorstaleSession, decrements outstanding worker-call accounting, clears or preserves backend state according to the existing lifetime contract, and keeps redacted transcript export free of raw socket errors, frame sizes, local cap ids, proxy positions, raw session id hex, passwords, and host endpoint hints. Revoked-lease coverage remains blocked rather than faked: the current DTO surface has lease timestamps inRemoteCapEntry, but no explicit revoke/lease-expired request path orRemoteErrorCodevariant that can distinguish a revoked lease from the existingstaleSession/methodDenieddenials. Add the revoked-lease proof when the standard RPC object lifetime path or a reviewed DTO denial code makes it observable.
Gate 9: Bidirectional UI Composition
- Keep this separate from Gate 1A. Gate 1A is a host-rendered UI over the existing client; Gate 9 lets capOS-side services propose bounded UI surfaces back to that host UI through explicit capabilities.
- Add a proposal-level
RemoteUiHost/RemoteUiSurfaceschema slice or equivalent typed DTOs for declarative UI patches and typed user events. - Keep the first UI proof behind a separate granted UI-surface cap, not
implicit in
RemoteSessionorRemoteCapSet. - Prove a capOS service can open/update one bounded surface and receive one typed user event from the host UI.
- Prove the same service cannot spoof login/permission chrome, inject raw JavaScript/CSS, persist layout or theme state, or exceed update/size quotas without explicit authority.
- Add a host-app reset/close path that releases the UI surface and leaves underlying service caps intact.
Verification Targets
Initial documentation/planning check:
make docs
git diff --check
First implementation check:
cargo test --manifest-path tools/remote-session-client/Cargo.toml --target x86_64-unknown-linux-gnu
make run-remote-session-capset-interop
make run-remote-session-adventure-interop
make run-capnp-chat-interop
Security review checklist:
- Remote client cannot obtain authority by guessing a cap name.
- Remote client cannot replay a session or grant identifier on another connection.
- Remote client cannot ask for a local cap slot, endpoint selector, or receiver metadata.
- Logout/close/revocation tears down all session-bound proxies.
- Guest/anonymous profiles receive only explicitly policy-granted caps.
- Browser/agent paths never receive raw capOS capability objects client-side.
- GUI/Tauri/web front ends keep capOS caps in the Rust/backend/gateway side of the trust boundary; UI code receives typed view models, command descriptors, or tool requests.
- UI composition is capability-gated, declarative, quota-bound, and reversible by the user.
capOS SDK Crate And Dual Transport Backlog
Detailed decomposition for the front-door capos SDK crate: one published
crate whose typed capability clients run unchanged against two transports –
the in-process capability ring (an application running inside capOS) and a
remote connection (a host-side RPC client). This extends the
Crate publication roadmap track with a concrete architecture
and publication ordering, and consumes the remote transport planned in
Remote Session CapSet Client. docs/tasks/README.md
should point here when selecting slices; it should not inline the details.
Visible Outcome
- A Rust author writes against typed capability clients (
Console,Timer,EntropySource,VirtualMemory, and future caps) once.cargo add caposbuilds an in-systemno_stdapplication that reaches the kernel through the ring.cargo add capos --no-default-features --features remotebuilds a host program that reaches a capOS instance through the remote session transport, using the same client API. - The crates.io names
capos(front-door facade) and thecapos-*family (capos-abi,capos-lib,capos-config,capos-rt) are published with real content, stable versioning, rendered docs, and license/metadata.
Why This Is High Priority
Two reasons, one architectural and one external:
- Architecture. The Cap’n Proto-first design already treats each process as a capnp-rpc vat and the per-process ring as that vat’s connection to the kernel (design principle 5). “In-system app” and “remote client” differ only in the transport under the typed clients, so a single SDK with a transport seam is the natural shape rather than two parallel client stacks.
- Namespace contention. The
capos/capos-*crate names overlap with an unrelated capability-OS effort already publishing under the same prefix on crates.io. crates.io is a flat, first-come namespace, so publishing the real reusable crates early (and reserving the barecaposfacade) both prevents name contention and establishes a dated public-use record. Publish crates with real content; do not register empty placeholder names, which the crates.io policy can reclaim.
Transport Seam
The seam is a Transport trait (working name) that the typed capability
clients depend on instead of the concrete ring. It must express the existing
ring opcode semantics faithfully rather than collapse them:
call(cap, interface_id, method_id, request_bytes) -> CallHandle– maps to the ringCALLSQE.- completion retrieval (
poll/wait) – maps to consuming aCQEaftercap_enter. release(cap)– maps to the localRELEASESQE.- server-side
recv/return_(call, response_bytes, result_caps)– maps toRECV/RETURNfor endpoint-owning servers. The first SDK slice may scope to the client side (CALL/RELEASE+ completions) and add the server side when an endpoint-owning userspace service consumes the SDK.
Two implementations:
RingTransport(no_std, defaultringfeature): wraps the existing capos-rt single-owner ring client andcap_enter. This is current behavior moved behind the trait, not new behavior.RemoteTransport(std,remotefeature): connects over the remote session transport, authenticates throughSessionManager/AuthorityBroker, holds a forwardedRemoteCapSet, and dispatches typed calls over the same length-prefixed DTO gateway used bytools/remote-session-client/.
Crate Layout
| crate | role | std/no_std |
|---|---|---|
capos | front-door facade: prelude, re-exports runtime + typed clients, transport selection by feature | no_std core; std only behind remote |
capos-rt | ring runtime (_start, syscalls, ring client); provides RingTransport | no_std |
capos-abi | ABI/policy constants | no_std |
capos-lib | host-testable pure logic (ELF, CapTable, ring/SQE validation) | no_std + alloc |
capos-config | manifest/CUE loader, ring structs | no_std + alloc |
Feature flags on capos: ring (default, in-system, no_std), remote
(host client, pulls std + the remote transport deps). The shared core –
typed capability clients plus capnp encode/decode (capnp 0.25 is
no_std + alloc) – must stay no_std; std is confined to the remote
transport. Open decision: whether RemoteTransport lives in capos behind
remote or in a separate capos-remote crate the facade re-exports.
Honesty / Caveats
- The remote transport is transitional. The remote session path is
currently length-prefixed schema-framed DTOs, not standard
capnp-rpcwith live object proxies; the rewrite is gated on a reviewed capOS userspace async runtime or a sync-friendly Cap’n Proto RPC adapter (see Remote Session CapSet Client, Gate 1 / Task 1). So the firstremoteform proxies through the trusted host backend boundary; it is not arbitrary remote capability invocation with promise pipelining. Do not document it as live guest-wirecapnp-rpc. capnp-rpc0.25 isstd-only and needs a futures executor, which is why theremotetransport isstd-only and host-side by necessity. This matches the in-system/no_stdversus host/stdsplit rather than fighting it.- The trust models differ but the client API does not: in-system the kernel hands an unforgeable bootstrap CapSet; remote, the client only ever sees caps explicitly forwarded into its authenticated session. Keep the API identical and the authority boundary explicit.
Publication Decision
Decision recorded 2026-05-22 23:41 UTC: the SDK track publishes real crates
only, never empty placeholder packages. This follows the
Cargo publishing model
where crate names are first-come, first-served, published versions are
permanent, and publishers should fill out license, repository, readme,
description, keyword, and category metadata before upload. It also follows the
crates.io usage policy
against packages that exist only to reserve a name for a prolonged period
without genuine functionality or development activity. The exact planned
crates.io names (capos, capos-abi, capos-capnp-build, capos-config,
capos-lib, and capos-rt) were not present in the crates.io API or sparse
index when checked on 2026-05-22 23:39 UTC, and were re-confirmed unclaimed on
2026-06-02 16:10 UTC; the adjacent capos-bitstruct crate exists under the unrelated
cap-os/rust-tools repository and is the visible namespace contention signal.
libcapos and libcapos-posix are not crates.io crates (they ship as
release artifacts, see item 7), so their names are deliberately left unclaimed
on crates.io – an accepted residual risk. Re-check the registry immediately
before any real publish.
The publish set and order are:
capos-abi0.1.0– sharedno_stdABI and policy constants.capos-capnp-build0.1.0– build-time schema generation helper used bycapos-config; it is a real package becausecapos-configcannot publish with an unpublished path-only build dependency.capos-config0.1.0– manifest/config/ring structs and generated schema module built from packaged schema source; depends oncapos-abiandcapos-capnp-build.capos-lib0.1.0– reusableno_std + allocpure logic; depends oncapos-abiandcapos-config.capos-rt0.1.0– in-systemno_stdruntime and ring transport, published with the bare facade slice after the transport seam lands.capos0.1.0– front-door facade withringas the defaultno_stdfeature andremotereserved as astdfeature until the remote transport slice closes.libcapos/libcapos-posix– C-substrate distribution. Decision2026-06-02 16:10 UTC: these ship as release artifacts only (prebuiltlibcapos.a/libcapos_posix.aplus thecapos/*.hheaders attached to a GitHub release/tarball), not crates.io crates, because their consumers are C programs that link the archive, not Rust crates run throughcargo add, and they build only for the customx86_64-unknown-capostarget. The artifact build is bundled into the same explicit operator publish wave as the Rust crates above (built viamake libcapos/make libcapos-posix, verified through the existing C smokes). Their crates.io names are intentionally not reserved (accepted residual risk).
Do not reserve capos-remote yet. If slice 4a proves the remote backend should
live outside the facade crate, publish capos-remote with real host transport
code in that slice.
The MSRV target for Rust crates was 1.85.0 (the Rust 2024 edition floor) until
the first-public-release slice verified the package set against stable. The
verified MSRV is stable Rust 1.88.0: capos-config uses let chains in
if/while conditions (&& let Some(..) = ..), stabilized in 1.88.0, so the
set does not build on 1.85.0. rust-version = "1.88.0" is set on all four
published crates. The current repository still builds the OS with the
date-pinned nightly-2026-04-20 toolchain. capos-rt must document the capOS
target toolchain requirement separately if its package cannot be built on stable
for the custom userspace target.
The license gate is satisfied: the repository carries LICENSE-APACHE,
LICENSE-MIT, and LICENSING.md, and per LICENSING.md the published SDK
crates use MIT OR Apache-2.0 (the kernel/system is Apache-2.0-only and is not
in the publish set). The publish metadata uses that SPDX expression in the
license field of capos-abi, capos-capnp-build, capos-config,
capos-lib, capos-rt, and capos. Each carries
repository/readme/description/keywords/categories metadata and docs.rs settings.
Generated Cap’n Proto bindings do not ship as a separate published crate in the
first release set. capos-config ships the schema source needed to build its
own generated module, and capos-capnp-build remains the single no-std patch
helper for that generation path. The publish slice must make the
capos-config package self-contained instead of relying on repository-sibling
paths such as ../schema/capos.capnp. A separate generated-bindings crate is
deferred until an external consumer needs schema bindings without the
manifest/config crate.
Versioning Policy
The published crates – capos, capos-rt, capos-abi, capos-lib,
capos-config, and capos-capnp-build – follow one policy:
- Pre-1.0 SemVer. The set starts at
0.1.0. Per Cargo’s SemVer rules a0.y.zrelease treats the minor field as the breaking-change field: a breaking API/ABI change bumps the minor (0.1 -> 0.2); a backward-compatible addition or fix bumps the patch (0.1.0 -> 0.1.1). Treat the whole0.xseries as unstable and do not promise API stability until a1.0.0is deliberately cut. - Schema/ABI contract maps to a breaking bump. These crates encode the
capOS wire contract:
capos-configcarries the generated Cap’n Proto bindings and the ringCapSqe/CapCqestructs,capos-abicarries the policy/quota constants, and the typed clients incapos-rt/caposencode the SQE CALL/RELEASE wire format. A change toschema/capos.capnp, the generated bindings, the ring wire layout, or a typed-client method’s wire encoding is a breaking change and bumps the minor field (pre-1.0). The SDK is a schema consumer; the schema change lands under its owning plan, and this set’s version bump follows it. - Lockstep versioning across the set. Because the crates share the wire
contract and depend on each other by exact path/version, publish them at the
same version and bump them together rather than independently. A consumer
pinning
capos = "0.1"then gets a coherentcapos-rt/capos-config/capos-abigraph. Independent per-crate versioning is deferred until a crate has external consumers and a stability story of its own; revisit at1.0. - MSRV is stable Rust
1.88.0. Verified by the slice-2 publish dry-run:capos-configusesletchains (&& let Some(..) = ..) stabilized in1.88.0, so the set does not build on the Rust 2024 edition floor1.85.0.rust-version = "1.88.0"is set on the four published workspace crates. The OS itself still builds only on the date-pinned nightly; the MSRV applies to the host-target build of the reusable crates, not the kernel.
Publish Dry-Run Gate
make sdk-publish-dry-run is the repeatable, one-command reproduction of the
slice-2 publish verification. It runs, on the host target:
- A coordinated multi-package
cargo publish --dry-runovercapos-abi -> capos-capnp-build -> capos-config -> capos-libin dependency order. Cargo packages each crate, unpacks the earlier ones into a temporary registry, and verify-builds the later ones against them, so it fails loudly if publish metadata, packaging (including thecapos-configpackagedschema/capos.capnp), or the dependency order regresses. Coordinated multi-package dry-run is nightly-only, so this step runs on the repo-pinned nightly. - An MSRV-floor
cargo +1.88.0 buildof the same set, catching a regression that would need a newer toolchain than the recorded MSRV (the nightly dry-run alone would not).
The gate is prep-only: every dry-run upload aborts, and no real
cargo publish runs. Like dependency-policy-check, it is a focused target,
not part of make check, because the verify-builds are slow.
capos-rt and capos are part of the publish set but are not in this host
gate: they build only on the custom userspace target/code model, so a
host-target cargo publish --dry-run verify-build does not apply. The release
gate verifies their capOS-target builds via make capos-sdk-check rather than
a host dry-run. The initial publication used the local Cargo API-token path
after the final crates.io name re-check; subsequent releases can use
.github/workflows/publish-crates.yml after each crate’s trusted publisher is
configured on crates.io for this repository, workflow, refs/heads/main, and
the crates-io-publish GitHub environment.
Ordered Slices
The near-term high-priority slices (1-3, 5) do not depend on the capnp-rpc
transport rewrite and have landed. Slice 4 is split: slice 4a (a
transitional RemoteTransport over the existing host DTO backend) can ship now,
while slice 4b (the live-proxy capnp-rpc upgrade) is gated on the
remote-session async-runtime decision.
- Publish-set + reservation decision (
docs-status, closed2026-05-22 23:41 UTC). The decision above pins the publish order, exact crate names, MSRV target, feature-flag story, license gate, metadata requirements, and generated-binding packaging decision. - First public release of existing layers (
behavior, prep landed2026-05-23;0.1.0published2026-06-05). Publish metadata (description,MIT OR Apache-2.0license, repository, keywords, categories,rust-version = 1.88.0, README, docs.rs config) added tocapos-abi,capos-capnp-build,capos-config, andcapos-lib.capos-configis now self-contained: it shipsschema/capos.capnp(an in-repo symlink to the single repo-root schema, materialized into the package archive) andcapos-capnp-buildresolves it from the crate’s own manifest dir viagenerate_packaged_schema_bindings(). Verified with a coordinatedcargo publish --dry-runof the set in dependency order plus a stable-1.88.0build and a local docs render. The initial0.1.0versions were published fromorigin/mainon2026-06-05through the local Cargo API-token path after the final crates.io name re-check. Thecapos-configdocs.rs build is accommodated by a packaged generated-binding fallback used only whenDOCS_RSis set, so the docs.rs sandbox no longer needs the external pinnedcapnpbinary. The repository-URL rewrite is no longer a blocker: decision2026-06-02 16:10 UTCkeepsrepository = "https://github.com/ei-grad/capos"for the first wave (publishing many crates from one repo is standard; therepositoryfield can be updated later at repo-migration time without republishing). This claims thecapos-*prefix with shipped code. - Reserve bare
capos+ transport seam (behavior, closed2026-05-23 23:07 UTC).capos-rtnow defines theTransporttrait (src/transport.rs): the client-side seam ofsubmit_call/submit_call_with_copy_transfers/submit_call_borrowed_wait_forever(CALL),wait/try_complete(completion aftercap_enter), andrelease_wait(localRELEASE).RingTransportis the existing single-ownerRingClientviewed through the seam (current behavior, not new behavior); bothRingClientandRuntimeRingClientimplementTransport. The 189 client-side typed-client methods take&mut impl Transport; the result-cap-adopting methods stay on the concreteRuntimeRingClientbecause generalizing result-cap adoption across transports is later server-side/promise work. The new standalonecapos/facade crate re-exports the runtime, typed clients, theentry_point!macro, and apreludebehind the defaultringfeature; the later 4a slice maderemotea host-only feature over the transitional DTO backend. QEMU proof:make run-spawnbootsdemos/timer-smoke, whose typed-client code now imports fromcaposinstead ofcapos-rt, and asserts[timer-smoke] Timer now/sleep ok..capos-rtandcapos0.1.0were published with the slice-2 set after the final name re-check andmake capos-sdk-checkcustom userspace target verification; the repository-URL rewrite is no longer a blocker – see slice 2. remotetransport backend, split into 4a/4b.- 4a – transitional
RemoteTransport(behavior, closed2026-06-06). Thecaposfacade’sremotefeature now builds on the host target without the defaultringfeature, enablescapos-rt/host-testfor the shared typed clients, and providesRemoteTransportover the existing host DTO backend boundary used bytools/remote-session-client.RemoteTransportauthenticates through the same DTO gateway, obtains forwarded caps throughCapSetGet, assigns synthetic host-side cap ids, and provesSystemInfo.motdthroughSystemInfoClient::motd_waitover the current length-prefixed DTO wire without making the unpublished host tool crate part of the publishedcapospackage graph. Unsupported calls fail closed with ring-style transport completions. Negative-path hardening now covers wrong-interface and missing-cap denials, released local cap ids, remote denied calls, malformed and mismatched DTO responses, and disconnects during synchronous DTO calls. This is not blocked on the async-runtime decision and remains transitional host-backend proxying, not guest-wirecapnp-rpcwith promise pipelining. - 4b – live-proxy
capnp-rpcupgrade (behavior, blocked). Replace the DTO wire underRemoteTransportwith standardcapnp-rpcframing and live object proxies. Gated on a reviewed capOS userspace async runtime or a sync-friendly Cap’n Proto RPC adapter, tracked by remote-session Gate 1 (docs/backlog/remote-session-capset-client.md). Do not block 4a on it.
- 4a – transitional
- Versioning + publish CI (
harness-hardening, closed2026-05-24). The “Versioning Policy” section above pins pre-1.0 SemVer, the schema/ABI-to-breaking-bump mapping, lockstep versioning, and the1.88.0MSRV.make sdk-publish-dry-runreproduces the slice-2 publish verification in one command (coordinated multi-packagecargo publish --dry-runover the four host-buildable crates in dependency order + an MSRV-floor build); see “Publish Dry-Run Gate”. ThecaposfacadeREADME.mddocuments the workingringdefault and the transitionalremotefeature..github/workflows/publish-crates.ymlruns the same release gates, obtains a short-lived crates.io token through trusted publishing only fromrefs/heads/main, skips versions already present on crates.io, and publishes the six crates in dependency order when its manualpublishinput is enabled with the current explicit user release instruction recorded in the dispatch input. Non-mainpublish=truedispatches and publish dispatches without that current instruction fail before any crates.io token is requested. The initial six-crate0.1.0release is complete; future releases use the workflow only after explicit user authorization for that release and once crates.io trusted publishers are configured for the six crates.
Conflict Surface
- Owns: NEW
capos/facade crate, this backlog file, the roadmap “Crate publication” section,[package.metadata]/publish metadata on the published crates, and any newdocs/proposals/capos-sdk-proposal.md. - Coordinates (do not run blindly in parallel):
capos-rt/– theTransport-trait refactor of typed clients. Serial with other capos-rt client changes.tools/remote-session-client/and the Remote Session CapSet Client plan – theremotetransport reuses that host transport. Thecapnp-rpcrewrite is owned there, not here.
- Must NOT touch:
schema/capos.capnportools/generated/(the SDK is a schema consumer, not a producer) and kernel behavior. If a slice needs a schema change, it queues on the shared schema serial surface under the owning plan, not this one.
Grounding Files
docs/roadmap.md“Crate publication” track.docs/backlog/remote-session-capset-client.md(remote transport, gating).docs/proposals/remote-session-capset-client-proposal.md.docs/proposals/userspace-binaries-proposal.md(the C substrate layer under the same SDK family).docs/capability-model.md, README “Core Idea” (design principle 5: each process is a capnp-rpc vat; the ring is its connection).
Capability-Infrastructure Cluster Backlog
A planning audit found a cluster of maturing proposals whose Phase 1 slices are
now extractable (their stated prerequisites have landed) plus the Stage 6
capability remainder. Most of these slices ADD interfaces to
schema/capos.capnp and therefore share the schema serial surface: only one
plan at a time may change the schema (docs/backlog/index.md “Concurrency
Notes”), and the next plan must rebase on the generated-code refresh. This file
decomposes the cluster and records the recommended ordering so the slices do not
all become ready at once and collide on that surface.
docs/tasks/README.md points here for the cluster; it should not inline the details.
Ordering Contract
- The non-schema slices (capos-service framework, tickless idle, default avatar) are dispatchable in parallel today and have their own ready task files; they do NOT queue here.
- The schema-touching slices below queue on the shared schema serial surface.
Promote ONE at a time from this backlog into a
docs/tasks/file, land it, refresh generated bindings, then promote the next. Do not file all of them asreadysimultaneously. - The
ResourceProfileRecord/ManifestResourceProfileschema,capos_config::ResourceProfilecarrier, and non-schema spawn-limit enforcement have landed. Crypto key caps Phase 1 has also landed. The next queued schema-serial slice iscrash-recovery-stale-cap-phase1. - Recommended schema promotion order from here: crash-recovery stale-cap →
authority-broker → live-upgrade
CapRetarget→ Stage 6 remainder. Reorder by explicit user priority. Do not promote a schema slice in parallel with another schema-surface task.
Schema-Serial Phase-1 Slices
Each slice names a 1-line scope, the owning proposal, and the conflict domains
its eventual task file should carry. All share
interface:schema-capos-capnp + path:schema/capos.capnp +
path:tools/generated/ (the serial surface) in addition to the listed domains.
monitoring-log-surface (landed)
- Scope:
LogSink/LogReaderschema + a minimal userspace log service backed byConsole, withlogLevelenforcement and scopedLogSinkcaps granted to children at spawn. Source:docs/proposals/system-monitoring-proposal.md. - Domains:
resource:system-monitoring,path:kernel/src/cap/,path:demos/,docs:system-monitoring. - Landed (2026-05-25): additive
LogSink.write @38/LogReader.read @39plusLogRecord/LogFilter(reusingLogLevel), backed by a bounded drop-oldest kernel ring (kernel/src/cap/log.rs). The sink drops below-SystemConfig.logLevelrecords (boot-seeded) and forwards accepted records to serial; the reader returns cursor/filtered records withnextCursor/dropped.capos-rtLogSinkClient/LogReaderClient, producer/reader demos,system-monitoring-log.cue, andmake run-monitoring-log-smokeprove the sink drop, read-back, and reader-sideminLevelfilter. The widerSeverity(critical), correlation fields, token-bucket backpressure, and persistent retention remain later phases. Task:docs/tasks/done/2026-05-25/cap-infra-monitoring-log-surface.md.
crypto-key-caps-phase1 (landed)
- Scope:
SymmetricKey/PrivateKey/PublicKeyschema interfaces + a software-backed userspace key service + a QEMU encrypt/sign smoke over the cap boundary. Unblocks TLS, OIDC, volume encryption, signed audit, SSH cert upgrade. Source:docs/proposals/cryptography-and-key-management-proposal.md. - Domains:
resource:crypto-key-service,path:demos/,docs:cryptography-and-key-management. - Landed (2026-06-06): minimal RAM-only
SymmetricKey,PrivateKey, andPublicKeyABI inschema/capos.capnp, regenerated bindings,capos-tlsXChaCha20+HMAC-SHA256/P-256 cores, RAMKeyVaultprivate-key custody, and the development-onlyKeySourcebootstrap. Local proofs cover symmetric AEAD/MAC, private/public signing, KeyVault stale-handle custody, and development-source admission/rejection. Remaining work is production/runtime key service wiring, symmetric derivation/wrapping, persistence, hardware/cloud custody, ACME/TLS handshakes, and production public-ingress key sources. Task:docs/tasks/done/2026-06-06/cap-infra-crypto-key-caps-phase1-reconcile-local-proof.md.
time-wallclock-phase1 (landed)
- Scope:
WallClockread cap +ClockProvenancelabel + manifest-seeded boot time; WASIclock_time_get(REALTIME)and audit timestamp delegate to it. Source:docs/proposals/time-and-clock-proposal.md. - Domains:
resource:time-clock-authority,path:kernel/src/cap/,docs:time-and-clock. - Landed (2026-05-24, fixed-boot-base variant):
WallClock.wallTimeread cap +ClockProvenanceenum (untrusted @0fail-closed zero value),KernelCapSource::wallClock @36,kernel/src/cap/wall_clock.rs, thecapos-rtWallClockClient, and a shelldatecommand grantedwall_clockinsystem-shell.cueand asserted bymake run-shell. ManifestseedUtcSeconds, a statefulWallClockState, WASI realtime-clock delegation, and init audit/TLS grants remain Phase 1.x / Phase 2 follow-ups. Task:docs/tasks/done/2026/time-wallclock-phase1.md.
crash-recovery-stale-cap-phase1
- Scope: stale-cap
DISCONNECTED/server-death CQE propagation to in-flight callers and endpoint holders on unplanned process death, plus a redactedCrashRecordappended toAuditLog. Source:docs/proposals/crash-recovery-supervision-proposal.md. - Domains:
resource:crash-recovery,path:kernel/src/cap/,path:kernel/src/process.rs,docs:crash-recovery.
debug-session-phase1
- Scope:
DebugSessionattach cap (owner-consent or broker maintenance grant, audited) + read-only cap-table snapshot that transfers no authority. Source:docs/proposals/debug-trace-authority-proposal.md. - Domains:
resource:debug-trace-authority,path:kernel/src/cap/,docs:debug-trace.
authority-broker-phase1
- Scope: endpoint-served
AuthorityBroker+ShutdownControlschema + runtime client + a QEMU proof that an anonymous shell cannot invoke shutdown. Source:docs/proposals/userspace-authority-broker-proposal.md. - Domains:
resource:authority-broker,path:init/,path:shell/,docs:userspace-authority-broker. - Status note: the interim kernel broker no longer owns hard-coded demo binary
allowlists.
kernelParams.authorityBrokerPolicynow carries the admitted session-context, remote-client spawn, and worker service grant policy with manifest validation. The endpoint-served userspace broker and shutdown-control interfaces remain the queued Phase 1 work.
live-upgrade-capretarget-phase1
- Scope:
ProcessControl+retargetCapskernel op for stateless Case 1 upgrades, with a QEMU retarget-mid-call smoke. Foundation for DDF userspace-driver fault containment. Source:docs/proposals/live-upgrade-proposal.md. - Domains:
resource:live-upgrade,path:kernel/src/cap/,docs:live-upgrade.
system-info-hostname (done)
- Scope: add
hostnameto theSystemInfocap +kernelParams.hostname+ manifest field. Source:docs/proposals/system-info-proposal.mdPhase 3. - Domains:
resource:system-info,path:kernel/src/cap/,docs:system-info. - Landed:
SystemInfo.hostname @1served fromkernelParams.hostname(defaultcapos), printed by the shellhostnamecommand, asserted inrun-shell. Task:docs/tasks/done/cap-infra-system-info-hostname.md.
stage6-remainder
- Scope: the remaining Stage 6 capability semantics –
SharedBufferSQE opcode + kernel mapping authority, typed notification objects with ringRecvintegration, andCapabilityManager.list/grant. Decomposed indocs/backlog/stage-6-capability-semantics.md; queue each as its own slice on the schema surface. Source: roadmap Stage 6. - Domains:
resource:stage6-capability-semantics,path:kernel/src/cap/,path:kernel/src/cap/ring.rs,docs:stage-6.
Non-Schema Slices
These are dispatchable now and are tracked as ready or done tasks, not queued on the schema serial surface:
- Done:
cap-infra-resource-profile-enforcement-local-proof– binds the existingResourceProfileRecord/ManifestResourceProfileandcapos_config::ResourceProfilecarrier to remaining cap-slot and thread spawn-limit enforcement, with rollback proof (docs/tasks/done/2026-06-06/cap-infra-resource-profile-enforcement-local-proof.md). - Done:
capos-service-lifecycle-slice1–ServiceMain/lifecycle framework abovecapos-rt, one converted gateway proof (docs/tasks/done/2026/capos-service-lifecycle-slice1.md). - Done:
default-user-avatar– deterministic native-shell avatar selection over the shipped flat catalog, printed in the shellsessionoutput without schema or broker changes (docs/tasks/done/2026/default-user-avatar.md). - Done:
scheduler-tickless-idle-step6– enable true-idle tickless windows while keeping cap-enter polling dependencies periodic (docs/tasks/done/2026/scheduler-tickless-idle-step6.md).
Still-Gated (not in this cluster)
Memory-authority, OOM/swap, certificates/TLS, OIDC, volume-encryption,
go-runtime, chat-multimedia, llm/agent, browser, GPU, formal-MAC/MIC,
cloud-metadata, HPC, scientific, hosted-agent-swarm remain gated on this
cluster, DDF, networking, storage persistence, or SMP Phase C / Ring v2. See
each proposal’s gating note and docs/backlog/research-design-gaps.md.
SMP Phase C Backlog
ARCHIVED — milestones complete; residual full-SMP-hardware work tracked in Scheduler Evolution “Phase F.5: Full-SMP Hardware Scalability”. Both visible milestones this backlog tracks landed: Multi-Process SMP Concurrency (the
make run-smp-process-scaleproof is complete) and In-Process Threading Scalability (closed at commit136b72de,2026-05-01 14:58 UTC). No SMP track is active indocs/tasks/README.md. This file is retained as historical context and as the proof-contract reference; do not select new work from it – the next visible SMP milestone is the planning slot inscheduler-evolution.mdPhase F.5.
Detailed context for the selected SMP Phase C AP scheduler-owner proof and the remaining full-concurrent-SMP and in-process thread-scaling follow-on work.
Visible Goal
Move from a single scheduler owner to multiple CPUs that can run independent scheduler-owned kernel/user work concurrently, and prove that capability-owned processes can improve wall-clock performance on a deterministic CPU-bound workload under QEMU/KVM.
This backlog tracks two distinct visible milestones:
- Multi-Process SMP Concurrency:
make run-smp-process-scaleshould boot a focused manifest, run a deterministic SMP scaling demo across independent worker processes, print verified workload output, and report comparable 1/2/4-process timing. The proof is complete only when repeated KVM-backed-smp 1and-smp 2runs show near-linear speedup for the selected workload, while the ordinary manifest, ring, thread, park, and process-exit smokes still pass under-smp 2. - In-Process Threading Scalability:
make run-thread-scaleshould run a deterministic workload across sibling threads inside one process, verify the result, and report comparable 1/2/4-thread timing. This milestone closed at commit136b72de(2026-05-01 14:58 UTC) against the pre-collapse per-CPU placement model: caller-aware child publication and the existing timer fast-path slices produced repeated KVM-backed physical-core evidence above the configured 1-to-2 work and total speedup thresholds. The 4-worker row remained diagnostic rather than a linear-scaling claim. The 2026-05-02 per-CPU run-queue collapse retired that placement chain (caller-aware publication, per-CPU runnable queues, local-first stealing, theWakePolicy::QueueCpu(usize)variant). A post-collapse 3-run diagnostic oncapos-bench2026-05-02 10:42 UTC measured 1-to-2 work/total1.890x/1.792x(slight improvement) and 1-to-4 work/total1.504x/1.436x(clear regression on single-queue scheduler-lock contention). The formal capOS+Linux accepted-evidence pair landed against the same single-global-queue scheduler oncapos-bench2026-05-02 21:38 UTC againstmaincommit374f8556: capOS work1.883x/ total1.787xclear the configured 1-to-2 gates, while the 1-to-4 row (capOS1.566x/1.538xvs Linux3.963x/3.858x) is the diagnostic gating Phase D’s fair-share enqueue policy. Reintroducing per-CPU runnable queues with that policy must materially close the capOS-vs-Linux 1-to-4 gap before per-CPU queues land back in the scheduler. Seedocs/architecture/scheduling.md,docs/benchmarks.md, anddocs/backlog/scheduler-evolution.mdfor the current state.
Full concurrent SMP scheduling remains the underlying kernel goal for the multi-process milestone. It means more than one CPU can own scheduler work simultaneously, including per-CPU runnable ownership, cross-CPU idle-to-runnable handoff, reschedule IPIs, safe current-thread tracking, and reviewed lock/residency rules. The multi-process scaling demo is the first user-visible acceptance test for that kernel capability.
Completed Gates
- Ground the multi-CPU scheduling slice in the SMP proposal, scheduler and
threading docs, and relevant
docs/research/files. - Migrate syscall entry/exit to the GS-base/
swapgsper-CPU path, including non-sysretqscheduler/exit paths. - Add LAPIC timer, EOI, and IPI support for per-CPU ticks and cross-CPU coordination. The active backend is PIT-calibrated xAPIC MMIO with PIT/PIC fallback; x2APIC remains a later backend.
- Add TLB shootdown before any user address space can run on more than one CPU over its lifetime.
- Extend scheduler state from BSP-only ownership to per-CPU current-thread tracking with AP idle/runnable handoff. The first AP scheduler proof uses one AP as scheduler owner while the BSP stays in kernel idle, preserving the process-wide ring invariant.
- Add QEMU proof that AP cpu=1 executes scheduler-owned work and the
existing manifest/ring/thread/park smokes still pass under
-smp 2.
In-Process Threading Closeout Rules
-
Resolve the scheduler hot-lock blocker before calling the selected milestone a scalability proof. The implementation at the time had per-CPU runnable queues and dispatch state, but they remained under one global
Schedulerlock. A closeout branch should either split the hot dispatch path so ordinary timer preemption, local run-queue selection, and sibling CPU-bound thread requeue do not serialize on one global lock, or explicitly narrow the milestone to “functional in-process threading” and select a follow-on scheduler-lock scalability milestone. Completed 2026-05-01 14:58 UTC after repairing the benchmark shape against Linux baseline evidence and tightening caller-aware child publication: the repaired blocking-parent 16 MiB/64-round shape scales on Linux, and controlled physical-core capOS evidence passed the enforced 1-to-2 work and total gates. Four-worker capOS scaling remained a separate follow-up because total time still showed scheduler/exit/join overhead. (Update 2026-05-02: the per-CPU runnable queues and the caller-aware child publication described here were later collapsed into a single global runnable queue with the per-CPU run-queue-collapse cleanup slice; the recorded 1-to-2 capOS gates were against that pre-collapse placement model. The current single-global-queue scheduler now has its own formal accepted 1-to-2 pair oncapos-bench2026-05-02 21:38 UTC againstmaincommit374f8556(capOS work1.883x/ total1.787x; Linux baseline1.988x/1.987x); the 1-to-4 row remains the diagnostic gating Phase D’s fair-share enqueue policy. Per-CPU queues and caller-aware placement return when that policy ships and materially closes the capOS-vs-Linux 1-to-4 gap. Seedocs/architecture/scheduling.md,docs/benchmarks.md, anddocs/backlog/scheduler-evolution.mdfor current state.) -
Add a bounded timer continuation fast path as a conservative split-prep slice. Completed 2026-05-01 10:29 UTC: a user-mode LAPIC timer tick may keep running the current non-idle thread without entering
sched::schedule()only when a previous locked slow path has published a clean hard-work summary, the CPU has no pending reschedule IPI, and the per-CPU one-skip budget has not been exhausted. The 2026-05-01 11:40 UTC follow-up keeps every dirty producer forcing at least one locked timer pass, then allows remaining run queues and handoff-current markers alone to be treated as fairness/protection state for one continued tick. Direct IPC, deferred cleanup, Timer sleeps, and timed cap-enter/Park waiters still keep the hard slow-path bit set. The full scheduler path remains authoritative and still runs regularly for ring SQEs, cap-wait scans, cleanup, and accounting. This narrows timer-side scheduler-lock contention but does not by itself close the selected scalability milestone. Controlledcapos-benchphysical-core0-3before/after evidence for the initial strict-clean version stayedaccepted=false: baselinetarget/thread-scale/timer-fastpath-baseline-main-physical-20260501T102938/reported work speedups0.998xand0.998x; after-changetarget/thread-scale/timer-fastpath-after-physical-20260501T104700/reported work speedups1.001xand0.999x. Controlledcapos-benchphysical-core0-3evidence for the fairness-only follow-up also stayedaccepted=false: baselinetarget/thread-scale/20260501T120224Z/recorded work speedups1.001xand0.999xplus total speedups0.913xand0.587x; after-changetarget/thread-scale/20260501T120709Z/recorded work speedups1.001xand1.000xplus total speedups1.125xand0.828x. -
Add timer-fast-path attribution counters for guest-measure thread-scale runs. Completed 2026-05-01 10:58 UTC: aggregate and per-phase
timerlines now report fast-path attempts, continues, and fallback reasons for slow-required/dirty summaries, skip-budget exhaustion, pending reschedule IPIs, no-current/non-idle CPUs, and inactive/invalid scheduler CPUs. These counters answer whether the bounded continuation path fires inside benchmark phases. They are benchmark-only instrumentation and do not close the currentaccepted=falsespeedup gate. Local one-run evidence intarget/thread-scale/20260501T110157Z/passed with the new fields present in every 1/2/4-threadmeasure.log; the timed work phase recordedfast_path_continues=0for all three rows. -
Add timer slow-summary reason attribution for guest-measure thread-scale runs. Completed 2026-05-01 11:28 UTC: aggregate and per-phase
timer_slow_summarylines now report required/clean counts and the predicate reasons that keepTIMER_SLOW_PATH_REQUIREDset after a locked timer slow path. Reason fields cover nonempty run queues, direct IPC targets, handoff-current state, pending process termination/drop/stack release, timer sleeps, and timed cap-enter versus park waiters. Local one-run evidence intarget/thread-scale/20260501T112359Z/passed; the work phase showedrequired=2/4/8,clean=0,run_queue_nonempty=2/4/8,handoff_current=2/4/8, and zero timer sleeps/timed waiters for the 1/2/4-thread rows. The behavior follow-up keeps the output shape but changesrequiredto mean hard timer work, not run queues or handoff markers alone. This attribution does not close the selectedaccepted=falsespeedup gate. -
Add explicit thread-placement evidence and conservative new-child publication spreading. Completed 2026-05-01 12:37 UTC, refined 2026-05-01 13:20 UTC, and repaired 2026-05-01 14:58 UTC after the blocking-parent benchmark exposed a placement regression. Guest-measure runs now emit aggregate and per-phase
thread_placementlines for publish targets, caller-current publish buckets, caller-aware avoid, fallback, and strict-load fallback counts, selected CPUs, first-selected CPUs, and migration events across CPU slots 0-3. Newly created non-single-owner threads avoid the caller’s current CPU only when another active ready scheduler CPU has a strictly lower non-idle dispatch load under the scheduler lock; on equal load, an active-ready caller CPU wins the tie instead of falling through to CPU0-biased least-loaded scanning. Single-owner processes stay pinned to CPU0. Timer, unblock, direct-IPC fallback, steal retry, and steal requeue paths keep their existing allocation-free targeting behavior. (Update 2026-05-02: the per-CPU run queues described here were later collapsed into a single global run queue, retiring the caller-aware placement and steal scans. Seedocs/architecture/scheduling.mdand the per-CPU run-queue collapse entry indocs/backlog/scheduler-evolution.mdfor current state. Per-CPU queues return with the fair-share enqueue policy that Phase D will own.)The earlier avoid-caller rule passed the old spinning-parent 1-to-2 gate but was wrong for the repaired blocking-parent benchmark: a controlled run before the strict-load fix regressed to 1-to-2 work/total speedups `0.886x`/`0.928x` because the children were biased away from an otherwise available caller CPU. After the strict-load fix, controlled physical-core evidence passed the enforced 1-to-2 work/total gates with `1.828x`/`1.687x`. The same run recorded diagnostic 1-to-4 work/total speedups `3.029x`/`2.386x`; with scheduler switch diagnostics suppressed, those 1-to-4 diagnostics recorded `3.272x`/`2.303x`. Four-worker capOS scaling remains a follow-up, not a completed linear-scaling claim. -
Preserve correctness gates while narrowing the lock: generation-checked
ThreadRefownership, no stale runnable queue entries after process or thread exit, direct-IPC preference without bypassing ownership checks, allocation-free timer/unblock runnable publication, and cleanrun-smp2-smokesevidence. Completed 2026-05-01 14:58 UTC: the caller-aware publication change preserves single-owner pinning and leaves timer/unblock/requeue/direct-IPC targeting unchanged; ordinary-smp 2regression coverage passed. -
Rerun controlled physical-core evidence after any scheduler hot-lock change. The milestone should stay open until host-summary work and total gates pass, or until the milestone scope is intentionally changed and recorded in
docs/tasks/README.md,docs/roadmap.md, and this backlog. Completed 2026-05-01 14:58 UTC after benchmark repair: the matching Linux baseline validated the repaired blocking-parent 16 MiB/64-round shape on the selected physical CPU set with 1-to-2 work/total speedups1.991x/1.990xand 1-to-4 work/total speedups3.958x/3.834x. Controlled capOS evidence passed the enforced 1-to-2 work/total gates with1.828x/1.687x. -
Track post-closeout 4-worker scalability caveats separately from the recorded 1-to-2 milestone. The repaired benchmark proved the configured 1-to-2 work and total thresholds only against the pre-2026-05-02 per-CPU placement model. Linux now scales under the same repaired shape, so the remaining 4-worker capOS gap was not a benchmark-shape excuse. The strongest evidence at that time was: unsuppressed capOS 1-to-4 work/total speedups
3.029x/2.386x, scheduler-switch-log-suppressed diagnostics3.272x/2.303x, and guest-measure runs that showed globalSchedulerlock wait/hold cycles plus exit/join/block/schedule overhead while shared kernel locks were not visibly contended. Treat those numbers as historical; superseded by the formalcapos-bench2026-05-02 21:38 UTC pair againstmaincommit374f8556(capOS work1.883x/ total1.787xclears the configured 1-to-2 gates; 1-to-4 capOS1.566x/1.538xvs Linux3.963x/3.858xremains the diagnostic that gates Phase D’s fair-share enqueue policy). Future four-core scaling claims should add an explicit 1-to-4 gate, keep placement evidence enabled, separate work-window from total-time attribution, and continue splitting hot scheduler metadata/lock paths.
Multi-Process SMP Concurrency Gates
- Split the current one-owner scheduler latch into per-CPU scheduler run
queues or equivalent ownership that can keep more than one CPU executing
scheduler-owned work at the same time. Completed in commit
20f6894(2026-04-30 05:30 UTC) with per-CPU scheduler ownership, current and handoff tracking, per-CPU idle/fallback cleanup slots, and temporary BSP pinning for endpoint-, launcher-, spawner-, and thread-authority holders so process-wide ring paths remain single-owner during this milestone. - Add reschedule IPIs for idle-to-runnable handoff across scheduler owners. The current scheduler tree tracks pending reschedule IPIs per target CPU, wakes halted scheduler-owner loops for newly runnable work, and uses the same serialized fixed LAPIC IPI send path as TLB shootdown without claiming a general preemptive reschedule interrupt.
- Prove concurrent scheduler-owned work on more than one CPU with
independent worker processes first. This avoids process-wide capability
ring races while still proving real multi-core execution. The focused
proof harness is on mainline as of commit
c2790c0(2026-04-30 07:38 UTC), and the completed milestone is recorded at commit3fb89923(2026-04-30 09:45 UTC). - Add an SMP scaling demo binary and focused manifest. The first workload is segmented prime counting over generated ranges. It partitions work statically by worker index, avoids hot-path syscalls and serial output, produces aggregate prime-count/checksum verification, and prints one compact result line per accepted case.
- Add a host harness for
make run-smp-process-scalethat runs the same workload under-smp 1,-smp 2, and optionally-smp 4, captures raw logs, and reports worker count, CPU count, ticks or cycles, output checksum, and speedup. A single noisy QEMU run is not enough evidence for a scaling claim; keep raw repeated-run artifacts for review.tools/qemu-smp-process-scale-harness.shbuilds/usescapos-smp-process-scale.iso, stores serial logs undertarget/smp-process-scale/<timestamp>/, defaults to five repetitions, reports per-case medians, and enforces the 1.6x 1-to-2 median threshold only when KVM-backed evidence is available. - Treat near-linear 1-to-2 CPU speedup as the first publishable target. Use a threshold high enough to reject accidental concurrency illusions but low enough for QEMU/KVM variance, for example at least 1.6x median speedup over repeated runs. Record the exact threshold in the harness when this milestone is selected for implementation.
make run-smp-process-scale Proof Contract
This target is the acceptance test for Multi-Process SMP Concurrency. It must stay narrower than the later in-process threading milestone: one process ring per worker process, no sibling threads in the timed section, no shared ParkSpace words, no IPC throughput loop, and no completion-ring demux claim.
The first implementation should add:
- a focused
system-smp-process-scale.cuemanifest; - a coordinator binary that receives the manifest-granted
ProcessSpawner, spawns a fixed set of worker process cases, waits for each child, verifies aggregate results, and prints the compact result lines; - a worker binary or a small family of worker binaries that execute one static partition of the deterministic workload and report only their final result through a parent endpoint or other existing spawn-result path after the timed section finishes;
- a
tools/qemu-smp-process-scale-harness.shhost harness wired tomake run-smp-process-scale.
The workload should be segmented prime counting over generated integer ranges.
Each run case divides the same total range into workers contiguous segments.
Worker i handles segment i without terminal output, IPC calls, heap-heavy
allocation, or capability operations in the timed region. The coordinator
collects one post-compute result per worker and verifies the aggregate prime
count plus a stable checksum or hash against known constants before it accepts
timing evidence.
The guest must print one line per accepted run case in this shape:
[smp-process-scale] cpus=<n> workers=<n> range=<lo>..<hi> primes=<count> checksum=<hex> elapsed=<ticks-or-cycles> verified=true
The exact time source can be monotonic ticks or a cycle counter, but it must be an in-guest measurement that brackets only the worker-process computation after spawn/setup and before serial reporting. If timer granularity makes the proof too noisy, increase the total range instead of measuring host wall time as the primary signal. Host wall time may be reported as secondary harness metadata.
The host harness policy is:
- default to
CAPOS_SMP_SCALE_RUNS=5complete repetitions per CPU-count case; - run and report the advertised 1/2/4-worker timing cases. At minimum that
means
-smp 1/one worker,-smp 2/two workers, and a 4-worker timing case; the preferred 4-worker case is-smp 4when the local QEMU/KVM host exposes four usable vCPUs, otherwise the harness must still report the 4-worker case under the largest available SMP count and mark why a 4-vCPU run was not collected; - require KVM for a speedup claim. If
/dev/kvmor QEMU KVM acceleration is unavailable, the target may run a functional verification mode, but it must report that publishable speedup evidence was not collected; - keep raw serial and terminal logs under a stable
target/subdirectory such astarget/smp-process-scale/<timestamp>/; - summarize the median verified elapsed value for each case and require at
least
1.6xmedian speedup from the-smp 1/one-worker baseline to the-smp 2/two-worker case before accepting the near-linear 1-to-2 speedup claim; - rerun the ordinary manifest, ring, thread, park, and process-exit smokes
under
-smp 2before marking the selected milestone complete.
As of commit 3fb89923 (2026-04-30 09:45 UTC), the focused manifest,
process-scale demo, and
host-side harness wiring produce passing default repeated KVM-backed speedup
evidence. The accepted run in
target/smp-process-scale/cycle-balanced-default/ recorded medians
smp1=1693, smp2=1053, smp4=2314, or 1.608x, satisfying the required
1.6x threshold. The worker-reported elapsed value is a scaled user-mode cycle
count, and the static worker ranges are contiguous but cost-balanced for the
prime-counting loop. The ordinary -smp 2 smoke gate also passed:
target/smp2-smokes/run-smoke.log covers the default manifest smoke, and
target/smp2-smokes/run-spawn.log covers endpoint roundtrip, ring-reserved
opcodes, timer/runtime children, thread lifecycle, park cleanup, generic child
waits, and process exit. The Multi-Process SMP Concurrency milestone is
complete. The harness fails closed when the focused manifest, ISO, expected
compact proof lines, or speedup evidence are unavailable instead of fabricating
timing evidence.
tools/linux-smp-process-scale-baseline.sh is the reference-OS comparison for
this proof. It builds a tiny static Linux initramfs that runs the same forked,
deterministic prime-counting workload under the same QEMU/KVM CPU and memory
envelope, records raw logs under target/linux-smp-process-scale/, and uses
the same default five-run median policy. The script defaults now match capOS’
balanced contiguous splits; rerun the Linux comparison before publishing a new
OS-comparison table for the accepted capOS evidence.
The process-scale harnesses also expose an opt-in smp8-smt diagnostic through
CAPOS_SMP_SCALE_INCLUDE_SMT=1 and LINUX_SMP_SCALE_INCLUDE_SMT=1. It uses
the same range and aggregate verifier with eight contiguous ranges and is
collected only when the host reports at least eight logical CPUs. This case is
for SMT behavior on 4-core/8-thread hosts; it must not be treated as 8-core
evidence or included in the accepted 1-to-2 speedup gate.
The proof must not depend on KVM paravirtual APIC, IPI, or TLB-flush features. The current architectural xAPIC MMIO LAPIC timer/IPI path remains the correctness surface; paravirtual APIC acceleration is future performance work.
Before the scheduler implementation branch claims this target, review the non-blocked findings that could invalidate the evidence:
- panic-surface hardening for guarded unwraps, stale queues, blocking waits, process/thread exit, endpoint cancellation, and rollback restoration paths touched by scheduler ownership changes;
- quota/exhaustion behavior for the child-process, process-handle, outstanding call, scratch, frame, and invalid-SQE paths used by the coordinator and workers;
- release/revoke epoch behavior only for capabilities the demo actually grants.
Findings unrelated to this proof, such as DMA provenance, shared ParkSpace unmap/reuse, or same-process per-thread ring routing, should stay tracked in the migrated review-finding task records but must not be represented as blockers for independent worker-process SMP scaling.
SMP Review-Finding Reconciliation
This section classifies the review-finding task records for the selected multi-process SMP proof. It does not close those findings; it defines what the next scheduler and harness branches must satisfy before they can depend on the paths involved in the proof.
Blocking or proof-invalidating for this milestone:
- Scheduler panic surfaces touched by ownership changes. A branch that
changes scheduler ownership, per-CPU queues, idle-to-runnable handoff, or
process/thread exit cleanup must audit and either harden or explicitly test
the relevant
docs/panic-surface-inventory.mdscheduler rows:block_current_on_cap_enter,capos_block_current_syscall, stale run-queue process references,exit_current,current_ring_and_caps, schedulerstart, and context-restore CR3 assumptions. The branch should add targeted host or QEMU coverage for each panic surface it claims to close. - Process/resource exhaustion on paths used by the coordinator. The proof
depends on
ProcessSpawner,ProcessHandle.wait, result-cap adoption, and likely a parent endpoint or equivalent post-compute result path. Those paths must keep controlled failures for cap-slot exhaustion, process-handle exhaustion, endpoint queue pressure, scratch/result-buffer pressure, outstanding call pressure, and frame-grant/frame-exhaustion pressure from loading worker ELF pages, stacks, and TLS. Existing endpoint pending-RECV and queued-CALL overload coverage can be reused, but new coordinator-specific resource pressure introduced by the demo needs matching coverage before the proof is used as milestone evidence. - Runtime invalid-SQE flood handling if the harness exercises malformed submissions. The process-scaling demo should not need malformed SQEs. If a future scheduler or harness branch adds invalid-submission stress to this target, it inherits whatever invalid-submission review-finding task records remain open at that time. Runtime flood handling and log/rate-limit suppression should be evaluated separately because active remediation may close one without closing the other. Otherwise invalid-submission remediation remains a separate track and should not block the pure scaling proof.
Guardrails that must be preserved but are not standalone blockers for the independent worker-process proof:
- Explicit revoke/epoch tests. The demo should use only the capabilities needed to spawn workers and collect their final results. It must not claim peer revocation, stale session rejection, or object-epoch behavior unless it grants revocable/session-sensitive authority and adds flow-specific revoke or expiry tests.
- ParkSpace unmap/reuse enforcement. Independent worker processes should
avoid shared ParkSpace words in the timed workload. The ordinary park smoke
still has to pass under
-smp 2before milestone completion. - Process-wide capability ring constraint. The proof remains valid only because each worker has its own process ring and the timed section avoids ring traffic. It must not be cited as evidence for same-process sibling thread scalability, per-thread completion routing, or Ring v2.
- Raw evidence retention. Local repeated KVM logs are enough for this
development milestone, but production/reproducibility claims remain governed
by the provenance finding. Keep raw
target/smp-process-scale/<timestamp>/artifacts for review and avoid implying third-party reproducibility.
Out of scope for this milestone unless a branch expands the demo surface:
- DMA owner state, generation-checked DMA/MMIO/IRQ handles, stale interrupt proofs, and DMA ResourceLedger/OOM implementation;
- shared ParkSpace unmap/reuse beyond preserving existing park smokes;
- same-process thread creation, join, TLS, per-thread rings, and Ring v2 completion routing.
In-Process Threading Scalability Gates
- Define the per-thread capability-ring/completion-routing contract needed
before same-process sibling threads can claim independent scaling.
Completed 2026-04-30 10:19 UTC in
docs/proposals/ring-v2-smp-proposal.md: the first Ring v2 slice uses kernel-chosen child-thread ring mappings, a sharedRingEndpointrecord for initial and child rings, andThreadRef -> RingEndpointas the routing model. - Move capability-ring waiting/completion routing to the per-thread
ThreadRefmodel before claiming same-process sibling threads scale independently on different CPUs. Endpoint, timer, park, process-wait, thread-join, deferred-cancel, and direct IPC completion paths must all route through the target thread’sRingEndpointbefore same-process scaling can be claimed. Completed through the Ring v2/thread-scale substrate: spawned child threads receive independent ring endpoints, and local/controlled thread-scale evidence verifies child rings. - Ensure thread creation, FS/TLS setup, thread exit, join, park waits,
and process exit remain generation-checked and safe when sibling threads
can be resident on different CPUs. Completed through the reviewed
thread-scale implementation and the closeout
run-smp2-smokespass. - Add an in-process thread scaling demo that uses the same class of
deterministic CPU-bound workload as the multi-process proof, but splits
work across sibling threads in one process. Prefer fixed-size
parallel hashing/checksum chunks over prime counting for this milestone:
equal-byte chunks have much more uniform work than trial division over
increasing integer ranges, still keep the timed region syscall-free, and
verify through one deterministic root hash. Print one compact result line
per run.
Completed with the
demos/thread-scaleproof and reusabledemos/thread-scale-workloadcrate. - Add a host harness for
make run-thread-scalethat runs 1/2/4-thread cases under matching QEMU CPU counts, captures raw logs, and rejects results until the verified median speedup reaches the accepted threshold. Completed 2026-05-01 14:58 UTC after benchmark repair: the harness enforces KVM-backed 1-to-2 work and total thresholds when requested, carriesparent_waitandwork_roundsthrough CSV metadata, and the repaired blocking-parent 16 MiB/64-round run passed both enforced physical-core gates. 2026-04-30 12:34 UTC functional checkpoint: this branch adds the same-process demo and QEMU harness as diagnostic evidence only. The harness retains raw serial logs undertarget/thread-scale/<timestamp>/, parses exactly one verified[thread-scale]line per 1/2/4-thread case, and reports median elapsed values plus diagnostic speedups. Focused phase diagnostics now add guest cycle fields forspawn_ready,work,shutdown, andtotalto separate thread creation/ready time, the syscall-free workload window, and thread exit/join time.elapsedremains the workload value and is an alias ofwork, so harness speedup calculations continue to use only the timed workload. The retained artifacts are raw QEMU serial/terminal/stdout/stderr logs plusresults.csvandsummary.log. Host-side QEMU profiling is opt-in throughCAPOS_THREAD_SCALE_PROFILE=1; it requiresperfand storesperf.data,perf.script,perf.report.txt, andprofile-command.txtplusqemu.statusin each case-run artifact directory. These are host samples of the QEMU process and the preserved workload exit status, not guest symbol attribution by themselves, so the guest phase counters remain the default diagnostic. Guest-side kernel measurement is separately opt-in throughCAPOS_THREAD_SCALE_GUEST_MEASURE=1; it rebuilds the thread-scale ISO with the benchmark-only kernelmeasurefeature and retains release symbols for that benchmark build only. It writes the kernelmeasure:segment summaries from each case-run serial log to that case-run’smeasure.logand records the per-case userspace symbol map path inresults.csvunderguest_symbol_map. It also writes auser-pc-symbols.logreport beside eachmeasure.logand records that path underuser_pc_symbol_report; the report maps aggregate and per-phaseuser_pc_samplesexact-RIP buckets to the nearest userspace symbol address not greater than the PC. Those segment counters cover scheduler choice, schedule save/requeue, timer and park wake paths, cap-wait scans, thread exit/join cleanup, and process exit/drop cleanup. First-slice shared-kernel contention counters now add aggregate and per-phaseshared_kernel_locklines for frame allocator alloc/free lock acquisitions, contention, and spin loops, plus the ring-dispatch cap-table and ring-scratch locks beforecap::ring::process_ring. Follow-up counters also cover endpoint inner queue locks, endpoint cancellation scratch locks, and all direct per-process address-space lock sites. Heap attribution now routes the global allocator mutex throughSharedKernelLock::Heapin measure builds; one-run guest-measure evidence recorded zero timed-work-phase heap acquisitions for the syscall-free benchmark and nonzero spawn/shutdown allocator activity. These remain benchmark-onlymeasureattribution and do not close the broader shared-service contention finding. Fresh result rows now explicitly classify the benchmark hot section as syscall-free CPU work with ring and allocator activity limited to setup/shutdown, no endpoint or network activity, and result-only logging. The harness requires those benchmark-class fields for new QEMU parses, validates the expected values for this benchmark, carries them intoresults.csv, and keeps summary-only replay tolerant of legacy CSV files that predate the class columns. Local one-run evidence is retained intarget/thread-scale/20260501T083254Z/. Network/polling attribution now adds aggregate and per-phasemeasure: network_polllines for initialized virtio-net scheduler, runtime, and interface polling; the built-in TCP HTTP proof poll; virtqueue poll spins and completions; and pending network waiter scans. The guest-measure harness requires those lines. Local one-run evidence intarget/thread-scale/20260501T093505Z/passed and retained zero aggregate and per-phase network/poll counters for the 1/2/4-thread rows. The default thread-scale manifest has no virtio-net device, and the scheduler poll entry returns before the driver mutex in that no-device case. For this CPU-bound benchmark they are zero-evidence guardrails, not service-throughput proof and not milestone acceptance. The symbol map and resolved report are benchmark-only nearest-symbol attribution aids for interpreting rawuser_pc_samplesbuckets, not line-level profiling, a complete guest profiler, or normal-build guest attribution. These diagnostics are for reviewers, not speedup acceptance. The guest result line deliberately printsaccepted=falseas diagnostic guest-side state. Host acceptance is a separate summary decision:CAPOS_THREAD_SCALE_REQUIRE_SPEEDUP=1requires KVM-backed evidence and the configured 1-to-2 medianwork/elapsedspeedup threshold, but it does not fail merely because parsed guest rows carryaccepted=false. The total-case summary gate is separate and opt-in:CAPOS_THREAD_SCALE_REQUIRE_TOTAL_SPEEDUP=1requires KVM-backed evidence and the configuredCAPOS_THREAD_SCALE_TOTAL_SPEEDUP_THRESHOLDagainst the 1-to-2 mediantotalspeedup. It is also supported by summary-only replay and is not enforced by default.capos-benchdiagnostic runtarget/thread-scale/capos-bench-thread-20260430T125613Z/usedn2-highcpu-8KVM with QEMU pinned to physical CPUs0-3for five runs per case. Median elapsed cycles were thread156244112, thread284429072, and thread4140666438; diagnostic speedups were thread1-to-thread20.666xand thread1-to-thread40.400x, with all rows stillaccepted=false. After phase diagnostics landed,capos-benchruntarget/thread-scale/capos-bench-phase-20260430T134301Z/used the same pinned physical CPU set and recorded five-run medians: thread1elapsed/work=56285136,spawn_ready=43054612,shutdown=57693626,total=157008630; thread284432724,76247932,142200058,303096216; thread4140768008,205527230,395434364,741943554. The phase output shows shutdown/join cost increasing sharply with worker count, but all rows still remainaccepted=false. After child-ring endpoints and the optional SMT8 diagnostic landed,capos-benchphysical-core runtarget/thread-scale/20260430T151909Z/recorded five-run medians pinned to logical CPUs0-3: thread1elapsed/work=56215128,spawn_ready=41692656,shutdown=57753172,total=155536564; thread284420848,74791942,142065130,301170274; thread4140697028,143691606,395397620,679786606. Final SMT diagnostic runtarget/thread-scale/capos-bench-final-smt8-20260430T154058Z/at commit19f2fc66used logical CPUs0-7and recorded medians: thread156272620,54277322,57824172,168448508; thread284343990,72757730,142229724,299693446; thread4140992614,144614212,396264522,681167764; thread8253352976,290422132,1239856304,1786188514. All rows remainaccepted=false, and thread8 is informational SMT evidence only. Scheduler-unpin final diagnostic runtarget/thread-scale/scheduler-unpin-final2-20260430T160700Z/removed the scheduler’s transient same-pid pinning and verified 1/2/4-thread cases without the child-ring map/unmap TLB shootdown panics seen during this slice. One-run medians were thread1elapsed/work=56293734,spawn_ready=39202342,shutdown=34848540,total=130344694; thread257101752,95921604,69869786,222894030; thread4274828354,275826356,407818252,958473044. Diagnostic speedups were thread1-to-thread20.986xand thread1-to-thread40.205x; all rows remainaccepted=false. Follow-up local checks passedmake run-smp2-smokesintarget/smp2-smokes/20260430T160936Z/and reran three thread-scale samples intarget/thread-scale/scheduler-unpin-rerun-20260430T161104Z/. That rerun kept correctness intact but recorded thread4902520658cycles under local oversubscription, so it remains diagnostic only. After guest-side measurement landed,capos-benchruns at commita5c4f789recorded five-run medians with QEMU pinned to host logical CPUs0-3, which map to distinct physical cores on that host: thread156341030, thread256166300, thread470122044(1.003x,0.803x). The SMT diagnostic pinned to logical CPUs0-7recorded medians thread156315082, thread256233080, thread462630052, thread8125488946(1.001x,0.899x,0.449x). The one-run guest-measure pass intarget/thread-scale/20260430T182824Z/recorded per-casemeasure.logfiles. Top measured guest-side cycle totals werering_processingandmethod_body, withsched_choose_nextandthread_exit_join_cleanupgrowing at higher thread counts. A follow-up local phase-aware guest-measure pass intarget/thread-scale/20260430T184532Z/verified that each casemeasure.lognow includes final-summarymeasure: checkpointandmeasure: phaseattribution forspawn_ready,work,shutdown, andfinal_total; the harness rejects guest-measure runs missing any of those phase summaries. These runs remain diagnostic andaccepted=false. After phase-aware guest measurement landed on main at commitda92ed42,capos-benchreran the diagnostic with QEMU pinned to host logical CPUs0-3, which map to distinct physical cores on that host. Runtarget/thread-scale/capos-bench-phase-main-20260430T191146Z/recorded five-run medians: thread1elapsed/work=56242252,spawn_ready=38789562,shutdown=34859130,total=130093430; thread256233998,91718518,61923280,205126974; thread462926552,109723566,119015960,297970796. SMT diagnostic runtarget/thread-scale/capos-bench-phase-smt8-main-20260430T191408Z/pinned QEMU to logical CPUs0-7and recorded medians: thread156198166,41134070,34781494,132161420; thread256196302,42453050,63546086,162449504; thread462361512,87093620,109458814,258043804; thread8125378372,249877254,528656458,904149404. A one-run host-profile plus guest-measure sample intarget/thread-scale/capos-bench-profile-phase-main-20260430T191703Z/used temporary host perf access with QEMU pinned to logical CPUs0-3, then restoredkernel.perf_event_paranoid=4. The host reports still show QEMU/KVM execution,ioctl, QEMU mutexes, and MMIO/read helpers near the top; guest phase counters show no ring dispatches in the measured work phase, while shutdown/join and scheduler choice costs grow with worker count. These results remain diagnostic andaccepted=false. Artifact content verification after collection checkedsummary.logandresults.csvfor the two five-run diagnostics and the one-run profile sample, plus the profile sample’smeasure.logandperf.report.txt, against the recorded medians, pinning,accepted=falsestatus, guest phase claims, and host-profile claims. Join-cleanup optimization follow-up on branchworkplan/thread-scale-join-cleanupadds per-thread pending join-waiter accounting so exiting worker threads that never blocked inThreadHandle.joinskip the thread-handle waiter scan. Local evidence:target/thread-scale/join-cleanup-local-20260430T193657Z/passed functional guest-measure verification, andtarget/thread-scale-join-cleanup-run-spawn.logpassedmake run-spawn; local timing remains diagnostic because the host was not a controlled benchmark environment. Controlledcapos-benchreruns for this branch kept all rowsaccepted=false: physical-core runtarget/thread-scale/capos-bench-join-cleanup-20260430T194536Z/recorded medians thread156173118, thread256166224, thread462070170(1.000x,0.905x), and SMT diagnostictarget/thread-scale/capos-bench-join-cleanup-smt8-20260430T194734Z/recorded medians thread156251116, thread256197306, thread462519276, thread8122089762(1.001x,0.900x,0.461x). Scheduler-choice cleanup follow-up on branchworkplan/thread-scale-scheduler-choiceremoves a redundant blocked-thread scan from the idle fallback inchoose_next_locked. Local functional evidence:target/thread-scale/scheduler-choice-local-20260430T200257Z/passed guest-measure verification. Controlledcapos-benchruntarget/thread-scale/capos-bench-scheduler-choice-20260430T201041Z/recorded medians thread156171526, thread256301462, thread462433702(0.998x,0.900x), so the cleanup does not close the milestone. The immediate review-finding note that the scheduler still had a two-CPU owner mask is addressed by raising the temporary scheduler-owned CPU slot count and wake mask to four, so the 4-thread diagnostic can exercise four scheduler owners. This is only a blocker-removal step. The open attribution, serial/logging, scheduler-lock counter, workload-baseline, and per-CPU run-queue findings in the migrated review-finding task records remain required before accepting a speedup claim. Initial local build gates passed. The firstmake run-smp2-smokesattempt intarget/smp2-smokes/four-scheduler-cpus-20260430T202129Z/exposed an early boot failure after the enlarged static scheduler value crossed a fragile initialization path. The implementation now uses a capacity-reserved deferred process-drop queue instead of embedding oneProcessslot per scheduler CPU in theSchedulerstatic. Boundedrun-spawnsmoke evidence passed intarget/smp2-smokes/four-scheduler-cpus-spawn-pending-vec-20260430T203055Z/. Fullmake run-smp2-smokespassed intarget/smp2-smokes/four-scheduler-cpus-full-20260430T203214Z/. Local thread-scale guest-measure verification passed intarget/thread-scale/four-scheduler-cpus-local-20260430T203356Z/withCAPOS_THREAD_SCALE_RUNS=1, QEMU pinned to local CPUs0-1, and cases through-smp 4; local timing remains noisy and is not controlled speedup evidence. Controlledcapos-benchruns then verified the effect on the benchmark host. Physical-core runtarget/thread-scale/capos-bench-four-scheduler-cpus-20260430T203733Z/used QEMU pinned to logical CPUs0-3, recorded medians thread156144884, thread256190496, thread436386164(0.999x,1.543x), and kernel logs show AP scheduler owners on CPUs 1-3 starting benchmark threads. SMT diagnostictarget/thread-scale/capos-bench-four-scheduler-cpus-smt8-20260430T203945Z/used logical CPUs0-7, recorded medians thread156181720, thread256191504, thread456213928, thread8116270280(1.000x,0.999x,0.483x). Both rows remainaccepted=false; the physical 4-thread speedup is close to but below the1.6xthreshold, and the SMT8 row is informational because the scheduler owner mask remains four CPUs. Scheduler-attribution follow-up branchworkplan/thread-scale-scheduler-attributionadds guest-side total and per-phase scheduler counters for direct-target, run-queue, and idle candidate classes; runnable/retry/drop outcomes; and reschedule IPI target/sent/skipped/failure counts. Local functional verification intarget/thread-scale/scheduler-attribution-local-20260430T210322Z/passed all 1/2/4-thread cases withCAPOS_THREAD_SCALE_GUEST_MEASURE=1,CAPOS_THREAD_SCALE_RUNS=1, and QEMU pinned to local CPUs0-1; the shell wrapper reported failure only because it reused zsh’s read-onlystatusparameter after the harness had already written a successfulsummary.log. The 4-thread work phase now records scheduler retry pressure (55run-queue candidate checks,7idle candidate checks,28runnable outcomes, and34retry outcomes) while still recording zero ring dispatches. This materially improves attribution but does not close the broader scheduler-lock, serial, CR3/TLB, guest-symbol, or workload-baseline requirements in the migrated review-finding task records. Serial-attribution follow-up adds guest-side total and per-phase serial byte counters toCAPOS_THREAD_SCALE_GUEST_MEASURE=1. Bytes are counted after LF-to-CRLF expansion and after a UART byte is emitted, including emergency writes in measure kernels. Local functional verification intarget/thread-scale/serial-attribution-local-20260430T212243Z/passed all 1/2/4-thread cases withCAPOS_THREAD_SCALE_RUNS=1and QEMU pinned to local CPUs0-1; the stricter harness now requires aggregate and per-phase serial lines. The run recorded total serial bytes of4161,4788, and6295; work-phase serial bytes stayed at74in each case, while shutdown serial bytes rose from70to145to631. This closes the serial-byte counter blind spot, but it does not close scheduler-lock, CR3/TLB, guest-symbol, workload-baseline, or logging-suppression A/B requirements in the migrated review-finding task records. Scheduler-lock attribution follow-up adds guest-side total and per-phase global scheduler-lock counters toCAPOS_THREAD_SCALE_GUEST_MEASURE=1. It records acquisitions, contended acquisitions, try-lock failures asspin_loops, contended wait cycles, and hold cycles. Local functional verification intarget/thread-scale/lock-attribution-local-20260430T214854Z/passed all 1/2/4-thread cases withCAPOS_THREAD_SCALE_RUNS=1and QEMU pinned to local CPUs0-1; the stricter harness now requires aggregate and per-phase scheduler-lock lines. The local 4-thread final-total counters were234acquisitions,104contended acquisitions,2,161,691spin loops,1,239,033,542wait cycles, and570,372,812hold cycles; the 4-thread work phase still had15acquisitions,5contended acquisitions,95,047spin loops,37,181,792wait cycles, and32,762,392hold cycles. This closes the first scheduler-lock counter blind spot; hold cycles include measure acquisition-counter update overhead and exclude release-counter update and unlock overhead, so they are first-pass attribution rather than exact critical-section time. At that point, CR3/TLB, guest-symbol, workload-baseline, logging-suppression A/B, and controlled benchmark-host confirmation requirements in migrated review-finding task records remained open; timer tick count attribution was queued for the follow-up recorded below. Controlledcapos-benchreruns after this landed on main at commit6eff7ae4used QEMU pinned to logical CPUs0-3for physical-core evidence and0-7for the informational SMT diagnostic. Physical-core runtarget/thread-scale/capos-bench-lock-main-physical-20260430T220944Z/recorded medians thread156309194, thread256302666, thread428301916(1.000x,1.990x); SMT diagnostictarget/thread-scale/capos-bench-lock-main-smt8-20260430T221246Z/recorded medians thread156379514, thread256186566, thread428259776, thread8131264324(1.003x,1.995x,0.430x). A one-run guest-measure confirmation intarget/thread-scale/capos-bench-lock-main-measure-20260430T221543Z/verified scheduler, serial, and scheduler-lock lines on the benchmark host. Host perf profiling was not collected becauseperf_event_paranoid=4blocked unprivileged perf on the restarted VM. Timer-attribution follow-up on branchworkplan/thread-scale-timer-attributionadds guest-side total and per-phase timer counters toCAPOS_THREAD_SCALE_GUEST_MEASURE=1, distinguishing user-mode timer interrupts entering the scheduler path from kernel-mode timer interrupts that only advance time and EOI, with separate BSP tick-advance counts. The harness now requires aggregate and per-phase timer lines. Local functional verification intarget/thread-scale/timer-attribution-local-20260430T223441Z/passed all 1/2/4-thread cases withCAPOS_THREAD_SCALE_RUNS=1, QEMU pinned to local CPUs0-1, and guest measurement enabled. Aggregate timer counters were7/7/0/7,25/17/8/9, and132/101/31/23(interrupts/user_scheduler/kernel_only/bsp_tick_advances); the 4-thread work phase recorded7/7/0/1. The remaining attribution requirements at that point were CR3/TLB, guest-symbol or guest-PC sampling, workload-baseline, and logging-suppression A/B evidence. CR3/TLB-attribution follow-up on branchworkplan/thread-scale-tlb-attributionadds guest-side total and per-phase TLB counters toCAPOS_THREAD_SCALE_GUEST_MEASURE=1, covering runtime CR3 writes, pending-flush checks, pending full TLB flushes, remote shootdown requests, target CPUs, shootdown IPIs, and deferred completion drains. The harness now requires aggregate and per-phase TLB lines. Local functional verification intarget/thread-scale/tlb-attribution-local-20260430T225628Z/passed all 1/2/4-thread cases withCAPOS_THREAD_SCALE_RUNS=1, QEMU pinned to local CPUs0-1, and guest measurement enabled. Aggregate TLB counters were3/28/0/0/0/0/0,7/52/3/3/3/3/2, and14/139/17/7/17/17/4(cr3_writes/pending_flush_checks/pending_flush_all/shootdown_requests/shootdown_target_cpus/shootdown_ipis/deferred_completion_drains); the 4-thread work phase recorded0/10/0/0/0/0/0. The remaining attribution requirements at that point were guest-symbol or guest-PC sampling, workload-baseline evidence, and logging-suppression A/B evidence. Logging-suppression A/B follow-up addsCAPOS_THREAD_SCALE_SUPPRESS_SWITCH_LOGS=1tomake run-thread-scale. The knob suppresses scheduler transition diagnostics in the benchmark kernel while preserving proof, error, and measurement output. Local one-run A/B verification withCAPOS_THREAD_SCALE_GUEST_MEASURE=1,CAPOS_THREAD_SCALE_RUNS=1, and QEMU pinned to local CPUs0-1produced artifacts intarget/thread-scale/logging-ab-baseline-local-20260430T231800Z/andtarget/thread-scale/logging-ab-suppressed-local-20260430T232600Z/. Targeted scheduler diagnostic line counts dropped from7/12/18to0/0/0for the 1/2/4-thread cases, and aggregate serial bytes dropped from4161/4743/5889to3894/4280/5047. This closes only the logging A/B blind spot; guest-symbol or guest-PC sampling and workload/cacheline baseline evidence remained open. Linux pthread baseline follow-up addsmake run-linux-thread-scale-baselinefor the exact fixed-size thread-scale checksum workload. Controlled nativecapos-benchruns at commit370ce145with taskset pinned to physical-core logical CPUs0-3recorded padded-slot capOS-shaped work-window medians of306776,152293, and1120024ns for 1/2/4 workers (2.014x,0.274x). Compact-slot medians were similar at316388,152291, and1123534ns (2.078x,0.282x), so result-slot false sharing is not the visible differentiator for the current workload shape. The SMT diagnostic pinned to0-7recorded padded work medians303877,155565,170019, and243481ns for 1/2/4/8 workers (1.953x,1.787x,1.248x). The exact baseline shows the one-megabyte workload and coordinator spin window are not a clean four-core linear-scaling reference. This closes the exact Linux pthread baseline and result-slot padding blind spots only; guest-symbol or guest-PC sampling and larger-workload/Amdahl- sensitivity evidence remain open. Benchmark repair follow-up completed 2026-05-01 14:58 UTC: the default host baselines now use blocking parent join, 262,144 blocks (16 MiB), andwork_rounds=64instead of the old 1 MiB/spinning-parent shape. Controlled Linux evidence on the selected physical CPU set recorded 1-to-2 work/total speedups1.991x/1.990xand 1-to-4 work/total speedups3.958x/3.834x, proving the repaired benchmark shape can scale on the host before capOS results are interpreted as scheduler evidence. Guest-PC sampling follow-up adds a measure-only exact-RIP histogram for user-mode timer interrupts while a thread-scale case is active. The harness now requires aggregate and per-phaseuser_pc_sampleslines forCAPOS_THREAD_SCALE_GUEST_MEASURE=1. Local one-run verification intarget/thread-scale/guest-pc-sampling-local-20260501T001500Z/usedCAPOS_THREAD_SCALE_RUNS=1with QEMU pinned to local CPUs0-1and passed all 1/2/4-thread cases. Aggregate PC sample counts were6,17, and55with zero overflow; the 4-thread phase counts were spawn-ready13, work9, shutdown33, and final-total55. This closes the guest-PC sampling blind spot only; the later symbol-map harness slice preserves a benchmark-only userspace map for interpreting those raw PC buckets, and larger-workload Amdahl-sensitivity evidence remained open until the follow-up below. Resolved PC attribution report follow-up completed 2026-05-01 06:13 UTC on branchworkplan/thread-scale-pc-symbol-report: guest-measure case-runs now writeuser-pc-symbols.logbesidemeasure.logand record it inresults.csvunderuser_pc_symbol_report. Local verification intarget/thread-scale/20260501T060822Z/usedCAPOS_THREAD_SCALE_GUEST_MEASURE=1,CAPOS_THREAD_SCALE_RUNS=1, and QEMU pinned to local CPUs0-1; the thread4 report resolves sampled PCs toworker_entry,run_case, andRingClient::waitnearest symbols and keeps PCs below the first symbol as explicit<unmapped>rows. Larger-workload/Amdahl follow-up addsCAPOS_THREAD_SCALE_TOTAL_BLOCKSandLINUX_THREAD_SCALE_TOTAL_BLOCKSso the same deterministic checksum workload can run beyond the default one-megabyte case. Controlledcapos-benchruns at commit32c066b8used1,048,576blocks (64 MiB). With QEMU pinned to physical-core logical CPUs0-3, capOS work medians were112590712,112511206, and36369098cycles for 1/2/4 workers (1.001x,3.096x), while total medians were189204910,218898002, and205640850cycles (0.864x,0.920x). The matching native Linux physical-core baseline recorded work medians17766664,8961256, and7442107ns (1.983x,2.387x) and total medians17883289,9094596, and10090354ns (1.966x,1.772x). SMT diagnostic rows pinned to0-7recorded capOS 1/2/4/8-worker work speedups of1.002x,2.870x, and0.644xand Linux speedups of1.993x,2.458x, and2.658x. Raw artifacts are undertarget/thread-scale/amdahl-1048576-physical-20260501T003700Z/,target/thread-scale/amdahl-1048576-smt8-20260501T004200Z/,target/linux-thread-scale/amdahl-1048576-physical-20260501T003400Z/, andtarget/linux-thread-scale/amdahl-1048576-smt8-20260501T004000Z/. This closes the larger-workload evidence blind spot, but the milestone remains open because 1-to-2 work scaling is flat and total-case scaling remains below 1x for 2/4 workers. The guest rows still carry diagnosticaccepted=false; host-summary acceptance remains gated by KVM evidence and the configured 1-to-2 median work and opt-in total thresholds. Guest-measure runs now preserve the benchmark-only userspace symbol map needed to interpret raw PC buckets after collection. Post-threshold-policycapos-benchreruns at main commitf198b099verified the host-summary total-speedup fields while keeping the milestone open. Physical-core pinning0-3recorded work speedups1.002xand1.002xplus total speedups0.911xand0.601xfor 2/4 workers intarget/thread-scale/total-threshold-main-physical-20260501T065028Z/. SMT diagnostic pinning0-7recorded 1/2/4/8 work speedups1.001x,0.998x, and0.333xplus total speedups0.913x,0.621x, and0.200xintarget/thread-scale/total-threshold-main-smt8-20260501T065443Z/. Scheduler-lock site attribution follow-up completed 2026-05-01 09:52 UTC: guest-measure kernels keep the existing aggregatemeasure: scheduler_lockline and add aggregate plus per-phasemeasure: scheduler_lock_sitecounters for generic, timer pre-ring, timer select, blocking, process exit, thread exit, start/idle selection, wake/unblock, and metadata classes. The harness requires those lines forCAPOS_THREAD_SCALE_GUEST_MEASURE=1. Local one-run evidence intarget/thread-scale/20260501T100202Z/verified the new lines and still reportedaccepted=falsewith 1-to-2/1-to-4 work speedups0.998xand1.001xand total speedups0.921xand0.509x. This is bounded split-prep attribution for the known global scheduler-lock bottleneck, not speedup evidence; the later caller-aware placement closeout above is the controlled evidence that passed the work and total gates. - Record aggregate same-process worker placement for
make run-thread-scaleand fix creation-time local concentration. Completed 2026-05-01 12:37 UTC: guest-measure output recorded aggregate publish, selected-CPU, first-selected CPU, and migration buckets for CPU slots 0-3. Newly created non-single-owner threads were published to the least-loaded active scheduler CPU slot, while single-owner capability pinning, generation checks, direct-IPC preference, and allocation-free timer/unblock paths were preserved. This aggregate evidence proved the 4-worker first-selected distribution reached all four scheduler CPU slots, but it was not per-worker identity tracking and it was not speedup evidence. (Update 2026-05-02: the publish counters and the caller-aware placement chain were retired with the per-CPU run-queue collapse;make run-thread-scaleand the kernel measure printer no longer emit the publish__cpu / publish_caller_* fields. Selected-CPU, first-selected CPU, and migration buckets remain. Per-CPU placement evidence returns with the fair-share enqueue policy that Phase D will own.) - If later attribution needs individual worker histories, add per-worker placement output for first scheduled CPU, latest scheduled CPU, migration count, and runnable-owner distribution without replacing the aggregate counters used by the thread-scale harness.
- Treat same-process speedup as a separate claim from multi-process SMP
concurrency. Passing
make run-smp-process-scalemust not imply this milestone is complete. Completed: same-process speedup was accepted only aftermake run-thread-scalecontrolled evidence on the thread-scale harness, separate from the earlier process-scale milestone. - Keep the ordinary
-smp 2regression gate repeatable while the thread-scaling implementation evolves. Themake run-smp2-smokestarget runs the default manifest smoke and the spawn manifest smoke with-smp 2, retaining raw per-target logs under the configured target directory. Closeout evidence passed.
Task Selection
Choose a task that isolates scheduler and CPU parallelism rather than a subsystem bottleneck. Both milestones should use workload shapes with these properties:
- CPU-bound and deterministic, with no network, disk, terminal, or heap-heavy hot path.
- Naturally partitionable into independent chunks so workers do not share a lock, mutable buffer, or capability ring while the timed section runs.
- Verifiable by a compact checksum, count, or known-answer oracle.
- Long enough to dominate boot, process spawn, timer granularity, and serial logging overhead.
- Runnable as independent worker processes for the multi-process milestone, and runnable as sibling threads through the per-thread completion-routing model used by the in-process milestone.
Avoid using IPC throughput, capability-ring dispatch, park wake storms, console logging, or allocator stress as the first SMP scaling claim. Those are valid later benchmarks, but they measure shared kernel bottlenecks as much as CPU scheduling. Same-process thread scaling remains a separate milestone because it needs accepted per-thread-ring timing evidence, not only functional sibling execution.
For the in-process milestone, the default workload should be a uniform fixed-size chunk workload such as BLAKE3-style tree hashing, CRC32C over disjoint buffers, or a small native deterministic block-hash loop. The first implementation does not need a cryptographic dependency; it does need fixed-size chunks, per-thread private output slots, and a root checksum that detects missing, duplicated, or reordered chunks. Prime counting remains valid historical evidence for multi-process concurrency, but it is a weaker same-process scaling workload because numeric range cost is not uniform.
Grounding Files
docs/proposals/smp-proposal.mddocs/proposals/ring-v2-smp-proposal.mddocs/architecture/scheduling.mddocs/architecture/threading.mddocs/research/completion-ring-threading.mddocs/research/out-of-kernel-scheduling.mddocs/research/sel4.mddocs/research/zircon.mddocs/research/x2apic-and-virtualization.md
Notes
Initial multi-CPU scheduling may keep the current process ring while the
runtime serializes process-ring consumption. Full SMP where sibling threads
from one process wait independently on different CPUs should not keep the
process-wide CQ as the kernel ABI endpoint. The target transport model is
per-thread capability rings: cap_enter(min_complete, timeout_ns) waits on the
current thread’s CQ, kernel waiters route completions by generation-checked
ThreadRef, and SQPOLL becomes a per-ring mode with one kernel SQ consumer.
SharedParkSpace park-words still need MemoryObject mapping provenance or object pins before shared-key derivation lands.
2026-04-25 11:36 UTC: commit d88bca7 recorded the First AP Scheduler proof.
AP cpu=1 can run scheduler-owned user contexts under -smp 2, and a one-way
scheduler-owner latch prevents the BSP and AP from both entering
scheduler-owned user work while the process-wide ring remains the active
transport.
Scheduler Evolution Backlog
This backlog decomposes future scheduler architecture from
Scheduler Evolution. It also
retains the completed attribution and placement history that closed the
In-Process Threading Scalability milestone; new selected-milestone work now
continues from docs/tasks/README.md.
Design Grounding Checklist
Before implementation slices, read:
docs/architecture/scheduling.mddocs/backlog/smp-phase-c.mddocs/proposals/smp-proposal.mddocs/proposals/ring-v2-smp-proposal.mddocs/proposals/tickless-realtime-scheduling-proposal.mddocs/proposals/stateful-task-job-graphs-proposal.mddocs/proposals/scheduler-evolution-proposal.mddocs/proposals/system-performance-benchmarks-proposal.mddocs/proposals/hpc-parallel-patterns-proposal.mddocs/research/future-scheduler-architecture.mddocs/research/nohz-sqpoll-realtime.mddocs/research/out-of-kernel-scheduling.mddocs/research/completion-ring-threading.mddocs/research/hpc-parallel-patterns.md
For realtime or isolation slices, also read:
docs/research/multimedia-pipeline-latency.mddocs/research/robotics-realtime-control.mddocs/research/x2apic-and-virtualization.md
Phase A: Attribution and Guardrails
- Finish first-pass thread-scale attribution guardrails. Scheduler candidate/outcome, reschedule-IPI, serial-byte, scheduler-lock, timer interrupt, CR3/TLB, raw guest-PC sample, logging-suppression A/B, exact Linux pthread baseline, compact-versus-padded result-slot diagnostic, and larger-workload/Amdahl evidence now exist. The evidence does not identify the primary remaining non-scaling cause; it keeps per-CPU runnable ownership, accepted threshold-passing work/total evidence, and optional symbolic attribution as follow-on work.
- Add bounded scheduler-lock site attribution before a structural lock
split. As of 2026-05-01 09:52 UTC, measure builds keep the compatible
aggregate
scheduler_lockline and also emit aggregate plus per-phasescheduler_lock_sitecounters for generic, timer pre-ring, timer select, blocking, process exit, thread exit, start/idle selection, wake/unblock, and metadata classes. This is split-prep attribution only; it does not accept the in-process thread-scale milestone. - Add timer-fast-path attribution for the bounded continuation path. As of
2026-05-01 10:58 UTC, measure builds extend the aggregate and per-phase
timercounter lines with fast-path attempts, continues, and fallback reasons for slow-required/dirty summaries, skip-budget exhaustion, pending reschedule IPIs, no-current/non-idle CPUs, and inactive/invalid scheduler CPUs. The thread-scale harness requires those fields only forCAPOS_THREAD_SCALE_GUEST_MEASURE=1. This is attribution only; it does not change scheduler behavior and does not close the currentaccepted=falsework or total gates. Local one-run evidence intarget/thread-scale/20260501T110157Z/passed with the new fields present in every 1/2/4-threadmeasure.log; the timed work phase recordedfast_path_continues=0for all three rows. - Add timer slow-summary reason attribution for dirty fast-path summaries.
As of 2026-05-01 11:28 UTC, measure builds emit aggregate and per-phase
timer_slow_summarylines with required/clean counts plus reason fields for nonempty run queues, direct IPC targets, handoff-current state, pending process termination/drop/stack release, timer sleeps, and timed cap-enter versus park waiters. The harness requires those lines only forCAPOS_THREAD_SCALE_GUEST_MEASURE=1. Local one-run evidence intarget/thread-scale/20260501T112359Z/passed with the new lines present in every 1/2/4-threadmeasure.log; the timed work phase reported dirty summaries attributable torun_queue_nonemptyandhandoff_currentonly, withrequired=2/4/8,clean=0, and timer sleeps/timed waiters at zero for the 1/2/4-thread rows. The subsequent fairness-only behavior slice keeps the same fields, butrequirednow means direct IPC, deferred cleanup, timer sleeps, or timed waiter work still force the next locked timer pass. - Complete thread-scale shared-kernel-state contention attribution
guardrails beyond the first measure-only lock-counter slice. As of
2026-05-01 08:07 UTC,
CAPOS_THREAD_SCALE_GUEST_MEASURE=1emits aggregate and per-phaseshared_kernel_lockcounters for frame allocator alloc/free locks, ring-dispatch cap-table and ring-scratch locks beforecap::ring::process_ring, endpoint inner/cancellation scratch locks, direct per-process address-space locks, and heap allocator locks. As of 2026-05-01 08:29 UTC, fresh thread-scale rows also carry explicit benchmark-class fields and the harness requires, validates, and exports those fields toresults.csv; local one-run evidence is retained intarget/thread-scale/20260501T083254Z/. As of 2026-05-01 08:49 UTC, guest-measure runs also emit and require aggregate and per-phasenetwork_pollcounters for initialized virtio-net scheduler/runtime/interface polling, the built-in TCP HTTP proof poll, virtqueue poll spins and completions, and pending network waiter scans. Local one-run evidence intarget/thread-scale/20260501T093505Z/passed and retained zero aggregate and per-phase network/poll counters for the 1/2/4-thread rows. The default thread-scale manifest has no virtio-net device, and the scheduler poll entry returns before the driver mutex in that no-device case. Those counters are expected zero-evidence for the CPU-bound thread-scale benchmark. They do not prove service throughput; future service/network benchmarks still need their own hot-section attribution and acceptance evidence. - Add a benchmark-kernel mode that suppresses per-context-switch logging
during measured cases so serial MMIO cannot masquerade as scheduler cost.
Completed with
CAPOS_THREAD_SCALE_SUPPRESS_SWITCH_LOGS=1; benchmark proof/error output and measure lines remain enabled. - Decide which counters are permanent observability and which stay behind
measure. Completed 2026-05-01 04:55 UTC indocs/architecture/scheduling.md: all existingkernel/src/measure.rscounters remain benchmark-only behind themeasurefeature. Permanent scheduler observability should be added later through a separate low-overhead operator snapshot surface after the Phase C runtime accounting ledger exists, starting with runtime, context-switch, preemption, voluntary-block, migration, queue-depth, reschedule-IPI, TLB-shootdown, and policy admission/denial counts. Phase/cycle attribution, scheduler-lock wait/hold cycles, serial byte attribution, timer/TLB benchmark totals, raw user-PC samples, and thread-scale phase checkpoints stay behindCAPOS_THREAD_SCALE_GUEST_MEASURE=1. Grounding read:docs/architecture/scheduling.md,docs/proposals/scheduler-evolution-proposal.md,docs/research/future-scheduler-architecture.md,docs/research/out-of-kernel-scheduling.md,docs/research/nohz-sqpoll-realtime.md, anddocs/research/completion-ring-threading.md. - Record controlled benchmark-VM evidence before and after each scheduler
structure change.
Latest follow-up after the first Phase C runtime-accounting slice reran
the in-process thread-scale diagnostic at main commit
a88e7906with QEMU pinned to physical-core logical CPUs0-3and SMT logical CPUs0-7. All rows remainedaccepted=false: physical 1/2/4 work speedups were1.000xand0.999x, and SMT 1/2/4/8 work speedups were1.000x,1.001x, and0.333x. Follow-up after the total-speedup host-summary gate landed reran currentmaincommitf198b099on the benchmark VM with QEMU pinned to0-3and0-7. The harness now reports total-speedup diagnostics explicitly: physical 1/2/4 work speedups were1.002xand1.002x, total speedups were0.911xand0.601x; SMT diagnostic 1/2/4/8 work speedups were1.001x,0.998x, and0.333x, total speedups were0.913x,0.621x, and0.200x. Both host-summary gates remain unsatisfied.
Phase B: Per-CPU Runnable Ownership
-
Land the first bounded per-CPU runnable queue slice. Commit
1a8bf909replaces the single global schedulerVecDequewith four per-scheduler-CPU FIFO queues under the existing global scheduler lock, centralizes enqueue/requeue/removal helpers, keeps single-owner capability processes on CPU0, prefers local work before bounded stealing, preserves direct IPC preference, and removes stale runnable entries for process/thread exit. Review fixes track live run-queue reservations, reserve all per-CPU queues to that count before publishing a new runnable thread, and release reservations on process/thread exit or pre-publication rollback, keeping timer and unblock requeue paths allocation-free after cross-CPU steals. Verification coveredrun-spawn,run-smp2-smokes, and controlled benchmark-VM 1/2/4/8-thread diagnostics. The default workload and total-case 64 MiB rows remainaccepted=false, so this is structure evidence, not milestone closeout. -
Finish
PerCpuRunQueueownership invariants as a documented contract. Completed 2026-05-01 02:13 UTC indocs/architecture/scheduling.md: a live generation-checkedThreadRefhas at most one runnable dispatch owner across current slots, per-CPU run queues, and the direct IPC target; migration is a scheduler-lock-contained remove-before-publish transfer; local-first stealing is bounded by the scheduler CPU slots; and live run-queue reservations keep timer, unblock, direct-IPC fallback, steal retry, and steal requeue paths allocation-free. -
Split current-thread and runnable ownership from shared process/thread metadata without widening emergency-path allocation. Completed 2026-05-01 04:22 UTC in commit
d7221648:Scheduler::processesremains the shared process/thread metadata table, whileSchedulerDispatchnow owns per-CPU run queues, current and handoff slots, idle slots, the direct IPC target, run-queue reservation count, pending process drops, and pending thread stack releases. The existing global scheduler lock and generation checks are unchanged, and the dispatch split keeps the pre-reserved run-queue capacity model for timer, unblock, direct-IPC fallback, steal retry, and steal requeue paths. Verification passedmake fmt-check,cargo build --features qemu, a cachedmake run-spawnrerun, andmake run-smp2-smokesintarget/smp2-smokes/20260501T042343Z/. Controlled benchmark-VM timing after merge56458b12stayedaccepted=false:| Pinning | Workers | Work Median | Total Median | Work Speedup | Total Speedup | | --- | ---: | ---: | ---: | ---: | ---: | | physical `0-3` | 1 | `56275842` | `140953762` | `1.000x` | `1.000x` | | physical `0-3` | 2 | `56290542` | `153327094` | `1.000x` | `0.919x` | | physical `0-3` | 4 | `56315094` | `237018874` | `0.999x` | `0.595x` | | SMT `0-7` | 1 | `56258010` | `140620194` | `1.000x` | `1.000x` | | SMT `0-7` | 2 | `56313324` | `153367860` | `0.999x` | `0.917x` | | SMT `0-7` | 4 | `56352472` | `237971426` | `0.998x` | `0.591x` | | SMT `0-7` | 8 | `169006414` | `727393630` | `0.333x` | `0.193x` | -
Add a bounded timer continuation fast path before a broader scheduler lock split. Completed 2026-05-01 10:29 UTC: user-mode LAPIC timer ticks can continue the current non-idle thread without calling
sched::schedule()only when a previous locked timer slow path published a clean hard-work summary, the current CPU is a valid active scheduler slot, no reschedule IPI is pending for that CPU, and the per-CPU one-skip budget is not exhausted. Dirty producers still force at least one locked pass before bypass, but the 2026-05-01 11:40 UTC follow-up lets that pass classify remaining nonempty run queues and handoff-current markers as fairness/protection-only state. Direct IPC targets, deferred termination/drop/stack cleanup, Timer sleeps, and timed cap-enter/Park waiters still keep the hard slow-path bit set; ordinary ring SQEs and indefinite cap wait scans are still serviced by forced slow-path ticks. This is a correctness-first split-prep slice, not a replacement for narrower scheduler metadata locks or accepted thread-scale evidence. Controlled benchmark-VM physical-core0-3before/after runs for the initial strict-clean version retainedaccepted=false: baselinetarget/thread-scale/timer-fastpath-baseline-main-physical-20260501T102938/recorded work speedups0.998xand0.998xplus total speedups0.907xand0.620x; after-changetarget/thread-scale/timer-fastpath-after-physical-20260501T104700/recorded work speedups1.001xand0.999xplus total speedups0.909xand0.602x. Controlled benchmark-VM physical-core0-3before/after runs for the fairness-only follow-up stayedaccepted=false: baselinetarget/thread-scale/20260501T120224Z/recorded work speedups1.001xand0.999xplus total speedups0.913xand0.587x; after-changetarget/thread-scale/20260501T120709Z/recorded work speedups1.001xand1.000xplus total speedups1.125xand0.828x. -
Add cross-CPU wake policy for endpoint, timer, park, process wait, and thread join completions. Completed 2026-05-01 03:06 UTC: queued wakeups now target the selected per-scheduler-CPU FIFO owner instead of scanning all idle scheduler CPUs.
-
Add explicit placement evidence and placement policy for newly runnable same-process worker threads. Completed 2026-05-01 12:37 UTC, refined 2026-05-01 13:20 UTC, and repaired 2026-05-01 14:58 UTC after the blocking-parent benchmark exposed a placement regression. Measure builds emit aggregate and per-phase
thread_placementlines with single-owner publish buckets, normal publish buckets, caller-current publish buckets, caller-aware avoid, fallback, and strict-load fallback counts, selected CPU buckets, first-selected CPU buckets, and migration totals/targets for CPU slots 0-3.publish_created_thread()receives the caller thread fromThreadSpawner.create, keeps single-owner processes on CPU0, and avoids the caller’s current CPU only when another active ready scheduler CPU has a strictly lower non-idle dispatch load. On equal load, an active-ready caller CPU wins the tie instead of falling through to CPU0-biased least-loaded scanning; if the caller slot is unknown or ineligible, publication falls back to the least-loaded active scheduler CPU behavior. Timer, unblock, direct-IPC fallback, steal retry, and steal requeue paths keep their existing allocation-free targeting behavior.The earlier avoid-caller policy passed the old spinning-parent 1-to-2 gate but failed the repaired blocking-parent shape: before the strict-load fix, controlled capOS evidence regressed to 1-to-2 work/total speedups `0.886x`/`0.928x` because children were biased onto the non-caller queue even when the caller CPU had equal load. The repaired benchmark shape uses blocking parent join, 262,144 blocks (16 MiB), and `work_rounds=64`. The matching Linux baseline scales on the selected physical CPU set with 1-to-4 work/total speedups `3.958x`/`3.834x`. Controlled capOS evidence on the same CPU set passed the enforced 1-to-2 work/total gates with `1.828x`/`1.687x`; the unsuppressed 1-to-4 diagnostic recorded `3.029x`/`2.386x`, and scheduler-switch-log-suppressed diagnostics recorded `3.272x`/`2.303x`. Remaining four-worker limits are now scheduler implementation issues, not benchmark-shape excuses: serial switch logging, global `Scheduler` lock contention, total-time exit/join/block/schedule overhead, and the temporary four-owner CPU mask. -
Add bounded reschedule IPI behavior for idle-to-runnable transitions. Completed 2026-05-01 03:06 UTC: queued wakeups target at most one queue-owner CPU, direct IPC targets at most one eligible idle scheduler CPU, and measure builds emit wake scan, eligible idle CPU, target, sent, pending-skip, not-ready-skip, missing-target, and failure counters.
-
Preserve direct IPC handoff as a scheduling preference without bypassing per-CPU ownership or generation checks. Completed 2026-05-01 03:06 UTC: direct IPC still uses the single preference slot when available and falls back to the normal queued owner path when the target cannot run directly.
-
Prove process/thread exit cleanup cannot leave a stale runnable entry on any CPU queue. Completed 2026-05-01 03:14 UTC: process termination, current-process exit, and
ThreadControl.exitThreadcleanup now assert under the scheduler lock that the exiting process or thread no longer appears in any per-scheduler-CPU FIFO or in the direct IPC target slot. The focused spawn smoke asserts the serial proof markers emitted by the exercised process/thread exit paths. -
Rerun
make run-thread-scale,make run-smp2-smokes, ordinary smoke, spawn/thread, park, ring, and process-exit focused proofs. Completed 2026-05-01 04:18 UTC: local serial reruns passed normalmake run-thread-scaleintarget/thread-scale/scheduler-phaseb-rerun-local-normal-20260501T034800Z/andmake run-smp2-smokesintarget/smp2-smokes/20260501T034414Z/. Controlled benchmark-VM reruns at main commit87be6e25pinned QEMU to physical-core logical CPUs0-3and SMT logical CPUs0-7; all rows remainedaccepted=false, so this closes the Phase B rerun-evidence gate but not the selected in-process speedup milestone.
Phase C: CPU Accounting
- Add monotonic runtime charge points when a running thread leaves the CPU
at context switch, preemption, blocking syscall, direct IPC handoff, and
thread exit. Completed 2026-05-01 05:08 UTC: running intervals are
charged with
crate::arch::context::monotonic_ns()when a current thread stops running through timer preemption, blockingcap_enter/ParkSpace, thread/process exit, and direct switch or handoff paths that select the next current thread. - Observe blocked runtime stability at unblock without charging non-running time. Completed 2026-05-01 05:08 UTC: unblock paths check the blocked runtime snapshot before making the thread ready.
- Track per-thread runtime, virtual runtime seed, context switches,
preemptions, voluntary blocks, and migrations. Completed 2026-05-01
05:08 UTC:
ThreadCpuAccountingis stored on eachThreadrecord and updated under the scheduler/process lock. Context switch counters increment when a thread is selected, preemptions increment only for timer-driven running-to-ready requeue, voluntary blocks increment for blockingcap_enterand ParkSpace waits, and migrations increment when a thread runs on a different scheduler CPU than its previous run. - Add process/session/service aggregation only after the per-thread record
has a single ledger of record. Completed 2026-05-22 13:50 UTC: a
per-
ProcessProcessCpuAccountingledger sumsruntime_nsand a process-levelcontext_switchesdispatch count incrementally at the same scheduler/process-lock charge points that updateThreadCpuAccounting, so it captures exited threads’ contributions. Only the always-present (non-measure) per-thread quantities are rolled up; the measure-gatedpreemptions/voluntary_blocks/migrationscounters are intentionally not aggregated so the default-build proof stays meaningful. The kernel emits asched: process_cpu_accounting pid=... runtime_ns=... context_switches=...line at per-process exit andmake run-spawnasserts a nonzero aggregate. Session/service aggregation remains a stretch follow-on. - Add tests or QEMU diagnostics proving runtime increases while running and
stops while blocked. Completed 2026-05-01 05:08 UTC:
make run-spawnnow asserts a compact scheduler proof line that requires nonzero runtime, context switches, preemptions, and voluntary blocks, plus stable blocked and exited runtime observations. - Keep runtime accounting independent of tickless idle by using the
monotonic clocksource layer. Completed 2026-05-01 05:08 UTC: normal
accounting uses
monotonic_ns()and does not readkernel/src/measure.rscycle counters.
Phase D: Best-Effort Fair Scheduling
Phase D accepted its Task 6 diagnostic closeout at commit 77caafc0
(2026-05-10 19:39 UTC, docs(scheduler): record phase d thread-scale gate)
and closed in docs commit 1a08ec23 (2026-05-10 21:47 UTC,
docs(scheduler): close phase d). The first
Phase D policy is weighted fair queueing on top of the existing
per-thread runtime_ns / virtual_runtime_ns accounting, with a
capability-authorized SchedulingPolicyCap for weight and latency-class
mutation. The controlled Task 6 benchmark pair passed the harness-enforced
1-to-2 work/total gates; capOS recorded 1-to-4 work/total diagnostics
3.088x / 2.700x at 4 workers versus the prior single-global-queue baseline
1.566x / 1.538x, and that 1-to-4 row was manually accepted for Phase D
closeout. The matching Linux pthread baseline on the same host and
physical-core logical CPUs 0,1,2,3 recorded 3.974x / 3.850x. EEVDF is
now a follow-on policy evaluation, not a Phase D blocker. The design content is
in
docs/proposals/scheduler-evolution-proposal.md “Phase D
first-policy decision”, “Phase D capability surface”, “Phase D
migration fairness sketch”, “Phase D test matrix”, and “Phase D
overload behavior” sections. The completed implementation plan is
archived at docs/backlog/scheduler-evolution.md.
The bullets below retain the closed acceptance gates and the
Phase D follow-ons that should be selected explicitly. Phase E
SchedulingContext is the next scheduler authority phase, followed
by Phase F auto-nohz / SQPOLL / tickless idle; generic full-nohz
remains deferred behind those prerequisites.
- Choose initial weighted-fair or EEVDF-like policy based on accounting and
queue data. Resolved
2026-05-05 19:00 UTC: WFQ first; EEVDF deferred. Seedocs/proposals/scheduler-evolution-proposal.md“Phase D first-policy decision”. - Add scheduler entity weights and latency class metadata through a
capability-authorized policy path, not ambient process fields.
Closed by
docs/backlog/scheduler-evolution.mdTasks 1-2:SchedulingPolicyCapschema + kernel cap, per-threadweight/latency_classfields, weighted vruntime, and caller-thread cap binding. - Preserve fairness across CPU migration. Implementation tracked in
docs/backlog/scheduler-evolution.mdTask 4 (vruntime travels with the thread,virtual_finish_nsrecomputed at destination enqueue, bounded steal targets the queue whose head has the lowestvirtual_finish_ns, matching the local pick rule of taking the front of the ascending per-CPU queue). Closed2026-05-08 00:53 UTC: invariants made explicit onrefresh_virtual_finish_ns_lockedand at the steal-insert site; thecfg(feature = "measure")-gatedThreadCpuAccounting.migrationscounter moved from the dispatch-timescheduled_measurepath to enqueue-timerecord_placement_spread_migration_lockedandrecord_steal_migration_lockedarms; weight-change-while- enqueued contract proved by construction with adebug_assert!reinforcement inProcess::refresh_thread_virtual_finish_ns. - Test CPU hogs, short sleepers, direct IPC server/client pairs,
multi-process load, and same-process sibling load. Implementation
tracked in
docs/backlog/scheduler-evolution.mdTask 5 (test matrix smokes) and Task 6 (the controlledmake run-thread-scaleevidence pair: harness-enforced 1-to-2 gates plus a manually accepted 1-to-4 diagnostic closeout row). Closed2026-05-10 19:46 UTC: the benchmark-VM Task 6 run at commit76025f0963a4recorded capOS 1-to-4 work/total diagnostics3.088x/2.700x; the 1-to-2 gate stayed green at1.809x/1.774x. The matching Linux pthread baseline on the same physical-core logical CPUs0,1,2,3recorded3.974x/3.850x. - Define overload behavior when runnable entities exceed the selected CPU
set or when migration cannot keep up. Resolved at the design
level
2026-05-05 19:00 UTC: soft overload uses vruntime ordering (no entity is starved); hard overload defers to Phase FCpuIsolationLeaseand Phase GRealtimeIsland. Seedocs/proposals/scheduler-evolution-proposal.md“Phase D overload behavior”. - Phase D follow-on: EEVDF migration. Once the WFQ slice has
accepted thread-scale evidence, evaluate replacing the bucketed
per-CPU
VecDequewith an EEVDF eligibility set (BTreeMap-by-virtual-deadline) plus per-thread request size and lag accounting. The accounting fields, capability surface, and migration contract carry directly; the change is localized to the dispatch ordering structure. Promote to its own design slice if and when selected; do not bundle it into the WFQ first-slice plan.
Phase E: SchedulingContext Capability
Phase E policy follow-ups are closed. Local owner-shell logout propagation is
recorded in
scheduler-phase-e-local-owner-shell-logout-propagation.
Endpoint donation/return, timeout/depletion notifications, and the
scheduler-observable session lifecycle hook are recorded on main:
scheduler-phase-e-endpoint-donation,
scheduler-phase-e-timeout-depletion-notifications, and
scheduler-session-lifecycle-hook.
The donated-context logout policy is also closed as a conservative
counted/skipped return-path proof:
scheduler-phase-e-session-logout-donated-context-policy.
Timeout/depletion notifications now use fixed per-context notification cells
allocated at context creation/bootstrap. The ordinary non-donated
session-logout stale-context proof is complete through the
UserSession.logout() hook. In-flight endpoint donation uses the conservative
counted/skipped policy during logout and relies on endpoint RETURN/cancel to
finish the in-flight transfer/clear without returning donor budget early. Local
owner-shell exit now calls the same UserSession.logout() path on clean REPL
exit or terminal-close completion; the shell proof observes the scheduler hook
with no bound local shell SchedulingContext, while the focused
session-context proof remains the ordinary bound-context stale evidence.
- Phase E preflight: retire the transitional
CAPOS_SCHED_DISABLE_WFQ=1/WakePolicy::QueueAnysingle-global-queue fallback that Phase D kept for one bisect cycle. This is a scheduler-surface cleanup beforeSchedulingContextclaims budget/period authority; do not treat it as an EEVDF blocker. Completed 2026-05-10 22:20 UTC: the source-level opt-out, queue-0 enqueue funnel, andQueueAnywake policy are gone. - Define the first
SchedulingContextobject shape. Phase E Task 1 adds the minimal schema/control-plane cap shape:SchedulingContextSpeccarries budget, period, relative deadline, byte-oriented CPU mask, and overrun policy;SchedulingContextInfois a read-only snapshot withremainingBudgetNsas derived info-only state; and the kernel/runtime expose an info-onlySchedulingContext.info()cap stub for focused grant/discovery and client decode coverage. ThecpuMaskfield is a canonical little-endian bitset: CPUnis bitn % 8of byten / 8, empty means no CPUs selected, producers omit trailing zero bytes, and non-empty canonical masks end in a nonzero byte. Dispatcher budget enforcement, replenishment, bind/revoke rules, donation/return, depletion notifications, realtime islands, SQPOLL, and nohz remain deferred. - Add capability creation/bind/revoke rules and generation identity. The
second Phase E control-plane slice keeps
info()method id 0 stable, adds same-interface context creation as a bounded result-cap transfer, records at most one caller-thread binding per context generation, and revokes by advancing the context generation and clearing the matching thread metadata binding. Bootstrap grants and created contexts use the same non-wrapping context-id allocator so distinct caps cannot alias the(contextId, generation)binding key. The focusedmake run-scheduling-contextQEMU smoke proves distinct bootstrap identities, create result-cap adoption, bind/revoke, stale-generation calls, release cleanup, and the explicitinfoOnlyNoDispatchChangedispatch-effect marker. Stale caps reportstaleGenerationand cannot mutate scheduler metadata; revoked contexts reportrevoked. Dispatch selection, WFQ ordering, runtime charging, replenishment, donation/return, timeout/depletion notification, realtime islands, SQPOLL, auto-nohz, and CPU placement enforcement remain future work. - Enforce budget and replenishment in the kernel dispatcher. First Phase E
budget enforcement landed 2026-05-11 08:38 UTC:
bindCallerThread()now installs a fixed per-thread budget ledger under the scheduler/process locking model, runtime charge decrements the bound context budget at the existing dispatch charge points, runnable selection replenishes elapsed periods without allocation, and exhausted contexts stay queued butRetryLateruntil their next period. Deadline-driven accounting closed the previous periodic-tick granularity caveat on 2026-06-04: the ordinary dispatch path arms a sub-tick budget-exhaustion one-shot when the selected thread’s remaining budget would deplete before the next scheduler tick, kernel-mode one-shot fires restore a live periodic timer, nohz re-arm folds the leased thread’s budget deadline into its existing nearest deadline, and nohz budget depletion restores the periodic tick withreason=scheduling-context-budget-throttled.make run-scheduling-contextproves visible charge, replenishment to full budget, stale/revoked fail-closed behavior, and a throttled wall-clock window withdispatch_effect=budgetEnforced; the representative 5 ms deadline marker recordedelapsed_since_arm_ns=5474819,overshoot_ns=474819,remaining_after_ns=0, andbounded_charge=true. At that slice’s landing, donation/return, depletion notifications, realtime islands, SQPOLL, auto-nohz, and CPU placement enforcement remained future work. - Add endpoint donation/return semantics for synchronous calls and passive
services. Completed 2026-05-11 10:51 UTC: endpoint in-flight call state
now carries a bounded internal donation token when a caller with a bound
SchedulingContextdelivers a synchronous CALL to a receiver thread without its own context. The scheduler charges pre-donation caller runtime before moving the ledger, charges passive-server runtime before returning the ledger, and returns the remaining budget to the caller before waking it when RETURN commits, commits an application exception, or fails with an invalid caller result buffer. RETURN preflight failures keep the in-flight donation intact; delivery/return cancellation paths return or clear the donation without allocating. A donor with an in-flight token is blocked from returning to userspace until the endpoint call returns or is canceled. Nested donation of an already donated context is rejected until stacked return tokens have a dedicated design. The focusedmake run-scheduling-contextsmoke now includes a same-process endpoint round trip withendpoint_donation=ok,endpoint_return=ok,endpoint_exception_return=ok,endpoint_invalid_return=ok, andendpoint_nested_rejected=ok, plus anendpoint_donor_block=okdelayed-servercap_enter(0, 0)proof, anendpoint_donor_fast=okfast-return race proof, and remaining-budget fields for successful RETURN, application-exception RETURN, invalid-result RETURN, nested-donation rejection, donor blocking, and fast donor return. This is synchronous endpoint donation/return only; depletion notifications, realtime islands, SQPOLL, auto-nohz, CPU placement enforcement, and session-logout stale-context coverage remain future work. - Add a scheduler-observable session lifecycle hook from
UserSession.logout()into scheduler-ownedSchedulingContextstale-marking. The hook covers explicit logout plus the remote DTO gateway logout/connection-teardown paths that already callUserSession.logout(): after the liveness cell flips to logged out, the scheduler scans process/thread metadata for the same session liveness cell, removes non-donated matching bindings from its ledger, and advances the bound context generation as revoked so ordinary old grants become stale. The hook preserves the scheduler as the binding authority and avoids scheduler-lock to context-record-lock inversion by taking one binding under the scheduler lock, dropping that lock, and then marking the context stale through its cleanup token. In-flight endpoint donation bindings are explicitly skipped because returning donor budget before endpoint cancellation would violate the donor-blocking invariant. This hook unblocks focused stale-context proofs: ordinary non-donated logout, donated-context policy, and local owner-shell propagation are now closed by their dedicated task records. - Add timeout/depletion notifications with preallocated emergency-path
storage. Completed in the timeout/depletion notification slice: every
SchedulingContextowns a fixed notification cell allocated at context creation/bootstrap, with coalescing slots for budget depletion and deadline/timeout, sequence counters, bounded coalesced-event counts, holder identity, donated-holder marking, remaining budget, and next timestamp snapshots. Scheduler charging, timeout/deadline observation, donation-return, and cancellation paths update only that fixed state; they do not allocate, publish result caps, append unbounded queues, or require hard-path logging.SchedulingContext.drainNotifications()exposes typedok,revoked, andstaleGenerationobserver results, plusexplicitRevokelifecycle state. The focusedmake run-scheduling-contextsmoke proves repeated budget-depletion coalescing, deadline notification, explicit revoke, stale observer labels, and endpoint-donated notification accounting. A pre-armed observer waiter/wakeup path remains a separate follow-up. - Extend stale-context proofs beyond the first revoke/generation contract to process and thread exit. The focused SchedulingContext smoke now proves that a context bound by an exiting thread becomes unbound without minting fresh budget on rebind, while process-exit and explicit process-termination children bind contexts and run the process cleanup path before cap-table release.
- Extend stale-context proofs to session logout. Completed for ordinary
non-donated contexts at 2026-05-11 17:44 UTC. This remains separate from
process/thread exit because logout propagation is owned by the session
lifecycle surface, not the scheduler dispatch loop. The focused
session-context smoke now binds a
SchedulingContextin a session-owned child, callsUserSession.logout(), observes the scheduler hook line, and proves the old cap is stale before budget refresh, caller-thread rebind, result-cap publication, or metadata mutation. Process/thread exit cleanup remains covered bymake run-scheduling-context. - Prove donated receiver logout policy. Completed at 2026-05-11 18:19 UTC.
Logout keeps the existing conservative counted/skipped behavior for
receiver threads holding endpoint-donated
SchedulingContextbindings. The focused session-context smoke has a donor call a guest-session receiver, the receiver logs out while holding the donated binding, the scheduler hook reportsstale_marked=0 donation_inflight_skipped=1, the donor remains blocked incap_enter(0, 0)until endpoint RETURN, and the donor context returns bound with reduced remaining budget rather than a refreshed or minted budget. Local owner-shell lifecycle propagation was closed separately byscheduler-phase-e-local-owner-shell-logout-propagation. - Propagate local owner-shell exit to session logout. Completed at
2026-05-11 19:36 UTC. Clean local REPL
exitand terminal-close completion now call the heldUserSession.logout()before process exit, so the session liveness cell is marked logged out through the same kernel hook used by explicit logout and the remote DTO gateway. The shell smoke asserts the scheduler-observable hook line withstale_marked=0 donation_inflight_skipped=0; ordinary boundSchedulingContextstale behavior remains proven by the focused session-context smoke through the same hook. Process/thread-exit cleanup remains separate and unchanged.
Phase F: CPU Isolation Lease and SQPOLL
The Phase E gates and the first Ring/SQPOLL ownership prerequisite are now
closed. Dispatch through
scheduler-phase-f-auto-nohz-sqpoll
only through its own Phase F authority, telemetry, rollback, and nohz/SQPOLL
tasks; this backlog entry does not implement Phase F behavior. The concrete
ring prerequisite is
scheduler-phase-f-one-sq-consumer-ring-ownership,
closed on 2026-05-11: ring endpoints now have generation-checked syscall-mode
SQ-consumer leases, duplicate future SQPOLL acquisition is rejected while that
owner is live, stale owner generations cannot advance SQ head, teardown
releases the owner without clearing accepted completions, and bounded SQPOLL
admission metadata exists without starting a poller.
The first executable Phase F child task,
scheduler-phase-f-cpu-isolation-lease-scaffold,
closed on 2026-05-12 12:02 UTC. It is limited to CpuIsolationLease authority,
activation preflight telemetry, and rollback scaffolding. It does not enable
SQPOLL, automatic nohz, tick suppression, automatic CPU isolation, or generic
full-nohz behavior. The second executable child task,
scheduler-phase-f-nohz-activation-telemetry,
closed on 2026-05-12 14:18 UTC. It turns the disabled preflight into observable
activation/deactivation and rollback decisions while still leaving tick
suppression, SQPOLL, automatic CPU isolation, and generic full-nohz disabled.
The housekeeping/deferred-work placement child closed on 2026-05-12 18:36 UTC
by
scheduler-phase-f-housekeeping-deferred-work-placement:
the scheduler now records an explicit online housekeeping CPU placement input,
selected housekeeping mask, deferred cleanup/timer/network/IRQ/accounting
placement or rejection labels, and bounded revoke, process-exit,
service-replacement, and session-logout cleanup placement while ticks remain
periodic.
The bounded SQPOLL ring-mode child closed on 2026-05-12 20:29 UTC by
scheduler-phase-f-sqpoll-ring-mode-bounded-poller:
ring endpoints now transition explicitly through syscall, SQPOLL starting,
running, sleeping, stopping, and rollback modes; a kernelSqpoll
CpuIsolationLease admits one bounded periodic-tick poller for the caller
thread’s ring; producer wakeups use NEED_WAKEUP; stale SQ owners fail before
SQ-head consumption; and poller stop/revoke preserves accepted CQEs while
releasing SQ ownership. Actual tick suppression is blocked until the
SQPOLL progress path no longer depends on periodic scheduler ticks. The
clockevent/deadline substrate child closed on 2026-05-12 23:07 UTC by
scheduler-phase-f-clockevent-deadline-substrate:
normal QEMU/x86_64 monotonic_ns() is backed by the calibrated TSC rather
than TICK_COUNT, the periodic LAPIC tick disciplines the TSC epoch while nohz
is disabled, Timer.sleep, finite cap_enter, and park waiters store
absolute monotonic deadlines, and the LAPIC clockevent backend can program a
bounded one-shot deadline and restore periodic mode. The substrate’s firing
precision is now proven, not only its programming: the
scheduler-lapic-oneshot-subtick-firing-precision child (closed
2026-06-04 03:26 UTC, commit 49b36129) arms a TICK_NS/2 one-shot over the live
periodic timer during boot and
measures the actual countdown-to-fire instant, asserting via
make run-scheduling-context that it fires sub-tick (~5 ms for a 5 ms request,
well under the 10 ms tick) with the current-count correctly reset to the
sub-tick value – ruling out the suspected “INITIAL_COUNT write does not reset
the running countdown” root cause – and that the kernel-mode-fire periodic
restore leaves a live timer (no lost-timer hang). Automatic nohz, tick
suppression, SQPOLL nohz, generic full-nohz, and production realtime admission
remain disabled. Known pre-existing gate flake (independent of the
firing-precision proof, which passed in 100% of measured boots): the
scheduling-context-smoke budget-timing proof exited early in ~20% of boots on
both main and this branch under host load – its wall-clock budget-throttle
assertions are sensitive to host scheduling jitter. Run make run-scheduling-context on an otherwise-idle host until the budget proof is
stabilized (own follow-up); it is orthogonal to the clockevent firing assertions.
A second substrate prerequisite surfaced 2026-06-04 from
scheduler-deadline-driven-budget-accounting’s Attempt 2: even with the LAPIC
one-shot firing precisely sub-tick, the monotonic clocksource discipline floored
a sub-tick interval to a full tick. A boot probe measured a real 5.0 ms interval
advancing monotonic_ns by 10.0 ms after one discipline_clocksource_tick step
(monotonic_delta_ns=10000020 for real_ns=5000118, floored=true), because
discipline_clocksource_tick took max(tsc_interpolated, epoch + TICK_NS) on
every fire. That was the real cause of that task’s Attempt 1 “9.85 ms” – not the
LAPIC firing (fixed) and not the ordinary-path timer-ISR rechecks (which provably
no-op when no nohz/idle window is active). The prerequisite
scheduler-monotonic-clocksource-subtick-discipline
closed it (2026-06-04): discipline_clocksource_tick now trusts the TSC
interpolation at sub-tick granularity, falling back to the TICK_NS floor only
when the interpolated advance is below MIN_DISCIPLINED_ADVANCE_NS (TICK_NS / 8)
so a degenerate (stalled/backward/mis-calibrated-slow) TSC still keeps a minimum
forward rate; the tick-derived fallback is unchanged. A boot proof
(context::qemu_clocksource_subtick_discipline_proof, emitted on
make run-scheduling-context) runs one real TICK_NS / 2 discipline step and
asserts monotonic_ns() tracked the sub-tick delta – measured
monotonic_delta_ns=5055612 for real_ns=5000474 (floored=false,
subtick_tracked=true). Deadline-driven budget accounting and generic full-nohz
can now observe a sub-tick deadline through the accounting clock.
The SQPOLL nohz-progress child closed on 2026-05-13 00:06 UTC by
scheduler-phase-f-sqpoll-nohz-progress:
cap_enter now has a bounded current-thread SQPOLL service entry for
producer wakes and syscall kicks that borrows the SQPOLL owner lease, charges
the admitted accounting target, and reports non-periodic progress evidence
while ordinary periodic service remains active. Automatic policy-service nohz
issuance and production realtime admission remain future work; generic SQPOLL
nohz for explicitly leased caller-thread rings landed in the later Step 14
slice.
The tickless-idle child closed on 2026-05-23 09:12 UTC by
scheduler-tickless-idle-step6:
the CPL0 idle loop now admits an idle-only tickless window when no non-idle
work is runnable, no nohz lease is active, no local deferred cleanup is
pending, no cap-enter polling dependency is present, and the LAPIC one-shot
clockevent plus monotonic clocksource are available. The periodic tick is
restored before non-idle dispatch and on rollback. Legacy cap-enter polling
surfaces, including the terminal shell path, remain periodic until they gain
explicit deadline or housekeeping placement.
- Define
CpuIsolationLeaseauthority separately from CPU-time budget. Completed 2026-05-12 12:02 UTC bydocs/tasks/done/2026/scheduler-phase-f-cpu-isolation-lease-scaffold.md. - Add scheduler activation proof for housekeeping, deferred cleanup, timers, networking, IRQ affinity, live accounting target, one-SQ-consumer state, and revocation latency. The scaffold reports blocked eligibility and leaves ticks/nohz/SQPOLL disabled.
- Enforce one live SQ consumer per ring before SQPOLL. Completed
2026-05-11 by
docs/tasks/done/2026/scheduler-phase-f-one-sq-consumer-ring-ownership.md. - Integrate SQPOLL ring mode only after this ownership prerequisite and
docs/tasks/done/2026/scheduler-phase-f-housekeeping-deferred-work-placement.mdhave landed. Completed 2026-05-12 20:29 UTC bydocs/tasks/done/2026/scheduler-phase-f-sqpoll-ring-mode-bounded-poller.md. - Add lease revocation on explicit revoke, process exit, service
replacement, and session close. Completed by the focused
make run-scheduler-cpu-isolation-leaseproof. - Add nohz activation/deactivation telemetry. Completed 2026-05-12 14:18 UTC by
docs/tasks/done/2026/scheduler-phase-f-nohz-activation-telemetry.md. The proof records active-candidate rejection, stale/revoked rollback, ready housekeeping CPUs under-smp 4, exactly-one-runnable target CPU evidence, deferred cleanup/timer/network/IRQ labels, valid accounting targets, explicit clocksource/accounting readiness or refusal, live syscall SQ-consumer state, revocation-latency policy, and disabled tick/SQPOLL/full-nohz guardrails. - Assign housekeeping and deferred-work placement before behavior.
Completed 2026-05-12 18:36 UTC by
docs/tasks/done/2026/scheduler-phase-f-housekeeping-deferred-work-placement.md. The proof keeps periodic ticks, SQPOLL, automatic CPU isolation, and generic full-nohz disabled. - Add bounded SQPOLL ring mode only after housekeeping/deferred-work
placement. Completed 2026-05-12 20:29 UTC by
docs/tasks/done/2026/scheduler-phase-f-sqpoll-ring-mode-bounded-poller.md. The proof covers one poller owner, bounded polling, stale queue-owner rejection, wake/sleep ordering, and teardown without losing completions while periodic ticks remain active. - Add clockevent/deadline substrate before automatic nohz activation.
Completed 2026-05-12 23:07 UTC by
docs/tasks/done/2026/scheduler-phase-f-clockevent-deadline-substrate.md. It split clocksource reads from clockevent programming, added a one-shot/restore timer backend, and converted tick-count waiters to absolute monotonic deadlines while ordinary scheduling remains periodic. - Add SQPOLL nohz progress that does not depend on periodic scheduler
ticks. Completed 2026-05-13 00:06 UTC by
docs/tasks/done/2026/scheduler-phase-f-sqpoll-nohz-progress.md. The proof preserves the one-SQ-consumer,NEED_WAKEUP, bounded polling, stale-owner rollback, and teardown/completion invariants while keeping periodic fallback service active. - Add automatic nohz activation only after placement, bounded SQPOLL
behavior, the deadline substrate, and non-periodic SQPOLL progress.
Completed 2026-05-14 09:01 UTC by
docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md. TheCpuIsolationLeaseactivation preflight now performs real per-CPU periodic-tick suppression for the narrow single-runnable-entity window (namedRing = nonecompute lease on the preflight CPU): it masks the periodic LAPIC tick and arms a bounded one-shot deadline atmin(nearest pending timer wakeup, now + max revocation latency). Network polling and IRQ affinity stay read-only fail-closed admission gates – any ring-coupled or device-owning mode keeps the conservative refusal. Every disqualifying change (stale lease generation, a second runnable entity, stealable sibling work, a local deferred-cleanup dependency, a target-CPU mismatch, or a one-shot backend that can no longer arm a deadline) rolls the CPU back to the periodic tick first. Themake run-scheduler-cpu-isolation-leaseproof asserts the activation and rollback log lines. Generic full-nohz and the broader SQPOLL-driven nohz state machine landed in later slices. - Measured suppressed-tick proof on the lease path (harness-hardening).
Completed 2026-06-02 19:53 UTC by
docs/tasks/done/2026-06-02/scheduler-cpu-isolation-measured-suppressed-tick-proof.md. Closes the review-identified honesty gap that the lease path proved suppression only by thetick_suppression=active periodic_tick=maskedmarker plus a no-hang progress loop, never that periodic timer interrupts actually stopped arriving. The kernel now counts genuine periodic LAPIC fires per CPU (account_timer_firein the timer ISR increments only when neither the lease-backed nor idle tick-suppression bit is set, so the one-shot replacement is never miscounted), snapshots the count at activation, and on rollback emitscpu-isolation: nohz suppressed-ticks cpu=<n> window_ns=<w> expected_periodic=<e> actual_periodic=<a> suppressed=<e-a>; a bounded post-rollbackcpu-isolation: nohz restored-rateline proves the periodic rate returns. The demo holds a childless compute lease on CPU 0 across a ~150 ms masked window, then a busy restore window; the harness asserts a masked window withactual_periodicnear zero (expected_periodic >= 10,suppressed >= 8) and a restored window withactual_periodictrackingexpected_periodic(>= 8). No activation behavior changed; the mask/one-shot mechanism is untouched. A durableticks_suppressed{cpu,mode}telemetry field on a monitoring/status surface remains future work. - Timeout-based auto-revoke primitive on
CpuIsolationLease. Landed viadocs/tasks/done/2026-05-30/scheduler-cpu-isolation-lease-timeout-auto-revoke.md. AddsleaseLifetimeNs @6toCpuIsolationLeaseSpec(0= no expiry, preserving every existing producer);read_specclamps to a one-hour ceiling and rejects a non-zero lifetime belowmaxRevocationLatencyNs(invalidSpec). A lease recordsexpires_at_nsat creation; the first observation past the deadline auto-revokes through the existing generation-advancing cleanup (reason=lease-expired, registry unregister, SQPOLL stop,rollback_nohz_for_lease) and every subsequentinfo/activationPreflight/revokereportsstaleGeneration. The nohz activation record carries the lifetime deadline so a tickless CPU under a lease that crosses its lifetime rolls back at the next timer/IPI recheck (lease-lifetime-expireddisqualifier), bounded bymaxRevocationLatencyNs.make run-scheduler-cpu-isolation-leaseasserts the expiry release line, the post-expirystaleGeneration, and theinvalidSpecrejection. - Enable tickless idle only when there is no runnable non-idle work and no
cap-enter polling dependency. Completed 2026-05-23 09:12 UTC by
docs/tasks/done/2026/scheduler-tickless-idle-step6.md. The idle path masks the periodic LAPIC tick only for true idle, arms a bounded one-shot at the nearestTimer/ParkSpacedeadline or 100 ms housekeeping floor, and restores periodic mode before ordinary work. Ready-but-budget-throttledSchedulingContextretry windows remain periodic so budget replenishment and deadline notification timing stay on the existing scheduler accounting path. - Keep automatic full-nohz behind the completed one-SQ-consumer ownership
prerequisite and the narrower
CpuIsolationLeasetelemetry/rollback proof. Generic full-nohz is not the first Phase F implementation task.
Phase F.5: Full-SMP Hardware Scalability
This phase is the planning slot for the next visible SMP milestone when the project is ready to answer whether capOS uses 16/32-core machines well. It does not replace the current Installable System selected milestone and should not be dispatched as a QEMU-only benchmark cleanup. QEMU remains regression infrastructure; the primary performance record should come from direct capOS execution on a dedicated high-core perf runner or bare-metal/cloud-bare-metal machine.
- Replace temporary four-owner scheduler assumptions with dynamic CPU topology: discovered scheduler CPU set, physical-core versus SMT sibling labeling, APIC id mapping, per-CPU allocation sizing, and boot/status output that makes the selected CPU set auditable.
- Add or select the APIC backend needed for high-core machines. xAPIC MMIO
can remain the current low-core path, but x2APIC selection is the likely
larger-APIC-id follow-up from
docs/research/x2apic-and-virtualization.md. - Shrink scheduler shared-state serialization. Local pick/requeue should avoid one global scheduler-lock critical section where possible, while shared process/thread metadata, blocking waiters, direct IPC handoff, timers/deadlines, and cleanup keep explicit ownership and rollback rules.
- Add topology-aware placement and observable migration policy. The record should distinguish local enqueue, cross-core wake, steal, SMT sibling placement, failed placement, reschedule IPI, and TLB-shootdown costs.
- Build the hardware benchmark profile from existing benchmark proposals: static map/reduce, uneven dynamic task pool, barrier phase loop, independent processes, same-process threads, and one capability-call/service-bound workload. Each workload reports work-window and total-time rows at 1/2/4/8/16/32 workers when hardware exists.
- Record matching native Linux rows on the same machine, plus capOS raw artifacts with source commit, toolchain, topology, frequency/isolation policy, run count, warmup policy, verifier output, medians, variance, speedup, efficiency, and scheduler counters.
Phase G: Realtime Islands
- Define
RealtimeIslandadmission inputs: scheduling contexts, memory reservations, device/IRQ reservations, communication paths, CPU leases, and overrun policy. - Add a small local-audio or synthetic periodic-control proof before robotics or provider workloads.
- Prove no allocation, blocking endpoint call, paging, or logging on the admitted realtime path.
- Record deadline misses and overrun handling as observable output.
Phase H: Policy Service
- Define a privileged scheduler policy service interface for admission, budget/profile updates, CPU lease grant/revoke, and diagnostics.
- Keep kernel fallback scheduling independent of policy-service liveness.
- Add manifest/config hooks for default profiles without making policy changes require kernel rebuilds.
- Add operator diagnostics that explain why a thread or island was denied, throttled, migrated, or revoked.
- Define how stateful task/job graph assignment metadata maps into
scheduler policy inputs: graph priority to weight/latency class, graph
deadline to request freshness or admission input, graph budget to
SchedulingContextreference, and graph queue to policy-service placement. The graph coordinator must not mint CPU authority by itself. - Design the user-space policy-service AutoNoHz placement heuristic for
ordinary threads that appear capable of utilizing a full CPU core. The
policy service synthesizes the “thread appears capable of utilizing a
full CPU core” decision from a future monitoring/status surface and
issues a bounded
CpuIsolationLeaseagainst a pre-authorized account or session CPU pool. The lease is placement only; it does not mint CPU-time authority. Required bounds on every auto-issued lease: lifetime shorter than admin-issued leases by default and renewable only by re-observing the signal;max_revocation_latency_nsbounded byNoHzEligibility; accounting target a liveSchedulingContextor coarseResourceLedger; CPU set restricted to the operator-declared auto-claim pool; priority-aware fairness preemption that terminates the lease (not just rolls back tick suppression) on arrival of an equal-or-higher priority runnable entity. Prerequisites: (a) a timeout-based auto-revoke primitive onCpuIsolationLease– LANDED 2026-05-30 asleaseLifetimeNs @6(0= no expiry) with enforced first-observation auto-revoke and alease-lifetime-expirednohz rollback; the auto-claim placement lease can now be granted with a bounded lifetime. The boundedrenewhalf LANDED asCpuIsolationLease.renew @4, which pushes the deadline forward by at most the original lifetime while keeping the lease’s identity / accounting / nohz state, leaving only the renewal-by-re-observation heuristic (when to callrenew) to Phase H; (b) the monitoring/status surface that exports per-thread saturation observation – LANDED 2026-05-30 as the non-measureper-thread saturation status surface.voluntary_blocksandpreemptionswere promoted out ofcfg(feature = "measure"), an always-builtrunnable_accumulated_nsrunnable-but-not-running accumulator was added (stamped at the run-queue enqueue chokepoint, accumulated at selection), and all three plusruntime_nsare exported throughSchedulingPolicyCap.snapshot @2(proofmake run-thread-fairness: hogvoluntary_blocks=0with livepreemptions/runnable_ns).migrationsstaysmeasure-gated. This read-side surface exports raw cumulative counters only; windowing and the saturation decision remain policy-service work; (c) the pool-grant authority shape that lets an operator pre-authorize an account’s auto-claim pool. Declared-pool descriptor LANDED 2026-05-30: theCpuIsolationLeaseSpeccarriespoolId @7(0= the implicit default pool over every scheduler CPU), the kernel seeds a fixed declared-pool registry (CpuIsolationPoolDescriptor: default pool0plus one declared non-default pool1over a single CPU), andread_specadmits a lease only when itspoolIdis declared and itsallowedCpuMaskis a subset of the pool’s CPU mask – echoing the admitting pool’s id/mask throughCpuIsolationLeaseInfo(proofmake run-scheduler-cpu-isolation-lease:nondefault_pool=invalidSpec(undeclared id),declared_pool=ok admitted_pool_id=1 admitted_pool_cpu_mask_subset=true,declared_pool_mask_violation=invalidSpec,default_pool_id=0). Manifest-sourced pool table LANDED 2026-05-30: the declared-pool registry is sourced from the boot manifestSystemConfig.cpuIsolationPools @14(each entry aCpuIsolationPoolDescriptor), with the in-kernel constant as the fail-closed default when the manifest omits/empties the list; the kernel validates each entry fail-closed at boot (canonical CPU mask subset of the scheduler mask, default pool0synthesized if omitted, duplicate ids rejected) and emitscpu-isolation: declared-pools source=manifest count=3 ...(proofmake run-scheduler-cpu-isolation-lease; kernel-default fallback proven bycargo test-configdecode/empty assertions). Per-pool live-lease capacity bound LANDED 2026-05-31:CpuIsolationPoolDescriptorcarriespoolMaxLeases @2(0= unbounded); a non-zero value caps the number of simultaneously live (non-revoked, current-generation) leases the kernel admits against that pool at create-time, counted from the existingLEASE_REGISTRYafterprune_dead, rejecting an over-capacity create fail-closedresourceExhausted. The manifest bounds pool2atpoolMaxLeases: 2; the proof admits two live leases, refuses a third (cpu-isolation: pool-capacity-rejected admitted_pool_id=2 live_leases=2 pool_max_leases=2 result=resourceExhausted,pool_capacity_exceeded=resourceExhausted), and reclaims after a revoke (pool_capacity_reclaimed=ok) – live-count, not cumulative. This is the count+reject mechanism the per-accountNpolicy keys onto. Account identity + per-accountNLANDED 2026-05-31:CpuIsolationLeaseSpeccarriesaccountId @8 :UInt64(0= unattributed, caller-asserted and inert until counted, echoed read-only throughCpuIsolationLeaseInfo.accountId @6) andCpuIsolationPoolDescriptorcarriespoolMaxLeasesPerAccount @3 :UInt32(0= unbounded per account). After the pool-wide check,registercounts the requesting account’s live entries (admitted_pool_idANDaccount_idboth matching) against the per-account bound and rejects an over-bound create fail-closedresourceExhausted(0account or0bound skips the gate). The manifest bounds pool2atpoolMaxLeasesPerAccount: 1; the proof admits one account-7 lease, refuses a second account-7 create (cpu-isolation: account-capacity-rejected admitted_pool_id=2 account_id=7 account_live_leases=1 pool_max_leases_per_account=1 result=resourceExhausted,account_capacity_exceeded=resourceExhausted), admits a different account-9 lease on that CPU (account_capacity_other_account=ok– per-account, not pool-wide), and reclaims after revoking account-7 (account_capacity_reclaimed=ok). The account id is caller-asserted, not yet authenticated. Bootstrap pool-grant authentication LANDED 2026-05-31:CpuIsolationPoolGrant(schema/capos.capnp, sourcecpu_isolation_pool_grant, kernelkernel/src/cap/cpu_isolation_pool_grant.rs) introduced a bootstrap-staged grant binding one authenticated account to one declared pool.createLeasestamps the bound account/pool onto the minted lease, overriding any caller-assertedaccountId/poolId, and reuses the exact lease-create admission path (cpu_isolation::create_lease_for_caller), so the per-account bound is unforgeable: a holder can no longer assert another account to evadepoolMaxLeasesPerAccount. The initial proof used one account-7/pool-2 grant; the current manifest-sourced proof below exercises multiple seeded grants. Manifest-declared multi-account grant table LANDED 2026-06-01: the grant binding is now operator-declared viaSystemConfig.cpuIsolationPoolGrants(schema/capos.capnp, decoded incapos-config, seeded at boot bycpu_isolation_pool_grant::seed_pool_grantsafterseed_declared_pools), mirroring the manifest-sourcedcpuIsolationPoolstable; thecpu_isolation_pool_grant/cpu_isolation_pool_grant_secondarysources stage seeded binding index0/1, so a manifest can pre-authorize multiple distinct(account, pool)grants, each staged as its own bootstrap cap. An absent/empty list falls back to one in-kernel binding at index0: account7bound to preferred pool1when active, otherwise account7bound to synthesized default pool0, so manifest-sourced pool tables that omit pool1still stage a usable default grant. Proofmake run-scheduler-cpu-isolation-pool-grantnow boots a two-entry grant table (account5/pool1, account8/pool2), holds both grant caps, and proves each stamps its OWN bound account (pool-grant: create ok bound=A stamped_account_id=5 .../bound=B stamped_account_id=8 ...) with the per-account bound still enforced fail-closed under the manifest-sourced path; boot evidencecpu-isolation: pool-grants source=manifest count=2. Fallback proofmake run-scheduler-cpu-isolation-pool-grant-defaultboots a manifest-sourced pool table that declares pool2and omits pool1plus an empty grant list; the kernel stages one default grant as(account 7, pool 0)and the smoke proves it can mint a stamped lease. Runtime grant minting landed (CpuIsolationGrantMinter): one cap mints a freshCpuIsolationPoolGrantfor an operator-chosen(account, pool)at call time, bounded by the declaredSystemConfig.cpuIsolationGrantMinterAllowlist(an out-of-allowlist mint is refusedunauthorized, so it is never an ambient grant-any authority; the minted grant reuses the same unforgeablecreateLeaseadmission path). The samerun-scheduler-cpu-isolation-pool-grantsmoke now also mints a grant for the allowed(account 6, pool 2), proves itscreateLeasestamps account6and stays bounded by the per-account gate, and proves an out-of-allowlist(account 99, pool 2)mint is refused; boot evidencecpu-isolation: grant-minter-allowlist source=manifest count=1. Grant-revocation lifecycle landed (CpuIsolationGrantMinter.revokeGrant): a runtime-minted grant gets a revocable(grantId, generation)identity;revokeGrant(grantId)advances the grant generation so a stale grant handle’screateLeasefailsstaleGeneration, and cascades to every live lease minted through it – reusing the landed fairness-termination cleanup (reason=grant-revoked, periodic-tick rollback, registry unregister) so the per-pool/per-account live-lease capacity frees immediately and a fresh grant is admitted into the reclaimed slot. Double-revoke isalreadyRevokedand an unknowngrantIdisunknownGrant, both fail-closed. The samerun-scheduler-cpu-isolation-pool-grantsmoke proves the full lifecycle. This closes Track C (prerequisite (c)) – operator grant authority is now mint + revoke complete. Detailed design indocs/proposals/tickless-realtime-scheduling-proposal.md“Policy-Service Userstories: AutoNoHz Placement for Compute-Capable Threads”.
AutoNoHz Decomposition: Roadmap to Full Auto-NoHz
The status bullet above narrates what landed. This subsection is the discrete dispatchable decomposition from the current landed state to full operator-driven auto-nohz, so the path is written as concrete slices rather than “future work” prose. Grounding: the proposal’s “Policy-Service Userstories: AutoNoHz Placement”, “Bounds the policy service must enforce”, “Telemetry Requirements”, and Implementation Sequence steps 7/14/17.
Landed substrate (not repeated below): the narrow manual per-CPU LAPIC
tick-mask for the single-runnable compute window and the SQPOLL-coupled
window, tickless idle, prerequisite (a) leaseLifetimeNs @6 timeout
auto-revoke, prerequisite (b) the SchedulingPolicyCap.snapshot @2
saturation observation surface, and prerequisite (c) pool-grant authority now
mint + revoke complete (the manifest-declared multi-account
cpuIsolationPoolGrants @15 table, runtime grant minting through
CpuIsolationGrantMinter, and the grant-revocation lifecycle that cascades to
minted leases). Fairness lease termination (Track D) and a measured
suppressed-tick proof have also landed, as have network-poll and IRQ-affinity
housekeeping routing, kernel-side generic full-nohz admission for ordinary
budgeted compute threads, and generic SQPOLL nohz admission for explicitly
leased caller-thread rings. What the name “auto nohz” still oversells today:
there is no production policy service, and broader userspace-poller/device-queue
issuance remains future work. Each remaining slice below closes one of those.
Conflict-domain note: every kernel slice here shares
resource:scheduler-cpu-isolation and writes kernel/src/cap/cpu_isolation*
or kernel/src/sched.rs, so they serialize against each other – dispatch
the chain head first; the rest convert from this list into
docs/tasks/ records as their depends_on closes. Slices marked
ready have a task record under docs/tasks/; the rest stay here
until their prerequisite lands.
Next increment (decomposed 2026-06-04 00:18 UTC; updated 2026-06-07 after
generic SQPOLL nohz landed): Track C, Track D, and the measured suppressed-tick
proof are all landed, and the ordinary-thread and SQPOLL-ring kernel admission
leaves are now done.
Records under docs/tasks/ capture:
scheduler-cpu-isolation-lease-renewal-on-reobservation (renewal residual),
scheduler-nohz-irq-affinity-housekeeping-routing,
scheduler-nohz-network-poll-housekeeping-routing,
scheduler-deadline-driven-budget-accounting, and
scheduler-generic-full-nohz-arbitrary-threads as done. The remaining
operator-driven AutoNoHz capstone is the policy service.
These scheduler CPU-isolation slices serialize against each other on
resource:scheduler-cpu-isolation but are parallel-safe against the in-flight
Phase C network-stack lane, so the scheduler lane stays runnable whenever Phase
C 7c holds the kernel cap/ surface.
Track C – complete operator grant authority (prerequisite (c) residual):
-
scheduler-cpu-isolation-runtime-grant-minting– behavior, normal, LANDED 2026-06-02 22:24 UTC. One cap (CpuIsolationGrantMinter) mints a freshCpuIsolationPoolGrantfor an operator-chosen(account, pool)at call time, bounded by the declaredSystemConfig.cpuIsolationGrantMinterAllowlist(an out-of-allowlist pair is refusedunauthorized), instead of only the boot-seeded table. The minted grant reuses the same unforgeablecreateLeaseadmission path. Proofmake run-scheduler-cpu-isolation-pool-grant. depends_on: manifest-multi-account grant table (landed). -
scheduler-cpu-isolation-grant-revocation-lifecycle– behavior, normal, LANDED 2026-06-03 17:11 UTC.CpuIsolationGrantMinter.revokeGrantrevokes a runtime-minted grant by advancing its(grantId, generation)so latercreateLeasethrough the stale handle failsstaleGenerationand mints nothing; revocation cascades to every live lease minted through that grant, driving the landed fairness-termination cleanup (reason=grant-revoked, periodic-tick rollback, registry unregister) once per tagged lease so per-pool/per-account capacity frees immediately (a fresh grant’s lease is admitted into the reclaimed slot in the proof). Double-revoke isalreadyRevoked, unknowngrantIdisunknownGrant, seeded grants stay un-revocable. Closes Track C. Proofmake run-scheduler-cpu-isolation-pool-grant. depends_on:scheduler-cpu-isolation-runtime-grant-minting(landed),scheduler-cpu-isolation-priority-aware-lease-termination(landed).
Track D – fairness preemption (proposal fairness_preemption):
-
scheduler-cpu-isolation-priority-aware-lease-termination– behavior, normal, LANDED 2026-06-02 21:17 UTC. On arrival of an equal-or-higher policy-priority runnable on the leased CPU when no other CPU authorized by both the admitted pool and the leaseallowedCpuMaskis eligible, the kernel now terminates (revokes) the lease itself at the existing nohz rollback site (fairness-preempted ... result=lease-terminated), not just restores the periodic tick, bounded bymaxRevocationLatencyNs. The recheck compares the static WFQ policy priority (latency_class,weight) of the arriving entity against the captured leased thread; a strictly-lower arrival or an eligible sibling CPU inside both masks keeps the existing tick-restore-only behavior. The termination runs the same generation-advancing cleanupleaseLifetimeNsexpiry uses (reason=fairness-preempted) immediately after the scheduler restores the periodic tick, so a subsequentinfo/revokereportsstaleGenerationand placement/account capacity is freed without waiting for the holder’s next cap call. Proven inmake run-scheduler-cpu-isolation-lease(default pool0withallowedCpuMask=0x01: an equal-priority sibling terminates and capacity is reclaimed, a strictly-lower sibling restores only). Out: no re-placement onto an eligible sibling CPU (the “no sibling eligible” condition is recorded; actual migration is generic-full-nohz work). depends_on: auto-nohz-activation (landed).
Lease lifetime renewal (proposal lifetime_ns renewal residual):
-
scheduler-cpu-isolation-lease-renewal-on-reobservation– behavior, normal, landed.CpuIsolationLease.renew @4pushesexpires_at_nsforward tonow + leaseLifetimeNs(clamped to the same one-hour ceilingread_specenforces), keeping the same(leaseId, generation), accounting binding, and nohz activation state. Callable only before expiry: a revoked, auto-revoked, or past-deadline lease stays stale (staleGeneration) and is not resurrected, and an unboundedleaseLifetimeNs = 0(or factory) lease reportsnotRenewable. The renewed deadline is propagated to a tickless CPU’s nohz activation record (renew_nohz_lifetime_deadline_for_lease) so thelease-lifetime-expireddisqualifier no longer rolls it back at the old deadline.CpuIsolationLeaseInfo.expiresAtNsechoes the deadline read-only. The kernel primitive the policy service uses to renew an auto-issued lease by re-observing the saturation signal; the re-observation heuristic itself stays Phase H policy-service work. Proofmake run-scheduler-cpu-isolation-lease. depends_on: timeout-auto-revoke (landed).
Honesty / telemetry (proposal Telemetry ticks_suppressed{cpu,mode}):
-
scheduler-cpu-isolation-measured-suppressed-tick-proof– harness-hardening, normal, LANDED 2026-06-02 19:53 UTC (docs/tasks/done/2026-06-02/scheduler-cpu-isolation-measured-suppressed-tick-proof.md). A kernel expected-vs-actual periodic-tick counter (account_timer_fire, counted only when no tick-suppression bit is set) over a bounded nohz window is asserted inmake run-scheduler-cpu-isolation-lease(cpu-isolation: nohz suppressed-ticks ...plus arestored-rateline), so the proof shows the periodic tick actually stopped firing, not only that the mask write was issued and the CPU made progress. Closed the review-identified honesty gap. A durableticks_suppressed{cpu,mode}telemetry field on a monitoring/status surface remains future work. depends_on: auto-nohz-activation (landed).
Step 7 – network poll housekeeping/deadline routing:
-
scheduler-nohz-network-poll-housekeeping-routing– behavior, normal, landed 2026-06-04 04:48 UTC. The in-kernel virtio-net poll (virtio::poll_scheduler) now routes off a lease-isolated (tickless) CPU: it consultssched::current_cpu_lease_nohz_active()and skips, emitting a boundedcpu-isolation: network-poll routed ... result=skipped-on-isolated-cpurecord, while the always-ticking housekeeping CPU the admission requires keeps the poll progressing. Thenetwork_pollingadmission gate flips from the hardrejected-periodic-network-polling-not-routed-to-housekeepingrefusal to a housekeeping-conditionedrouted-periodic-network-polling-to-housekeeping-cpuadmit (eligibility accepts therouted-prefix), and fails closed (rejected-network-polling-no-housekeeping-cpu-to-relocate) when no housekeeping CPU exists. The admittednamed_ring=Nonelease carries the routed label tick-suppressed; theCallerThreadcompute-with-ring lease’s network refusal is removed but it staysForcedPeriodicbecause IRQ affinity routing is the separate slice below. Proofmake run-scheduler-cpu-isolation-lease; regressionmake run-net. depends_on: housekeeping-deferred-work-placement (landed), auto-nohz-activation (landed). -
scheduler-nohz-irq-affinity-housekeeping-routing– behavior, normal, landed (docs/tasks/done/2026-06-04/). The activation path reroutes an opting-in leased CPU’s legacy IO-APIC redirection-entry destinations onto the selected housekeeping CPU (mask-before-reprogram + read-back, restored on rollback/revoke) before admitting tick suppression, and keeps the conservativerejected-irq-affinity-not-routed-to-housekeepingrefusal for a ring-coupled IRQ dependency that cannot be safely rerouted. Proofmake run-scheduler-cpu-isolation-lease(irq-affinity ok ... routed_admitted=true restored_on_revoke=true residual_forced_periodic=true); DDFrun-interrupt-grant/run-devicemmio-grantstay green. Scoped to a quiescent housekeeping destination: under the in-kernel KVM irqchip, reprogramming an IO-APIC redirection-entry destination onto an actively-scheduling CPU stalls that CPU’s forward progress, so the live reroute is gated to a focused proof lease (reroute sentinelmaxRevocationLatencyNs) whose destination is idle. A general busy-destination reroute remains future work behind a destination-quiescence gate or a non-KVM-irqchip delivery backend. depends_on: auto-nohz-activation (landed).
Step 14 – generic SQPOLL nohz for arbitrary rings:
-
scheduler-generic-sqpoll-nohz-arbitrary-rings– behavior, normal, done 2026-06-07. The SQPOLL nohz state machine now admits explicitly leased caller-thread rings when the SQPOLL worker is live, the ring is running/sleeping with a non-stale owner, exactly one SQ consumer is present, and producer wake/deadline rollback are bounded. The focusedmake run-scheduler-generic-sqpoll-nohzproof drives eligible entry, producer wake, SQPOLL service, rollback, and stale-owner rejection. BroaderAutoUserspacePolleruserspace-poller/device-queue issuance remains future policy-service work. depends_on: auto-nohz-sqpoll (landed),scheduler-nohz-network-poll-housekeeping-routing.
Generic full-nohz for arbitrary threads (the kernel half of “auto”):
-
scheduler-generic-full-nohz-arbitrary-threads– behavior, normal, done 2026-06-06. Ordinary budgeted compute threads can now enter full-nohz through an explicitSchedulingContext-targetedCpuIsolationLeasewhen the single-runnable, budget-deadline, housekeeping, network-poll, IRQ-affinity, timer, lifetime, and rollback gates all pass. Missing thread budget, multiple runnable work, revoked or expired leases, unrouted dependencies, and no-housekeeping cases still fail closed. Issuance is still policy-service future work; this is only the kernel admission half. depends_on:scheduler-cpu-isolation-priority-aware-lease-termination,scheduler-nohz-network-poll-housekeeping-routing,scheduler-nohz-irq-affinity-housekeeping-routing.
Step 17 – user-space AutoNoHz policy service (capstone):
-
scheduler-autonohz-policy-service-saturation-local-proof– behavior, normal, done 2026-06-07. A userspace AutoNoHz policy-service smoke now holds an operator-declaredCpuIsolationPoolGrant, consumesSchedulingPolicyCap.snapshot @2runtime / runnable / voluntary-block / preemption counters, denies a voluntarily blocking worker, issues a bounded full-nohz lease only after a local saturation window, renews only after re-observing saturation, and proves stopped-renewal expiry leaves fallback periodic scheduling intact. The proof records the grant-stamped account/pool and the single allowed CPU mask that the kernel admitted. depends_on:scheduler-cpu-isolation-runtime-grant-minting,scheduler-cpu-isolation-lease-renewal-on-reobservation,scheduler-cpu-isolation-priority-aware-lease-termination. -
scheduler-autonohz-production-policy-daemon– behavior, normal, blocked. Replace the local smoke’s fixed single-process proof with a privileged reusable policy daemon: profile-driven smoothing/window selection, cross-process target discovery, operator policy plumbing, structured observability, and revocation/non-renewal decisions for multiple accounts and pools. The landed local proof keeps this future work replaceable without ABI churn. depends_on:scheduler-autonohz-policy-service-saturation-local-proof.
Independent hardening (makes auto-nohz budget-safe):
-
scheduler-deadline-driven-budget-accounting– behavior, normal, done 2026-06-04. ChargeSchedulingContextbudget at monotonic-deadline granularity rather than per-periodic-tick so an auto-nohz thread cannot overshoot its budget by a full tick quantum while the tick is masked. Closes the “enforcement remains periodic-tick granularity” caveat that auto-nohz made load-bearing; the task ledger isdocs/tasks/done/2026-06-04/scheduler-deadline-driven-budget-accounting.md. depends_on: Phase E budget enforcement (landed),scheduler-lapic-oneshot-subtick-firing-precision(done),scheduler-monotonic-clocksource-subtick-discipline(done).
Cleanup: Retire Benchmark-Driven Scaffolding Before Phase E
This section captures simplification work identified during the post-thread-scale
SMP/threading architecture review on 2026-05-01 23:20 EEST. None of these items
are regressions: the affected code is correct, gated behind the measure
feature where it should be, and was added intentionally during attribution and
placement slices that closed the In-Process Threading Scalability milestone.
They are recorded here so the next selected scheduler milestone does not extend
or formalize speculative SMP scaffolding that the current per-CPU WFQ scheduler
does not need.
The cleanup is subordinate to the current selected milestone and to
already-open review-finding task records. Pick it up as Phase E preflight work
before SchedulingContext claims the scheduler surface. Each removal must
preserve the documented runnable-ownership invariants from
docs/architecture/scheduling.md (single dispatch owner per live ThreadRef
across per-CPU current/handoff_current slots, the per-CPU WFQ run queues,
and the direct IPC target; scheduler-lock-contained migration; allocation-free
timer/unblock/direct-IPC-fallback/requeue/steal-requeue paths) and the recorded
benchmark-only counter policy. The 2026-05-02 per-CPU run-queue collapse and
the accepted 2026-05-10 Phase D WFQ reintroduction are now both historical
evidence: the single-global-queue shape had accepted 1-to-2 evidence but a
1-to-4 diagnostic gap (capOS 1.566x/1.538x vs Linux 3.963x/3.858x),
and Phase D manually accepted the 2026-05-10 per-CPU WFQ 1-to-4 diagnostic
(capOS 3.088x/2.700x; matching Linux 3.974x/3.850x on the same pin
set) after the harness-enforced 1-to-2 gates stayed green.
Grounding read before any slice:
docs/architecture/scheduling.mddocs/proposals/scheduler-evolution-proposal.mddocs/proposals/smp-proposal.mddocs/backlog/smp-phase-c.mdkernel/src/sched.rskernel/src/process.rskernel/src/measure.rskernel/src/arch/x86_64/{smp.rs,lapic.rs,percpu.rs,tlb.rs}
Acceptance rule for every slice below: each removal must land with a host or QEMU test that fails without it, so a future reintroduction is explicit authority work rather than silent regression of an undocumented feature.
-
2026-05-02 08:07 UTC: Retired the timer continuation fast path, its per-CPU skip budget, and the slow-path-required mirror flags. Deleted
try_continue_current_on_timer_tick,mark_timer_slow_path_required,reset_current_cpu_timer_fast_path_skip_count,note_timer_slow_path_completed_locked(both feature variants),scheduler_has_hard_timer_slow_path_work_locked_excluding_endpoint_queue,scheduler_timer_slow_path_reasons_locked, theTimerBlockedWaiterKind/blocked_thread_*helpers, and the four atomic mirrorsTIMER_SLOW_PATH_REQUIRED,TIMER_FAST_PATH_SKIP_COUNTS,CURRENT_NON_IDLE_CPUS, andTIMER_FAST_PATH_MAX_CONSECUTIVE_SKIPS.set_current_thread_lockedno longer publishesCURRENT_NON_IDLE_CPUS. The timer interrupt entry inkernel/src/arch/x86_64/context.rsnow always callscrate::sched::schedule(context)instead of trying the lock-free fast path. Eightmark_timer_slow_path_required()call sites inkernel/src/sched.rs(run-queue publish, pending process drop, park-with-deadline, process termination queue, direct-IPC handoff, timer sleep enqueue, cap-enter-with-deadline, pending thread stack release, pending endpoint cancellation push) also dropped — they are no-ops once the fast path no longer exists. Verified thatmake run-spawnexits cleanly ([init] Spawn cap-table exhaustion check ok.,proc: process 2 exited with code 0,sched: last process exited, halting) andmake run-smokeruns the scripted login flow to operator session.cargo build --features qemuis warning-free (project rule). Reintroduce the fast path only if a future Phase D or Phase F slice ships an evidence pair where it measurably reduces scheduler-lock hold time on a contended SMP run.Follow-up partial 2026-05-02 08:39 UTC: `kernel/src/measure.rs` lost the eight public API entry points (`timer_fast_path_attempt`, `timer_fast_path_continue`, `timer_fast_path_slow_required_fallback`, `timer_fast_path_skip_budget_fallback`, `timer_fast_path_pending_reschedule_fallback`, `timer_fast_path_no_current_non_idle_fallback`, `timer_fast_path_inactive_invalid_cpu_fallback`, and `timer_slow_summary`) plus the now-orphaned `TimerSlowSummaryReasons` struct and its `requires_slow_path` impl. `cargo build --features qemu,measure` is back to warning-free. Follow-up complete 2026-05-02 21:00 UTC: the deeper deletion slice removed the seven `TIMER_FAST_PATH_*` static counters, the `TimerCounter::FastPath*` enum variants, the `TimerSlowSummaryCounter` enum, the `TIMER_SLOW_SUMMARY_*` counter arrays (`TIMER_SLOW_SUMMARY_COUNTER_VALUES`, `CASE_START_TIMER_SLOW_SUMMARY_COUNTERS`, `PREVIOUS_TIMER_SLOW_SUMMARY_COUNTERS`, `PHASE_TIMER_SLOW_SUMMARY_COUNTERS`), the `(TimerSlowSummaryCounter, &str)` reporting table, the `Snapshot.timer_slow_summary_counters` field, and the matching reset/diff/print helpers and accessors. `TIMER_COUNTER_COUNT` shrank from 11 to 4 (interrupts, user_scheduler, kernel_only, bsp_tick_advances). The `measure: timer ...` line is now compact and the `measure: timer_slow_summary ...` line is no longer emitted at all. `tools/qemu-thread-scale-harness.sh` dropped the `fast_path_*` clauses and the `timer_slow_summary` aggregate / per-phase grep checks in the same slice, satisfying the "removal must land with a host or QEMU test that fails without it" acceptance rule. Verified with `make fmt-check`, `cargo build --features qemu` (warning-free), `cargo build --features qemu,measure` (warning-free), `cargo test-lib` (171 passed), `make run-spawn`, and `make run-measure` (proof line emitted, exit 0). A local one-iteration `CAPOS_THREAD_SCALE_RUNS=1 CAPOS_THREAD_SCALE_GUEST_MEASURE=1 make run-thread-scale` was used solely as functional verification of the harness parser against the new measure-output shape (no CPU pinning, single iteration; the run reported `qemu taskset cpus: none` and the resulting medians/speedups are diagnostic only). This slice is a measure-output cleanup, not a scheduler-structure change, so it does not require controlled benchmark-VM timing evidence under the Phase A "before/after each scheduler structure change" rule; the harness fail-without-the-kernel-change pairing is the acceptance gate. -
2026-05-01 22:01 UTC: Collapsed the asymmetric scheduler CPU sizing.
MAX_SCHEDULER_CPUS = 64was deleted,MAX_SCHEDULER_CLEANUP_CPUS = 4was renamed to a singleSCHEDULER_CPUS = 4, andSchedulerDispatch.current[]resized from 64 toSCHEDULER_CPUSto matchrun_queues,handoff_current,idle_pids,idle_threads,pending_thread_stack_release,TIMER_FAST_PATH_SKIP_COUNTS, andSCHEDULER_CPU_MASK. The dualcurrent_cpu_slot()/current_cleanup_slot()helpers collapsed into a singlecurrent_cpu_slot()that bounds-checks againstSCHEDULER_CPUSand panics on overflow with"scheduler: CPU id {} exceeds scheduler-owned mask".scheduler_cpu_slot(cpu_id) -> Option<usize>retained for the non-panicking lookup. The earlier “raw CPU id 0..63 vs scheduler slot 0..3” indexing distinction is gone. Reintroduce a wider id-to-slot mapping only when a Phase D/F slice grows the scheduler-owned mask beyond the current four. Verified withcargo build --features qemuandcargo build --features qemu,measure(both warning-free) plusmake run-smokeandmake run-spawnon 2026-05-01. -
2026-05-02 09:26 UTC: Replaced the per-CPU run-queue array with a single global
run_queue: VecDeque<ThreadRef>.SchedulerDispatchkeepsrun_queue_live_reservationsas a single counter; thereserve_run_queue_capacity_for_thread_locked/release_run_queue_capacity_reservations_locked/push_reserved_run_queue_lockedtriple still bounds growth but operates on the single queue.enqueue_ready_thread_on_cpu_locked,run_queue_target_cpu_locked, thecreated_thread_target_cpu_lockedplacement chain (active_ready_scheduler_cpu_mask,non_idle_dispatch_load_locked,least_loaded_scheduler_cpu_*,caller_current_scheduler_cpu_slot_locked), theCreatedThreadPublishPolicy/CreatedThreadTargettypes, thescheduler_cpu_scan_orderhelper, and thecrate::measure::thread_placement_publish_caller_*reporting surface are all gone.WakePolicy::QueueCpu(usize)collapsed toWakePolicy::QueueAny.wake_idle_scheduler_cpus_lockedwalks eligible idle scheduler CPUs and stops only after the first one that accepts a fresh reschedule IPI; CPUs that already have a pending IPI (or that fail LAPIC delivery) are skipped without breaking, so a burst of ready work cross-wakes more than one neighbor for both queue and direct-target wakes.publish_created_threadno longer takes acaller_threadargument and no longer emits a per-CPU placement record: under the single global queue there is no per-CPU publish target, and hard-coding CPU0 misclassified normal worker publishes as single-owner-CPU0. Phase D later reintroduced the per-CPU split without restoring those publish counters; reintroduce them only through a separate operator-observability slice.Verified with `cargo build --features qemu` and `cargo build --features qemu,measure` (both warning-free) plus `make run-spawn` and `make run-smoke`. A post-collapse 3-run diagnostic `make run-thread-scale` on the benchmark VM (`taskset 0,1,2,3`, enforcement disabled) on 2026-05-02 10:42 UTC measured 1-to-2 work/total `1.890x`/`1.792x` (slight improvement over the pre-collapse 1-to-2) and 1-to-4 work/total `1.504x`/`1.436x` (clear regression vs the pre-collapse 1-to-4): single-queue scheduler-lock contention dominates at 4 workers. The numbers live in `docs/benchmarks.md` as diagnostic. Phase D later brought per-CPU queues back with a fair-share enqueue policy and formal accepted evidence (capOS plus Linux baseline, full enforcement, multiple runs, recorded host caveats). -
2026-05-02 07:00 UTC: Lifted endpoint-cancellation retry storage out of the scheduler lock. The
pending_endpoint_cancellations: VecDequefield is gone fromScheduler; it now lives in a dedicatedstatic PENDING_ENDPOINT_CANCELLATIONS: Lazy<Mutex<VecDeque<...>>>with boundedtry_reserve_exact(MAX_PENDING_ENDPOINT_CANCELLATIONS)reservation, eagerly forced ininit_idleviaLazy::forceso the allocation never lands in a timer/exit cleanup path. The queue’slen()under its own mutex is the single source of truth forpending_endpoint_cancellationsnon-emptiness. Producers (queue_pending_endpoint_cancellation,remove_pending_endpoint_cancellations_for_pid,remove_pending_endpoint_cancellations_for_thread) and the drain (drain_pending_endpoint_cancellations) take only the queue mutex; the scheduler lock is acquired only briefly insidequeue_pending_endpoint_cancellationto validate the target thread is live and has a ring scratch.defer_endpoint_cancellationpreviously re-acquired the scheduler lock just to push to the fallback queue; that re-acquisition is gone.`note_timer_slow_path_completed_locked` (consumer) holds the queue mutex across both the `!is_empty()` check and the `TIMER_SLOW_PATH_REQUIRED.store`, and the producer `queue_pending_endpoint_cancellation` stores `TIMER_SLOW_PATH_REQUIRED = true` inside the queue lock alongside its push, so a concurrent producer cannot push between the consumer's read and store and have its slow-path mark be overwritten. The functional contract is preserved: a cancellation that cannot deliver immediately because the target ring scratch is contended still falls back to the bounded retry queue, still raises `TIMER_SLOW_PATH_REQUIRED`, and is still drained on the next scheduler tick. Bound is unchanged (`MAX_PENDING_ENDPOINT_CANCELLATIONS = MAX_CAP_SLOTS * MAX_ENDPOINT_CANCELLATION_OBJECT_SWEEPS * MAX_ENDPOINT_CANCEL_NOTIFICATIONS_PER_ENDPOINT * SCHEDULER_CPUS`); the open size-tightening question (whether the `SCHEDULER_CPUS` multiplier is still load-bearing now that producers no longer hold the scheduler lock) is deferred to a future slice with bench evidence. A possible follow-on slice would move retry storage to per-endpoint bounded slots so each endpoint object owns its own queue, but that requires reshaping the `(thread, user_data)` payload to be addressable from an endpoint object and is non-trivial. The current move is sufficient to get the storage out of the scheduler lock and unblock future scheduler-lock-hold-time analysis. Verified with `cargo build --features qemu` and `cargo build --features qemu,measure` (both warning-free) plus `make run-spawn` and `make run-smoke` on 2026-05-02. Review found and fixed a Lazy-init in interrupt paths and a slow-path-clearing race against producer publication. -
2026-05-01 21:38 UTC: Feature-gated the first
ThreadCpuAccountingexperiment end-to-end behindcfg(feature = "measure"). That slice temporarily compiled the whole accounting record, its accessors, and scheduler call sites only when the feature was enabled. Phase D later superseded this temporary shape:runtime_ns,virtual_runtime_ns, andlast_started_nsare now unconditional normal-build fields because WFQ ordering,SchedulingPolicyCap.snapshot, andSchedulingContextbudget charging depend on them. The remaining diagnostic counters (context_switches,preemptions,voluntary_blocks,migrations,last_cpu, blocked/exited stability observations, placement buckets, and per-phase attribution counters) stay behindcfg(feature = "measure"). The 2026-05-01 slice was verified withcargo build --features qemuandcargo build --features qemu,measure(both warning-free) plusmake run-spawn(non-measure default) on 2026-05-01.make run-measurewas broken onmainat the time of this slice for unrelated reasons; that regression was repaired on2026-05-02 20:23 UTC(seedocs/backlog/scheduler-evolution.mdand thedocs/changelog.mdMeasure Mode Repair entry). -
2026-05-01 21:02 UTC: Retired the
RUNNABLE_PROCESS_EXIT_CLEANUP_PROOF_PRINTED,RUNNABLE_THREAD_EXIT_CLEANUP_PROOF_PRINTED, andCPU_ACCOUNTING_PROOF_PRINTEDonce-flag log lines along with theirAtomic*gating booleans, the threeprint_*_once/maybe_print_*_for_thread_lockedhelpers inkernel/src/sched.rs, and their four call sites. The runnable-cleanup invariants remain enforced by the unconditionalassert_no_runnable_pid_entry_lockedandassert_no_runnable_thread_entry_lockedpanics already inkernel/src/sched.rs; a regression that leaves stale runnable owner state still panics the kernel and failsmake run-spawn. Thetools/qemu-spawn-smoke.shharness lost its three matchinggrep -Fqlines for the same reason. The orphanedProcess::account_thread_exited_stable_observed/ThreadCpuAccounting::observe_exited_stablehelpers were deleted with the print; the remainingThreadCpuAccountingwrites stay untouched for the upcoming feature-gate slice. Thepub fn thread_cpu_accountingaccessor moved behindcfg(feature = "measure")because its only remaining caller is the measure-gatedaccount_thread_selected_lockedplacement counter bridge. -
Cache the active CPU id in the per-CPU GS-relative slot.
arch::percpu::current_cpu_idreads the LAPIC ID MMIO register and then linearly scansCPU_LAPIC_IDS[0..64]on every call. The timer fast-path consumer was retired on 2026-05-02 (see the “Retired the timer continuation fast path” entry above), but the function still runs from the syscall path and from non-syscall kernel contexts:arch::context::advance_bsp_tick, the scheduler’s CPU-slot accounting and dispatch lookups insched.rs,arch::tlb::flush_pending_for_current_cpu, andmem::paginginvalidation paths. The hot caller is the syscall entry path; the non-syscall callers are why a drop-in GS-relative replacement is harder than the cleanup item first suggested. The single-movlookup conceptually wantsmov %gs:offset, %eax, but the slice is blocked on a kernel-mode GS-base invariant: today the kernel setsKernelGsBaseviaset_kernel_gs_baseand only the syscall assembly doesswapgsto makegs:0..16resolve at PerCpu while handling a syscall. In normal kernel context (timer ISR, scheduler from non-syscall paths, paging init, AP bring-up), the active GS base is whatever Limine left, not the PerCpu address. A drop-in replacement ofcurrent_cpu_idwithgs:[offset]therefore faults outside syscall context (verified 2026-05-02: reorderinginit_bspto setKernelGsBasebeforeset_kernel_entry_stackis necessary but not sufficient because the active GS base is still not the PerCpu address). The enabling work is establishing a kernel-mode invariant that GS_BASE = PerCpu in CPL0 (typically byswapgs-ing on every kernel entry/exit, including interrupt handlers), or by adopting a hybrid: GS-relative read in the syscall path plus the existing LAPIC-based path everywhere else. Both paths are larger than a single retirement slice and should land with their own gates. Until then this item stays open andcurrent_cpu_idkeeps the LAPIC MMIO +CPU_LAPIC_IDSscan. -
Reassess the scheduler-lock-site instrumentation breadth.
SchedulerLockSite, theSchedulerLockGuard/measured_lockwrappers, the dualcfg(feature = "measure")scheduler_lock/scheduler_lock_sitepaths, and the eight per-site counter axes inkernel/src/measure.rswere added when the global scheduler lock was the suspected scaling bottleneck. After the runqueue/dispatch split landed and the documented per-CPU ownership invariants stabilized, decide which sites still justify dedicated counters and which should fold back into the aggregatescheduler_lockline. Keep thecfg(feature = "measure")gating; reduce the surface so reading the scheduler still reads as one lock acquisition path under non-measure builds. -
Reassess
single_cpu_owner_pids,direct_ipc_target, andhandoff_currentbefore Phase E starts. The single-owner pinning policy, the one-slot direct-IPC handoff, and the per-CPU handoff guard each special-case a small subset of the dispatch flow; document or delete each one against the accepted Phase D fair-policy behavior beforeSchedulingContextwork depends on it. Do not delete them speculatively: the cross-process IPC and process/thread exit cleanup proofs depend on the current direct-IPC and handoff invariants. -
Keep an honest scaling proof when scheduler work resumes. Completed
2026-05-02 21:38 UTCon the benchmark VM againstmaincommit374f8556. Five-run controlled paired evidence, both runs pinned to physical-core logical CPUs0,1,2,3on a 4-core/8-threadn2-highcpu-8host with KVM:| Comparison | capOS | Linux pthread | capOS gate | capOS verdict | | --- | ---: | ---: | ---: | --- | | 1→2 work | `1.883x` | `1.988x` | ≥ `1.6x` | accepted | | 1→2 total | `1.787x` | `1.987x` | ≥ `1.6x` | accepted | | 1→4 work | `1.566x` | `3.963x` | ≥ `1.6x` | diagnostic | | 1→4 total | `1.538x` | `3.858x` | ≥ `1.6x` | diagnostic | Linux scales near-linearly on the same physical CPU set (1-to-2 `1.99x`, 1-to-4 `3.96x`), so the workload shape is sound and the capOS 1-to-4 gap is a scheduler bottleneck, not a benchmark artifact. The 1-to-2 result was the formal accepted gate against the single-global-queue scheduler. The 1-to-4 result became the bottleneck-attribution diagnostic that justified Phase D's fair-share enqueue policy; Phase D later manually accepted the `2026-05-10` WFQ 1-to-4 diagnostic pair recorded above while the harness-enforced gates remained the 1-to-2 work/total speedups. Benchmark shape: blocking parent join, 262,144 blocks (16 MiB), `work_rounds=64`, 5 runs per case (the capOS harness default is 3 runs; this collection explicitly set `CAPOS_THREAD_SCALE_RUNS=5` for parity with the Linux baseline default). Host caveats: internal benchmark VM in a single GCP zone, status `RUNNING` during collection, machine `n2-highcpu-8` with nested virtualization enabled, `/dev/kvm` readable+writable without sudo, SSH operator account, kernel `Linux 6.17.0-1012-gcp x86_64`, CPU `Intel(R) Xeon(R) CPU @ 2.80GHz`, distinct physical-core layout (logical CPUs 0-3 are core IDs 0-3 thread 0; logical CPUs 4-7 are the SMT siblings), `qemu-system-x86_64 8.2.2`, `rustc 1.97.0-nightly (c935696dd 2026-04-29)`. Exact commands: ```sh # capOS PATH="$HOME/.cargo/bin:$PATH" \ CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \ CAPOS_THREAD_SCALE_RUNS=5 \ CAPOS_THREAD_SCALE_REQUIRE_SPEEDUP=1 \ CAPOS_THREAD_SCALE_REQUIRE_TOTAL_SPEEDUP=1 \ CAPOS_THREAD_SCALE_TIMESTAMP=20260502T213544Z \ make run-thread-scale # Linux pthread baseline PATH="$HOME/.cargo/bin:$PATH" \ LINUX_THREAD_SCALE_TASKSET_CPUS=0,1,2,3 \ LINUX_THREAD_SCALE_RUNS=5 \ LINUX_THREAD_SCALE_TIMESTAMP=20260502T213445Z \ make run-linux-thread-scale-baseline ``` Raw artifacts on the benchmark VM at `target/thread-scale/20260502T213544Z/` and `target/linux-thread-scale/20260502T213445Z/`. The instance was stopped after collection.
Research And Design Gaps Backlog
This file tracks important OS design, development, and user-story areas that
are absent, thinly covered, or only indirectly owned by existing capOS
proposals. It is a triage register, not an execution queue. Listing a gap here
does not change the selected milestone in docs/tasks/state.toml and does not
mean the project should immediately create a full proposal.
Promote an entry out of this file only when a visible milestone, paper evidence gap, review finding, or explicit user direction makes the area actionable. Promotion targets:
docs/research/for prior-art survey or external precedent.docs/proposals/for a concrete reviewed design direction.- A focused
docs/backlog/file when the design is accepted enough to decompose implementation. docs/design-risks-register.mdwhen the gap is an active architectural risk with an owner.
Status Vocabulary
- Uncovered: no owned design exists yet.
- Thin: mentioned indirectly, but no coherent owner or decision record.
- Backlog-only: task decomposition exists without a full proposal.
- Research-needed: design should not start before prior-art review.
- Ready-for-proposal: enough constraints exist to draft a proposal.
- Deferred: intentionally future work, not a near-term blocker.
- Rejected: considered and explicitly not pursued.
Promotion Checklist
Before creating a proposal from an entry here:
- Identify the visible user or operator outcome the work would enable.
- List existing capOS docs that already partially cover the area.
- List the
docs/research/files actually read, or explain why no research file applies. - Decide the first capability boundary or trust boundary that must be designed.
- Define one QEMU, host-test, documentation, or review gate that would prove the proposal made progress.
Display, GUI, And Input
Status: Uncovered.
User story: a user boots capOS on a desktop, laptop, or remote graphical session and uses multiple graphical apps with keyboard, pointer, clipboard, accessibility, and app isolation.
Current coverage: browser, browser/WASM, agent, GPU, and shell proposals point toward future visual sessions, but there is no native display server, compositor, input routing, window authority, clipboard, screenshot, or accessibility model.
Missing decisions:
- Display ownership and framebuffer/GPU authority.
- Compositor trust boundary and per-window capability model.
- Keyboard, pointer, touch, IME, and focus authority.
- Clipboard and drag/drop data-transfer policy.
- Screen capture and remote desktop authority.
- Accessibility-service authority and privacy boundaries.
Research needed:
- Genode GUI/session routing and report-ROM style composition.
- Wayland compositor security model and clipboard limitations.
- Fuchsia Scenic/input pipeline if the native GUI track becomes near-term.
- seL4/CapROS precedents for trusted path or secure attention, if applicable.
Promote when: native graphical sessions, browser UI, desktop app isolation, or rich web/agent interaction becomes a selected milestone.
Driver Framework And Hotplug
Status: Thin.
User story: an operator plugs in a device, capOS identifies it, starts or restarts the correct isolated driver, and exposes only the intended typed capabilities.
Current coverage: docs/dma-isolation-design.md,
docs/backlog/hardware-boot-storage.md, networking, storage, cloud, and GPU
proposals cover pieces of device work. There is no general driver framework
for discovery, binding, isolation, recovery, firmware, or hotplug.
Missing decisions:
- Device discovery authority and driver matching policy.
- Driver process lifecycle, crash restart, and stale handle behavior.
- Firmware loading and firmware provenance.
- Hotplug attach/detach semantics.
- Interrupt, MMIO, DMA, and power authority handoff.
- User-space driver SDK boundaries and test harnesses.
Research needed:
- Genode driver components and session routing.
- Zircon/Fuchsia driver framework concepts.
- Linux VFIO/uio and userspace-driver isolation tradeoffs.
- seL4 device-driver partitioning examples.
Promote when: userspace NIC, block-device, USB, GPU, or real hardware bring-up requires reusable driver lifecycle rules.
Power, Suspend, Resume, And Thermal Policy
Status: Uncovered.
User story: a laptop or VM can sleep, wake, preserve sessions, and report power or thermal limits without leaking stale authority or corrupting timers.
Current coverage: tickless scheduling covers timer cleanup and idle mechanics, but not power management as an OS product area.
Missing decisions:
- Suspend/resume authority and system-wide quiesce protocol.
- Wake-source capabilities and audit.
- Battery, charger, lid, and thermal sensor surfaces.
- CPU frequency, C-state, and thermal-throttling policy.
- Timer and network behavior across sleep.
- Session and service liveness after resume.
Research needed:
- ACPI power-state model and Linux suspend blockers/wakeup sources.
- Fuchsia power framework if relevant.
- Genode power-management patterns for component systems.
Promote when: laptop hardware, cloud hibernation, low-power idle, or interactive remote-shell reliability needs sleep/resume semantics.
Time, Clock, And Trusted Timestamp Services
Status: Promoted to proposal (2026-05-22). See Time and Clock Authority and the prior-art note Time and Clock Authority research. Residual research (servo/loop-filter, holdover/error-bound, suspend recovery) is noted in that proposal. Original gap status was Thin.
User story: services can distinguish monotonic time, wall-clock time, and trusted audit time, and cannot silently forge system time.
Current coverage: scheduler and tickless proposals mention clocks, timers, deadlines, and clocksource/clockevent split. There is no user-facing time authority model.
Missing decisions:
- Monotonic, boot, realtime, and coarse clock capability surfaces.
- Who can set wall-clock time and how changes are audited.
- NTP/PTP/cloud-metadata time synchronization authority.
- Timezone and locale data ownership.
- Leap-second and clock-step behavior.
- Timestamp trust level carried into audit records.
Research needed:
- Linux clock ids, adjtimex/NTP discipline, and time namespaces.
- Fuchsia clock objects and UTC maintenance.
- Cloud metadata time and attestation interactions.
Promote when: audit log completion, TLS certificate validation, distributed services, or durable storage needs trusted timestamp semantics.
Software Installation, Packages, And Rollback
Status: Thin.
User story: a user or operator installs an app, inspects requested authority, updates it, rolls it back, and removes its state without ambient filesystem assumptions.
Current coverage: repository composition, storage/naming, userspace binaries, live upgrade, cloud deployment, and public-release proposals cover adjacent pieces. There is no package/app distribution model.
Missing decisions:
- Package manifest schema and authority-request review.
- Signed repositories, update channels, and revocation.
- Dependency resolution and build provenance.
- App install/remove lifecycle and state ownership.
- Rollback, staged rollout, and compatibility policy.
- Vulnerability advisory and emergency update workflow.
Research needed:
- Nix/Guix, OSTree, Flatpak portals, Android package permissions, and Fuchsia package/update system.
- Supply-chain signing systems such as TUF/in-toto/Sigstore if this becomes release-critical.
Promote when: capOS needs installable demos, sibling repositories, public release, or cloud image update flow.
Crash Recovery, Supervision, And Diagnostics
Status: Promoted to proposal (2026-05-22). See Crash Recovery and Supervision and the prior-art note Crash Recovery and Supervision research. Residual research (Fuchsia component-manager escrow semantics) is noted in that proposal. Original gap status was Thin.
User story: a service crashes; init or an authorized supervisor restarts it or enters a known degraded mode without leaking authority, hiding the cause, or looping forever.
Current coverage: service architecture already sketches SpawnRequest restart
policy, supervisor-owned respawn, and always/on-failure restart modes;
capos-service covers service lifecycle pieces; live-upgrade planning ties
fault containment to supervisor respawn; and system monitoring covers
logs/metrics/crash records at a high level. Crash-loop budgets, core/minidump
capture, degraded-mode semantics, watchdog policy, and stale/in-flight cleanup
are still not owned as one recovery design.
Missing decisions:
- Restart policy authority and failure budget.
- Crash-loop backoff and operator override.
- Core dump or minidump capture with capability redaction.
- Watchdog and health-check capabilities.
- Degraded boot and emergency shell semantics.
- Stale capabilities and in-flight calls after service death.
Research needed:
- Erlang/OTP supervision trees, systemd restart policy, Kubernetes probes, and Fuchsia component lifecycle.
- Capability-system precedent for crash propagation and service replacement.
Promote when: shared services, remote shell, storage, or agent workloads need production-grade recovery behavior.
Backup, Restore, Snapshots, And Migration
Status: Thin.
User story: an operator loses a disk or VM and restores users, services, keys, and app state while avoiding stale authority and accidental data disclosure.
Current coverage: storage/naming, cloud deployment, and the hardware/boot/ storage backlog already cover narrower pieces: user-owned encrypted save transport, fake Drive/Firebase restore rejection tests, rollback/stale handling, and cloud-backed snapshot material. System-wide disaster recovery for users, services, keys, machine identity, and authority state is still not owned as one design.
Missing decisions:
- Snapshot capability boundary and consistency protocol.
- Encrypted export/import and restore identity.
- Key recovery and disaster recovery drills.
- Partial restore and per-service state ownership.
- Backup retention, deletion, and privacy policy.
- Migration between machines or cloud instances.
Research needed:
- ZFS/Btrfs snapshot semantics, Borg/Restic encrypted backup models, and cloud snapshot/key-management practices.
- Capability-specific concerns from EROS/CapROS persistence if applicable.
Promote when: writable storage, durable local accounts, volume encryption, or cloud deployment becomes near-term.
Human-Facing Administration And Explainability
Status: Thin.
User story: an operator can answer who has access to a service, why, since when, what will happen if access is revoked, and why a request was denied.
Current coverage: shell, system info, local users, system monitoring, configuration, and security proposals cover pieces. There is no unified administrator UX or policy explainability track.
Missing decisions:
- Account and role management commands or UI.
- Grant inspection, diff, revoke, and dry-run behavior.
- Denial explanation format across kernel, broker, and services.
- Audit search and incident timeline views.
- Diagnostics bundle generation and redaction.
- Safe repair workflow for broken configuration or policy.
Research needed:
- Kubernetes RBAC
can-i/audit practices. - Cloud IAM policy simulators and access-analyzer tools.
- Genode configuration/reporting UX for component graphs.
Promote when: local users, ABAC/MAC, remote shell, or operator configuration needs day-2 administration rather than proof-only commands.
Developer Debugging, Profiling, And Tooling
Status: Partially promoted to proposal (2026-05-22). The debug/trace/profile authority slice is now Debug and Trace Authority with the prior-art note Debug, Trace, and Profiling Authority research. The broader developer-tooling surface (service templates, local SDK, schema explorer, request-replay) remains Thin and is not yet owned by a proposal.
User story: a developer writes a capOS service, runs it locally, debugs a failed capability call, profiles it, and ships it with reproducible evidence.
Current coverage: harness engineering, benchmarks, generated-code checks, run-targets, and the paper evidence track cover pieces. There is no full debugger/profiler/developer-tooling proposal.
Missing decisions:
- Debug authority and process attach policy.
- Symbols, stack traces, crash dumps, and source maps.
- Ring/syscall/capability-call tracing.
- Service schema explorer and request replay tooling.
- Guest profiling, flamegraph, and benchmark attribution workflow.
- App/service templates and local developer SDK.
Research needed:
- GDB remote protocol, Linux
perf/eBPF-style tracing boundaries, Fuchsia diagnostics, and seL4 debug authority practices.
Promote when: non-trivial third-party services, public release, or performance claims need repeatable developer workflows.
Compatibility And App Porting Strategy
Status: Thin.
User story: a developer ports a small existing CLI or server to capOS and knows which Unix assumptions work, fail, or require explicit capability adapters.
Current coverage: userspace binaries, Go, Lua, POSIX adapters, WASI, C/C++, and language-runtime proposals mention porting targets. There is no concrete compatibility profile matrix.
Missing decisions:
- Minimal libc/POSIX surface and unsupported-call policy.
- Filesystem, environment, argv, signal, pipe, socket, and process semantics.
- Dynamic linking and shared-library policy.
- WASI adapter authority model.
- Build recipes and package corpus selection.
- Porting report template and acceptance tests.
Research needed:
- WASI preview models, CloudABI history, Redox, Hermit, Fuchsia POSIX layer, and Genode libc/VFS integration.
Promote when: a language runtime, POSIX adapter, or real application corpus becomes a selected milestone.
Accessibility And Internationalization
Status: Uncovered.
User story: non-English users and assistive-technology users can operate capOS shells, graphical sessions, and web/agent surfaces without privileged workarounds.
Current coverage: none beyond general shell/browser surface discussions.
Missing decisions:
- Unicode, locale, collation, and timezone data ownership.
- Input methods and keyboard layout authority.
- Screen reader or accessibility tree service boundary.
- High-contrast, font scaling, and reduced-motion policy.
- Translation and message-catalog strategy.
- Accessible denial/audit messages and setup flow.
Research needed:
- Web accessibility platform architecture, Wayland accessibility status, Fuchsia accessibility manager, and terminal accessibility conventions.
Promote when: graphical sessions, public demos, web shell, or production interactive setup becomes user-facing beyond developer/operator proof flows.
Fleet Operations And Remote Management
Status: Thin.
User story: an operator manages many capOS nodes and can prove which version, policy, keys, services, and update state each node is running.
Current coverage: cloud deployment, cloud metadata, system monitoring, configuration, hosted agents, and public release cover adjacent concerns. There is no fleet-management design.
Missing decisions:
- Node enrollment and identity bootstrap.
- Remote attestation and inventory reporting.
- Configuration rollout and drift detection.
- Remote logs, metrics, and audit aggregation.
- Staged update and rollback policy.
- Break-glass access and emergency revocation.
Research needed:
- Kubernetes node/bootstrap models, cloud instance identity, SPIFFE/SPIRE, TPM/measured boot attestation, and OSQuery-style inventory.
Promote when: cloud deployment, hosted agent swarms, public release, or remote administration becomes more than a single-node proof.
Privacy And Data Governance
Status: Thin.
User story: a user can see and revoke what data a service can access, and deleted data does not unintentionally persist in logs, backups, or derived indexes.
Current coverage: capability authority, session privacy, audit redaction, identity policy, storage, monitoring, and browser/agent proposals cover parts of the problem. There is no explicit data-governance design.
Missing decisions:
- Data classification and purpose-bound access metadata.
- Retention, deletion, and legal-hold semantics.
- Derived data, indexes, caches, and backup deletion behavior.
- User consent and service data export.
- Audit redaction versus forensic retention.
- Cross-service data-sharing policy and review UX.
Research needed:
- Object-capability privacy patterns, GDPR-style data lifecycle controls, browser permission UX, and cloud DLP/data catalog practices.
Promote when: persistent user data, browser/agent activity, hosted services, or public release introduces real privacy expectations.
Security And Verification Backlog
Detailed decompositions for security and verification work. docs/tasks/README.md
links here but should not inline these subtasks.
Stage-6 Trust-Boundary Refresh
- Refresh trust-boundary docs after Stage 6 IPC/capability-transfer work.
Untrusted-Service Hardening Pass
Cover unmapped pointers, kernel-half pointers, invalid capability IDs, corrupted rings, SQ/CQ overflow behavior, and a service without Console authority. Audit manifest, ELF, SQE, params, and result-buffer paths so untrusted input fails closed instead of reaching kernel panic paths.
Completed context:
- Panic-surface inventory: audited
panic!,assert!,unwrap, andexpectreachable from manifest, ELF, SQE, params, result-buffer, IPC, and spawn inputs. - Ring/user-pointer hostile demos: added unmapped params/result-pointer,
kernel-half params-path, invalid-capability-ID, corrupted RETURN
call_id, corrupted SQ/CQ head, undersized-params, undersized-result, and SQ/CQ overflow coverage. - No-authority smoke: empty-CapSet service verifies expected cap lookups
fail and invalid-cap CALLs return controlled CQEs; after removal of
syscall 0, it proves a no-authority process cannot write and can only
exit/cap_enter.
Remaining decomposition:
- Quota and exhaustion smokes (
make run-untrusted-exhaustion, two QEMU passes; covered 2026-05-25 06:42 EEST):- Cap-table and endpoint-queue exhaustion fail closed without corrupting
existing calls. Endpoint-queue is proven by the small-scratch core pass
(per-owner queue ceiling ->
Overloaded, then a held console call still completes). Cap-table is proven by the small-scratch core pass and the default-profile*-captablecompanion pass: single-frameMemoryObjectallocations first return boundedFrameAllocatorsuccess replies, then continue until the per-process cap-slot ledger fails closed (Overloaded: failed to reserve MemoryObject cap slot); a held console call still completes after the boundary. - Scratch/result-buffer pressure returns controlled errors and later
valid calls still complete (core pass: ring-scratch oversize CALL
rejected with
CAP_ERR_INVALID_REQUEST, reply-scratch clamp returns a serialized exception, then a valid console write completes). - Repeated invalid submissions stay bounded: each structurally invalid
SQE returns a controlled per-SQE error CQE and the ring stays usable (a
recovery NOP completes). Note: the per-key token-bucket log aggregation
in
docs/authority-accounting-transfer-design.md§3 (D1/D2 suppressed- count summary line) is still a design target, not implemented; the smoke asserts bounded per-SQE rejection, not the summary line. - Frame-grant-page exhaustion: not cleanly reachable from a smoke. For
single-page allocations the cap-slot ceiling (
PROCESS_CAP_SLOT_LIMIT, 256) is reached far before the frame-grant ledger (PROCESS_FRAME_GRANT_PAGE_LIMIT, 4096 pages), and reaching 4096 grant pages needs large contiguous allocations whose failure mode is physical fragmentation, not the grant ledger. The cap-table pass exercises the same fail-closed preflight-reserve path. Remaining gap.
- Cap-table and endpoint-queue exhaustion fail closed without corrupting
existing calls. Endpoint-queue is proven by the small-scratch core pass
(per-owner queue ceiling ->
- Fail-closed cleanup: the
FrameAllocatorsuccess-path result serialization now honors the caller’s effective reply-scratch capacity, so small-scratch processes can receive boundedMemoryObjectresult caps before cap-slot exhaustion fails closed. Closed bysecurity-reply-scratch-success-path-limit-local-proof.
Kani Harness Bounds Refresh
- Revisit Kani harness bounds and proof shape once capability transfer,
resource accounting, or user-buffer validation has more concrete proof
obligations. Keep current bounds practical for
make kani-lib; expand only when the added verifier cost buys a specific kernel invariant.
DMA Assurance Model Operationalization
dma-assurance-model-v0 (2026-05-24) landed the accepted proposal
(docs/proposals/dma-assurance-model-proposal.md) and inspectable-only TLA+/Alloy
skeletons (models/dma/), but stopped there: no run target, no CI gate, no
reconciliation with DMA code landed since. Kickoff task:
dma-assurance-model-operationalization
(decomposition — reconciled the v0 model with landed code and emitted the
per-tool slices below).
- Reconcile
models/dma/with landed invariants (ownership-generation on recycle, map-record-before-PTE-install ordering, drive-pin, epoch fence, scrub-before-free): gap table inmodels/dma/README.mdgrounded against the landed symbols, done 2026-06-04. -
make model-dma-tla— bounded TLC run ofdma_authority.tla(pinned TLC 2.19 / tla2tools 1.7.4 + pinned Temurin JRE 17.0.19), lifecycle ordering plus generation-keyed stale completion, record-before-PTE-install split, drive-pin/quarantine, and queue-enable epoch-fence interleavings, checked clean at 2 devices / 2 domains / 2 pages / 2 iovas, generations 0..1, done 2026-06-04:dma-assurance-model-tla-checked-gate. -
make model-dma-alloy— Alloy analysis ofdma_authority.als(pinned Alloy Analyzer 6.2.0), device/domain/IOVA/page/alias authority graph plus the ownership-generation stale-handle gate, checked at scopefor 4, done 2026-06-04:dma-assurance-model-alloy-checked-gate. -
make kani-dma-authority— bounded Kani over an extracted pure DMA-authority core (capos_lib::dma_authority: ownership-generation bump on recycle, stale-handle rejection without mutation, no-re-expose before completion),make kani-libstyle, done 2026-06-04. Faithful extraction of thedevice_dma.rsauthority arithmetic; routing the kernel call site through the core is a tracked follow-up (kernel isno_std/no_main, not host-built):dma-assurance-model-kani-authority-core. -
make model-dma-deferred-completion-loom— focused Loom (pinned 0.7.2) over theDeferredCompletionQueuereservation budget and the multi-CPU TLB shootdown generation re-read (deferred-EOI / completion concurrency the ring Loom does not cover), done 2026-06-04:dma-assurance-model-deferred-completion-loom. - CI wiring into the GitHub gate and local aggregate now that each target has
a checked result.
make dma-assurance-model-checkruns Alloy/TLA+/Loom/ Kani locally whencargo-kaniis installed; GitHub CI runsmodel-dma-alloy,model-dma-tla, andmodel-dma-deferred-completion-loomindma-assurance-models, andkani-dma-authorityinkani-proofs. Done 2026-06-05:dma-assurance-model-ci-wiring.
Scheduler & IRQ Assurance Models
The scheduler is the densest unmodeled concurrency surface in the kernel
(per-CPU atomics read lock-free from ISR context while another path holds the
scheduler mutex via try_lock, plus IPI cross-CPU activation) and has zero
formal coverage today (smoke + measured suppressed-tick counters only). The IRQ
MSI-X waiter race was fixed by reproduction, not a model. Mirrors the DMA
operationalization pattern; tasks reuse the TLC/Alloy/Kani pins that track lands.
- S1
scheduler-nohz-activation-model(done 2026-06-04 09:00 UTC) – TLA+/TLC for the nohz activation/rollback lifecycle + a focused Loom for the lock-freeNOHZ_ACTIVE_CPUSbit vs lockednohz_activation[slot]record race.make model-scheduler-nohz-tlachecks no timer-less CPU (NoTimerlessStall+EventuallyReArmed), bit/record agreement (EventuallyConsistent), and that a staled remote activation is dropped not applied to a newer lease (NoStaleActivation+StaledRecordEventuallyCleared);make model-scheduler-nohz-loomchecks the lock-free-bit ↔ locked-record reconciliation keeps the timer armed. Checked results + mutation/non-vacuity evidence inmodels/scheduler/README.md. - S2
scheduler-lapic-oneshot-timer-model(done 2026-06-04) – Kani over the extracted pure count/clamp arithmetic (capos_lib::clockevent) + a TLA+ mode-transition lemma pinning the halt-first reprogram ordering.make kani-lapic-oneshotproves the clamp window is well-formed, the armed count is in[1, u32::MAX]with nou128overflow, and the count round-trips to the request within one LAPIC count (3/3 SUCCESSFUL).make model-scheduler-lapic-oneshot-tlachecks that after the halt-first reprogram the next fire is the one-shot at the armed count, never the periodic reload (OneshotModeBoundedCount+HaltedDisarmed+ theperiodicFiredInOneshotModesentinel), and that every fire path (user- and kernel-mode consumption) restores a timer source (NoTimerlessStall+EventuallyReArmed/FiredEventuallyRestoredliveness). Checked results + mutation/non-vacuity evidence inmodels/scheduler/README.md. - S3/S4
scheduler-cpu-isolation-lease-authority-model(done 2026-06-04 07:04 UTC) – Alloy for the lease/grant relational invariants + TLA+ for the two-lock teardown and the documented non-atomic createLease-vs-revokeGrant SMP window.make model-scheduler-lease-alloychecks the unforgeable grant->lease binding, no live lease through a revoked grant outside the explicitly modeled bounded window, capacity never undercounting a live lease, and the stale-handle generation gate;make model-scheduler-lease-tlachecks generation advances exactly once per termination, no capacity double-free, the single chokepoint always runs unregister + SQPOLL-stop + nohz-rollback before recycle, no stranded generation (liveness), and that the renew deadline branch never resurrects. Checked results + non-vacuity evidence inmodels/scheduler/README.md. - IRQ
irq-msix-waiter-determinism-model(done 2026-06-04 06:10 UTC) – TLA+/TLC for the waiter <-> delivery <-> deferred-EOI ordering the RX MSI-X waiter fix established.make model-irq-waiter-tlachecks no spurious/early injection (NoCompletedEarly), exactly-once delivery/EOI/completion accounting, EOI drain before route re-arm (EpochDrainSound), and theNoLostWakeliveness property; checked result + mutation evidence inmodels/irq/README.md.
Preserved Completed Security Context
These are completed and should not be re-read by default. They remain here so
future work can find their design context without bloating docs/tasks/README.md.
- Authority graph and resource accounting design for transfer model:
docs/authority-accounting-transfer-design.md. - Supply-chain and generated-code TCB hardening: pinned Limine and
external build downloads, generated-code drift checks, dependency policy,
pinned Cap’n Proto compiler, shared
tools/capnp-build, and deterministic generated-binding comparisons. - DMA isolation model before PCI/virtio/user-driver work:
docs/dma-isolation-design.mddefines short-term QEMU bounce-buffer decision,DMAPool,DeviceMmio,Interruptinvariants, and the userspace-driver transition gate. - ELF parser arbitrary-input coverage: proptest coverage plus a bounded
cargo-fuzztarget. - Telnet IAC filter fuzz coverage:
TelnetFilterextracted tocapos-lib::telnet,fuzz/fuzz_targets/telnet_filter.rsexercises the state machine with structural assertions (Normal/AfterIac emission rules, monotonic emit count). Will travel with the parser when networking moves to userspace perdocs/proposals/networking-proposal.md. - Telnet IAC filter differential round-trip fuzzing
(
fuzz/fuzz_targets/telnet_filter_roundtrip.rs): synthesize arbitrary RFC 854 event streams (Data, WILL/WONT/DO/DONT, SB blocks with payload), encode to wire bytes, and assert that filter output equals the concatenatedDatapayloads. Found a real EXOPL handling bug in the original filter – the option byte right afterIAC SBwas being interpreted as the start of anIAC IACescape, leaving the filter stuck in the subnegotiation state and silently dropping all subsequent data bytes. Fixed via a newAfterSbstate that consumes the option byte unconditionally before entering payload parsing. - Line discipline extraction and fuzz coverage: pure
LineDisciplinelives incapos-lib::line_discipline, returningLineStep { outcome, echo }descriptions. The kernel transport drives it and translatesEcho::Byte/Echo::Backspace/Cancelled/ Submitted/Reprompt into the existingsend_*_track_crcalls. Backed byfuzz/fuzz_targets/line_discipline.rswith structural invariants (line_len <= max_bytes, ±1 line_len delta per Pending step, Cancelled clears, echo only when buffer grows/shrinks). - Future: differential fuzzing against an external Telnet library (libtelnet or a Rust port) to catch RFC conformance bugs the structural and round-trip targets cannot express. Tracked as a follow-up to Track S.14.
- Ring SQE wire validation extraction and fuzz coverage: lifted the
per-opcode
*_sqe_has_unsupported_fieldspredicates from the kernel intocapos_config::ring, exposed a unifiedsqe_wire_validation_errorentry point, and reroute the kernel through it. Addedfuzz/fuzz_targets/sqe_validation.rsplus 12 host unit tests covering the classification boundaries each opcode imposes. Closes the originally planned three-parser fuzz set (elf::parse,manifest::decode, ring SQE decoder). - Well-formed SQE generator oracle
(
docs/tasks/done/2026-06-06/security-sqe-well-formed-generator-fuzz-local-proof.md): the test/fuzz-onlysqe-validation-oraclefeature exposescapos_config::ring::sqe_oracle, which generates validator-accepted SQEs for every opcode accepted by the current build, plus one-field rejecting mutations for reserved fields, unsupported flags, constrained cap/pointer/size fields, opcode-specific constraints, and session-disclosure reserved bits. The existingsqe_validationfuzz target keeps arbitrary-byte coverage and runs the positive/negative oracle on each input. This closes the shared wire-validator oracle gap only; it does not claim cap-table lookup, userspace pointer mapping, transfer-descriptor loading, or full kernel ring semantic coverage. - Track S.17 – sanitizers on host tests – partially landed.
make sanitizer-host-testsruns AddressSanitizer over thecapos-libandcapos-confighost suites (crate set / features mirror thetest-lib/test-configaliases). Outcome: zero findings – both suites pass clean under ASan, including the namedunsafesuspects (FrameBitmap slot indexing, CapTable generation counters,lazy_bufferraw&mut [u8]). The “cheap to add” claim holds for ASan only: it needs no-Zbuild-stdbecause its libc interceptors cover the uninstrumented precompiled std. - ThreadSanitizer (make sanitizer-host-tests-tsan) is blocked upstream, not by a capOS defect. TSan changes the crate ABI, so rustc refuses to link sanitized code against the uninstrumented precompiled std (mixing -Zsanitizer will cause an ABI mismatch). Instrumenting std needs-Zbuild-std, which then fails withduplicate lang item in crate core: sizedfor build-script-bearing dependencies (typenum / libc / cfg-if / subtle) when the sanitizer target equals the host triple – reproduced four ways (plain-Zbuild-std, renamed--targetJSON spec,target-applies-to-host=false, and-Zhost-config). The TSan target is kept wired so it starts passing once the upstream-Zbuild-std+ build-script issue is fixed. capOS concurrency invariants are meanwhile covered by the dedicated Loom model (cargo test-ring-loom); these host unit tests spawn no threads, so TSan’s marginal value here is low.
Hardware, Boot, And Storage Backlog
Detailed decompositions for hardware, boot packaging, block devices, and local
storage. docs/tasks/README.md links here but should not inline these subtasks.
This is a forward-decomposition reservoir: it carries the open frontier
(explicit DDF follow-up tasks, cloud/network next gaps, and the DMA-authority
invariants that constrain new slices). Landed proof-by-proof chronology lives in
docs/tasks/done/,
docs/changelog.md, and git history; this file keeps only one-line “Landed:”
pointers to it where a reader needs to know a capability exists.
DDF Dispatch Budget
Device Driver Foundation was the previously selected milestone and its
production-authority closeout is recorded for the current brokered-bounce path.
Future DDF slices should not reopen the retained review finding as a generic
blocker; they should advance explicit follow-up tasks such as
direct-remapping/vIOMMU production hardware support, device-autonomous MSI-X
delivery, broader writable-DeviceMmio region selection, or follow-on
provider/device variants. Harness-only updates should protect one of those
authority steps rather than add another standalone proof layer.
Landed: the IOMMU/remapping groundwork and its disabled scaffold (DRHD/source/
domain records, MMIO-status diagnostics, disabled IOVA ledger, mapping-lifecycle
preflight) through the bounded QEMU Intel path; see the IOMMU section below and
docs/tasks/done/2026-05-12/ .. done/2026-05-23/.
docs/proposals/device-manager-refactor-proposal.md core refactor has landed:
the device manager is the kernel/src/device_manager/ module tree. Remaining
refactor work is optional risk reduction only: run behavior-preserving
registry, ledger, or proof-internal splits when they reduce the risk of
upcoming DeviceMmio, Interrupt, or DMAPool authority work or unblock that
work’s review. Those slices remain subordinate to behavior-moving DDF authority
slices and to scheduler SMP/nohz prerequisites.
Landed local follow-up: multi-PRP brokered NVMe BlockDevice windows
(ddf-nvme-multiprp-blockdevice-window-local-proof).
Landed local follow-up: the read-cap reply-scratch fail-closed clamp
(storage-file-read-reply-scratch-clamp).
Landed local follow-up: DeviceMmio map/unmap stale-generation proof
(ddf-devicemmio-map-unmap-stale-generation-local-proof).
Landed local follow-up: production DeviceMmio teardown transaction manager
hold proof
(ddf-devicemmio-production-teardown-transaction-local-proof).
Landed local follow-up: production DMAPool buffer lifecycle over the manager
ledger
(ddf-dmapool-production-buffer-lifecycle-local-proof).
Landed local follow-up: manager-owned DMABuffer free/reuse generation
(ddf-dmabuffer-free-reuse-generation-local-proof).
Landed local follow-up: Interrupt waiter reset-generation
(ddf-interrupt-waiter-reset-generation-local-proof).
Landed local follow-up: production Interrupt routed waiter / deferred-EOI
lifecycle over the manager ledger
(ddf-interrupt-production-waiter-lifecycle-local-proof).
Landed local follow-up: provider IRQ/MSI stale-notification hostile lifecycle
proof
(ddf-provider-interrupt-stale-notification-hostile-local-proof).
Landed closeout: the retained DDF production-authority review finding is closed
by ddf-production-authority-closeout.
Keep direct-remapping/vIOMMU and broad umbrella tasks blocked until their named
gates are actually satisfied.
Growing the inline AttachedDmaPoolRecord::proof_buffers slot count beyond
three slots is blocked on a prerequisite refactor: boot-time proof emissions
pass AttachedDmaPoolRecord by value through nested paths starting at
validate_dmapool_budget_policy_for_record
(kernel/src/device_manager/dma_pool.rs) and the descriptor lifecycle
emissions in kernel/src/device_manager/proofs.rs. A direct slot bump to four
double-faulted make run-net with the BSP boot stack exhausted. Prerequisite:
ddf-attached-dmapool-record-by-ref
is done; any future proof-buffer growth should verify it still avoids by-value
stack expansion before increasing the inline slot count.
Device Manager Refactor Track
The refactor keeps the kernel device manager as the single authoritative
ledger for claimed devices. It must preserve the same ownership transactions
across DMAPool, DMABuffer, DeviceMmio, and Interrupt; it should not
create independent managers or move authority decisions into userspace.
Landed: proof split, handles/errors split, domain modules
(mmio.rs/dma_pool.rs/dma_buffer.rs/interrupt.rs), and the
transaction-helper cleanup, all while PciDeviceRecord remains the aggregate
ledger owner. See
ddf-device-manager-proof-split-closeout,
ddf-device-manager-handles-errors-split,
ddf-device-manager-domain-modules,
and
ddf-device-manager-transaction-helper-cleanup.
Open:
- Optional follow-up splits. Further registry, ledger, or proof-internal
splits may run when they are behavior-preserving and reduce near-term DDF
review risk. They must preserve cap semantics, audit labels, proof
labels, QEMU smoke output, lock ordering, and the single aggregate
PciDeviceRecordownership ledger.
Conflict guidance: treat this as part of the DDF kernel-core serial surface.
It owns kernel/src/device_manager/ and overlaps with any DDF slice touching
kernel/src/cap/device_mmio.rs, kernel/src/cap/interrupt.rs,
kernel/src/cap/dma_pool.rs, kernel/src/cap/dma_buffer.rs,
kernel/src/device_dma.rs, kernel/src/device_interrupt.rs, or DDF QEMU smoke
assertions. Do not run it in parallel with scheduler SMP/nohz kernel slices
that need kernel/src/process.rs or kernel/src/sched.rs review capacity if
those prerequisites are the selected blocker.
Bootable Disk Image
Landed (complete track): make image raw hybrid BIOS+UEFI disk image, make run-disk (OVMF) and make run-disk-bios boot proofs, and provider packaging
helpers (make package-cloud-image / package-gcp-image / package-aws-image)
plus the import notes in docs/backlog/cloud-image-import.md. See
docs/tasks/done/ (disk-image-*, closed 2026-05-25). Cloud NIC/storage driver
ownership remains a separate, blocked track below.
Serial Diagnostics Console
Visible outcome: before cloud NIC/storage drivers are trusted, a cloud VM can boot to a COM1 diagnostics prompt and expose enough state to debug ACPI, PCI, interrupt, DMA, storage, and NIC bring-up through the provider serial console.
Landed: the COM1 diagnostics mode (no network/disk), the bounded command set
(help/status/reboot/halt/cpu/mem/acpi/pci/irq/timers/
devices/logs; reboot is a recognized placeholder), the ACPI/PCI and
virtio-net/DMA-ledger/interrupt-route dump slices, and scripted QEMU coverage.
Open:
- Keep the serial path for command/control and bounded diagnostics only. Do not require large binary upload, in-place kernel replacement, or high-volume tracing over provider serial consoles.
ACPI And PCIe Discovery
Landed: Limine RSDP map, MADT LAPIC/I/O APIC enumeration, and MCFG parse with PCIe ECAM config-space access beside legacy QEMU I/O-port access.
Interrupt Infrastructure
Depends on ACPI and SMP Phase C LAPIC timer/IPI.
The MSI-X proof is kernel-owned: virtio-net config/RX/TX sources are recorded in
the device interrupt registry against a bounded first-fit LAPIC device MSI
vector pool, programmed through the typed PCI MSI-X table helper, claimed and
unmasked by the in-kernel virtio-net owner, assigned to virtio vector registers,
and proved by the TX source’s dispatch counter. A metadata-only QEMU
virtio-rng function reuses the same path with a distinct claimed-masked owner.
That virtio-rng function is a QEMU-only proof fixture, not a production
driver and backs no userspace-facing capability (see
virtio-rng); the entropy service
is the separate RDRAND-backed EntropySource cap.
Legacy I/O APIC routes have a bounded QEMU proof through the same registry.
Landed (kernel-side proof evidence, docs/tasks/done/2026-05-* / done/2026/;
also make run-net, make run-interrupt-grant, make run-hardware-audit*):
masked I/O APIC routing foundation, MSI/MSI-X capability discovery, the static
and registry-backed virtio-net source-route proofs, the device MSI vector pool +
exhaustion policy, claimed-route lifecycle / vector reassignment / stale-route
rejection, driver-owned mask/unmask, the second-device (virtio-rng) proof, the
first device-manager ownership and interrupt-source handoff proofs, the bounded
teardown-trigger contract (seven object-backed rows), cap-specific
release/process-exit/driver-crash/reset-disable/interrupt-waiter teardown hooks
for DeviceMmioCap/InterruptCap/DmaPoolCap/DmaBufferCap, the read-side
HardwareAuditLog.snapshot coverage, pending-IRQ token validation through
capos-lib::device_authority, and bounded Interrupt
wait/acknowledge/mask/unmask admission promoted to bounded route-state
control plus one manager-grant-source routed waiter / deferred-EOI lifecycle
proof (make run-interrupt-grant).
Open:
- Continue real interrupt-source teardown beyond the manager-grant-source routed waiter proof: provider-driver IRQ/MSI waiters now have a local hostile stale-notification proof for reset/release/provider-death/waiter- cancel boundaries, but broader process-exit/driver-crash/reset-disable smoke coverage must keep using the proven ownership lifecycle rather than a separate route cleanup path.
- Expose userspace
Interruptauthority only after source ownership, generation checks, broader stale-notification lifecycle wiring, and the S.11.2 hostile IRQ smokes are implemented. - Add a selected-mode x2APIC QEMU proof over the landed x2APIC MSR backend
(
kernel/src/arch/x86_64/lapic.rs):make run-interrupt-grant-x2apicboots with-cpu qemu64,+smep,+smap,+rdrand,+x2apic, assertsLapicMode::X2Apic, and reuses the routedInterrupt.wait/Interrupt.acknowledgeproof. This remains a bounded local proof, not a high-core hardware readiness claim.
PCI/PCIe Infrastructure
Promotes PCI enumeration from a networking substep to a reusable subsystem consumed by all device drivers.
Landed: PCI config access via legacy I/O ports and PCIe ECAM, the ECAM function
mapping cache/ledger, full Q35 bus enumeration (scanned_buses=256), BAR
parsing + reusable kernel MMIO subregion mapping, MSI/MSI-X metadata discovery,
the second-device (virtio-rng) PCI proof, and the metadata-only QEMU NVMe PCI
proof (make run-pci-nvme). See docs/tasks/done/.
NVMe userspace-bind chain (forward-relevant; landed steps with their successor gaps preserved):
- Landed: the Model B kernel on-notify DMA validator
(
nvme-doorbell-dma-validator,kernel/src/cap/nvme_doorbell_validator.rs,validate_doorbell_scan/completion_wakes_waiter): provider-writes / kernel-validates, fails closed outside the owner’s granted DMA window. Synthetic owner windows stand in for the live grant ledger; wiring the validator into a live NVMeDeviceMmiodoorbell claim is valid only on a verified direct-remapping/vIOMMU or synthetic-address lane. The current no-IOMMU QEMU/GCP lane must use brokered queue-base/PRP materialization instead. Design:docs/proposals/nvme-model-b-doorbell-dma-validator.md; provenance:docs/devices/nvme.md; reconciliation:docs/dma-isolation-design.md(Provider-Written Addresses And No-IOMMU Brokered Bounce). - Landed: the read-only userspace NVMe bind (
nvme-bind-claimed-mmio-read), userspace NVMe controller reset (nvme-controller-reset-selected-write,CC-scoped fail-closed selected write), and the no-IOMMU brokered controller enable (DeviceMmio.brokeredNvmeControllerEnable, schema@6; kernel-authoredAQA/ASQ/ACQfrom the liveDMAPoolledger, no provider-supplied CC bits or host-physical/device-visible address). Proofmake run-pci-nvme; provenancedocs/devices/nvme.md§§5-6. - The provider-written Model B enable
(
nvme-userspace-bind-and-controller-bringup) remains a separate direct-remapping/vIOMMU lane (still open / blocked).
Open:
- Add userspace
DeviceMmioauthority and ownership boundaries for out-of-kernel drivers only after the device-manager andDMAPoolgates below are in place. - Extend beyond metadata-only discovery to virtio and NVMe driver binding as those reusable driver paths land.
Device Authority And Userspace Driver Gate
Ordered after the generic MSI/MSI-X dispatch table and second-device proof.
The current brokered-bounce provider paths have landed their local/GCE evidence;
future direct-remapping/vIOMMU, provider-written-address, hostile-hardware, or
broader device-owner paths remain gated by the selected backend contract and
Security Verification Track S.11.2 in docs/dma-isolation-design.md.
DMA authority invariants (settled; these constrain every new slice — do not
weaken them). Per docs/dma-isolation-design.md (accepted): backend selection
is a runtime, fail-closed kernel decision — direct IOMMU remapping only when a
probe verifies usable hardware, otherwise kernel-owned bounce buffers. On the
no-IOMMU lane the manager is the single owner of every bounce page’s
host-physical address and IOVA: host_physical_user_visible=0,
direct_dma=blocked, iova_export=disabled-future-only, real DMA
not-attempted. Pool/buffer/handle lifecycle is generation-checked and
fail-closed on stale/freed/wrong-owner/wrong-state; pages stay committed,
resident, and unswappable while device-visible and are scrubbed before release;
quiesce + scrub precede free; stale completions and stale IRQs after reset must
not wake a waiter or mutate accounting. The device-manager ledger is the single
record of DMA pool bytes, buffer count, descriptor/ring depth, page-rounded MMIO
mappings, interrupt holds, in-flight DMA submissions/completions, ownership
generations, budget/OOM policy, and teardown state.
Landed (prerequisite proofs and the first production userspace surface;
docs/tasks/done/, make run-net / make run-dmapool-grant /
make run-dmapool-grant-exit / make run-devicemmio-grant /
make run-interrupt-grant / make run-hardware-grant-cycle /
make run-hardware-audit* / make run-ddf-provider-consumer /
make run-iommu-remapping):
- the in-kernel device-manager object model, interrupt-source attach/detach, the
kernel-owned
DMAPoolaccounting / budget / OOM / tamper / over-budget proofs bound to attached records, the imported-live-accounting record over thedevice_dmaledger, and thedevice_dmazero-live / stale-handle / stale-completion / publication scratch proofs routed through the purecapos-lib::device_authorityvalidators; - the documented production handle epoch invariants plus their pure validator and host tests; the manager-attached DMA-buffer record proof;
- the production
DMAPool.allocateBufferresult-cap method and its manifest grant, plus admission/typedDMABuffersubmitDescriptor/completeDescriptor/mapcoverage, the userspace-VMA bounce-buffer map + protection hardening, the shared descriptor validator and manager-inflight accounting, the userspace-visible completion effect, and the provider-visible shadow-descriptor / selected-queue-entry side effects feeding the provider-consumer gate; the selected virtio-net TX backend + notify-offset claim policy; - the bounded sequential
DeviceMmio/Interruptgrant-cycle reuse proof; admission + shared-validator + real-effectDeviceMmiomap/read32/write32coverage; cursorable/edge read-side audit snapshots; and the first productionDeviceMmioCap/InterruptCapcap-release + process-exit hooks; the exposedDeviceMmiouser-map path records a manager-owned user hold (borrowed VMA, page-rounded BAR window, mapping generation, and selected-write policy label) and explicit unmap, cap release, and process exit clear it before detaching the reusable mapping generation; driver-crash and reset/disable hook markers remain bounded no-userspace-MMIO proofs that assert no user hold is live before detach; - the manager-handle identity fields carried into the result-only
DMAPool/DMABuffer/DeviceMmio/Interruptinfosurfaces; - the real pinned-page
DMAPoolpage-lifecycle slice (ddf-real-dmapool-pinned-page-realness, done 2026-05-26): the kernel ledger owns real scrubbedframe::alloc_frame_zeroedpages and the manager imports a live snapshot on the honest bounce-bufferrun-netpath; - the S.11.2 hostile smokes (stale DMA handles, descriptor abuse, revoke/reset
races, stale IRQ after reset, stale DMA completion after reset, exit-under-DMA;
S.11.2.7/8 over real free/realloc on
make run-net; the IOMMU-backed production matrix onmake run-iommu-remapping); seedocs/tasks/done/2026-05-26/and the IOMMU section; - the first exposed userspace
DeviceMmio+Interruptsurface (ddf-userspace-writable-devicemmio-interrupt, done 2026-05-26): read-only BAR map + brokeredread32+ a realwrite32on a claimed register, manager-capwait/mask/unmaskwith deferred delivery and no-stale-wake-after-revoke, and real-route userspacewait/acknowledgewith deferred LAPIC EOI through the providertx_interrupt/rx_interruptcaps driven by a userspace process (make run-ddf-provider-consumer), plus the non-implication negative-authority assertions on both grant smokes.
Open:
- Require DDF authority-surface hazard preflight before new behavior slices. The slice handoff/review prompt should state the relevant paging/MMIO, DMA, IRQ, ABI, and docs-authority invariants before code changes start. This is a workflow gate for avoiding bounded-proof overclaims and late review discovery of known infrastructure hazards.
- Broader writable-
DeviceMmioregion selection remains out of scope until a separate manager-selected register-window design lands. - Direct-remapping/vIOMMU, provider-written device addresses, and hostile bus-mastering hardware isolation remain future work. The current no-IOMMU cloud path stays on brokered bounce-buffer authority.
- Physical Store-backed hardware-audit local persistence, keyed segment
seals, and runtime subscriber refusal are closed by
hardware-audit-physical-persistence-signing-local-proof: the QEMU proof reuses onepersistent_storedisk across two boots, recovers pass-1 audit segment blobs through Store inventory before pass-2 drain, verifies development-source RAM-local HMAC segment seals, reports key lifecycle caveats, and refuses runtime reader admission until an authority-broker path exists. External verifier key custody, production rotation/revocation, rollback resistance, and broader runtime admission remain future; audit is observer evidence and does not grant DMA/MMIO/IRQ authority. - Device-autonomous MSI-X local APIC delivery is closed by
cloud-prod-qemu-kvm-virtio-net-msix-apic-delivery-resolutionand the dependent RX waiter proofcloud-prod-virtio-net-rx-device-autonomous-msix-raise-local-proof. The current provider path can still use polled completion when interrupt delivery is not required, and live-GCE device-autonomous interrupt evidence remains future work.
IOMMU/DMAR/AMD-Vi Staging
Deferred-with-known-dependency planning gate. capOS has a bounded QEMU Intel
remapping implementation for the selected smoke path, not a general hardware
isolation claim for production NIC or storage ownership. The selected QEMU Intel
path programs manager-owned per-device domains for two claimed DMA-capable
functions, exports only domain-scoped IOVAs, hides host physical addresses, and
fails closed for stale or wrong-owner domain assignment; it emits an honest
direct-DMA posture (real_dma=attempted, direct_dma=enabled,
remapping_tables=programmed) over the real ledger, with mappings installed
before the doorbell and invalidated/IOTLB-flushed before reuse, while
hostile_hardware_isolation stays not-claimed (QEMU-emulator evidence).
Current no-IOMMU cloud/user-provider paths use brokered bounce-buffer authority,
not direct DMA. Direct-remapping/vIOMMU work, trusted sharing groups, and
hostile-hardware isolation remain blocked on their own future gates in
docs/dma-isolation-design.md.
Landed (umbrella + children, docs/tasks/done/2026-05-12/ ..
done/2026-05-26/; make run-iommu-acpi, make run-iommu-remapping):
the IOMMU dependency record, bounded Intel DMAR / AMD-Vi IVRS ACPI discovery,
DMA-capable-function attach + uncovered marking, the per-device DMA domain
policy and its pure fail-closed admission helper, the COM1 diagnostics mirror,
the disabled table scaffold + MMIO-status diagnostics + disabled IOVA ledger +
mapping-lifecycle preflight, the first real QEMU Intel table-programming smoke
(real VT-d table programming, hardware-DMA translation, two-phase
invalidation/IOTLB-flush revocation, IOMMU-backed hostile stale-DMA smokes),
production DMAPool ledger integration, domain-scoped IOVA export discipline,
fault recording/diagnostics, per-device domain granularity, the no-usable-IOMMU
fallback policy, the IOMMU-production teardown/bounce-buffer S.11.2 matrix, and
the honest direct-DMA posture line
(ddf-iommu-remapping-production-closeout).
Open (future, not on the bounce-buffer critical path): AMD-Vi programming,
scalable-mode / interrupt-remapping / device-IOTLB, aw-bits=48 4-level tables,
trusted multi-device sharing groups, and production cloud NIC/storage driver
ownership remain separate future tasks. kernel/src/iommu.rs stays
cfg(feature = "qemu")-gated as a separate verified-remapping lane.
Reusable Block-Device Path
Landed: the device-generic virtio queue/transport helpers factored into
kernel/src/virtio.rs pub(crate) mod transport
(ddf-virtio-transport-helper-factor), the device-agnostic VirtqueueDma
DMA/notify seam + seam-driven Virtqueue/DmaPage + parameterized
discover_modern_transport (ddf-virtio-driver-foundation-boundary), the
virtio-blk sector read/write smoke (make run-virtio-blk,
ddf-blockdevice-boundary-virtio-blk-smoke), the first BlockDevice
trait/CapObject boundary (kernel/src/cap/block_device.rs), and multi-device
virtio-blk support + a target-disk grant source (make run-multi-virtio-blk,
KernelCapSource.blockDeviceTarget @44, ddf-multi-virtio-blk-device-support).
Landed: block_device_target now resolves by manifest PCI
segment:bus:device:function identity and fails closed when the selector is
absent, mismatched, or names the resolved boot disk; proof
make run-blockdevice-target-identity.
See docs/tasks/done/2026-05-25/, done/2026-05-26/, and
done/2026-06-05/.
Open:
- Add storage services behind userspace ownership:
storage-userspace-persistent-store-namespace-service-local-proofmovedStore/Namespaceserving onto a persistent userspace service (make run-storage-persist-service), andstorage-userspace-directory-file-service-local-prooffollowed withDirectory/Fileserving and result-cap transfer from userspace (make run-userspace-directory-file-smoke). - Retire the ambiguous kernel-owned
Store/Namespace/Directory/Fileproduction storage routes:storage-legacy-kernel-storage-cap-backer-retirementgated the RAM-backedfile/directory/store/namespacekernel grant sources behindqemu(fail-closed in the default production kernel, joining the already-gated virtioread_only_fs_root/persistent_store/writable_fs_rootmount sources) and named all remaining kernel storage backers as proof/fixture surface in code and docs. Production storage is userspace-served; the defaultsystem.cueboot grants no kernel storage caps. - Retire the transitional kernel virtio-blk production owner:
storage-legacy-kernel-virtio-blk-path-retirementratified that the kernel-owned virtio-blk driver, itsBlockDevicecap arm (BlockDeviceBackend::Virtio), and its PCI discovery (diagnose_qemu_virtio_blk) are allqemu-feature-gated; the default production kernel never binds virtio-blk and resolvesblock_deviceto the userspace-brokered NVMe arm (BlockDeviceBackend::NvmeBrokered, fail-closed without a verified controller and a livedevice_mmiogrant), withblock_device_targetfail-closed (requires the qemu feature). virtio-blk is named as a qemu fixture / regression in the device doc, smoke scripts, and fixture manifests; the production-storage gate is therun-cloud-provider-nvme-blockdevice-*chain. The kernel broker responsibilities (PCI claim arbitration, MMIO/IRQ/DMA admission, bounce/IOMMU isolation, stale-generation rejection, and revocation) stay kernel-owned and are the same surfaces the userspace storage driver binds into.
Local Disk Storage Milestone
Visible outcome: default storage-focused QEMU boots from a disk image, exposes a read-only directory from local disk, and proves one capnp object can be persisted and read back after reboot. Milestone complete.
Landed (docs/tasks/done/2026-05-14/ .. done/2026-05-25/): the
Store/Namespace + file-I/O schema slices and RAM-backed naming round-trip
proof (make run-storage-naming); virtio-blk wired into BlockDevice
(make run-virtio-blk); the read-only filesystem service over BlockDevice
(kernel/src/cap/readonly_fs.rs, CAPOSRO1, make run-storage-fs); and the
disk-backed persistent Store with a two-boot reboot proof
(kernel/src/cap/persistent_store.rs, CAPOSST1, make run-storage-persist).
Disk-backed delete tombstones entries in place; a later put that would hit
the entry-table or data-cursor limit now compacts live CAPOSST1 store entries
through a shadow generation before recommitting the canonical front generation
(make run-storage-persist). Store/persistent durability across passes
rests on host page-cache coherence; a virtio FLUSH for write-back-cache media
durability is deferred to the Writable milestone.
Writable Local Storage Milestone
Visible outcome: a storage-focused QEMU image can create, overwrite, truncate,
rename, and remove files through capability-scoped Directory/File caps,
persist both file and store mutations across reboot, and recover to a
consistent state after an unclean shutdown test. Milestone complete.
Landed (docs/tasks/done/2026-05-26/; make run-storage-writable,
make run-storage-writable-recovery): the fail-closed single-writer policy
(documented in the storage proposal); directory mutation
(create/mkdir/remove/rename, additive Directory.create @5/rename @6)
and writable File paths (overwrite/append/truncate/sync/close, bounded by
MAX_FILE_BYTES 64 KiB) over kernel/src/cap/writable_fs.rs; disk-backed
write-through persistence of the CAPOSWF1 sub-volume co-located with the
CAPOSST1 Store in one combined image (now produced by
tools/mkstore-image --writable); real File.stat created/modified
timestamps with internal ClockProvenance labels carried from the same
WallClock source in the CAPOSWF1 node record;
and one forced-poweroff unclean-shutdown recovery proof (proof-only
storage_writable_recovery feature) verifying the superblock-commit-ordering
invariant.
Bounded-proof caveat: the recovery proof exercises one record-vs-commit window
under host-page-cache durability (no VIRTIO_BLK_F_FLUSH; kill -9 preserves
the host page cache); it proves the kernel’s superblock-commit-ordering
invariant, not general media crash-consistency against host power loss. The
co-located CAPOSST1 Store now has bounded tombstone reclamation through
make run-storage-persist; writable-file extent reclamation remains future
work.
Managed Cloud Store Bridge
Visible outcome: application services can persist bounded Cap’n Proto records through a cloud-backed capability while local QEMU tests exercise the same semantics through a fake bridge.
Open gates:
- Define a provider-neutral
CloudStoreBridgeor app-specificSaveStoreinterface with put/get/compare-and-set/append operations, explicit size limits, profile or tenant scoping, schema version, and stale-write rejection. - Add a local fake-cloud bridge used by host tests and QEMU smokes. It must reject wrong-profile loads, stale mutable writes, oversized records, and ledger rewrites.
- Add a GCP deployment note for Cloud Run bridge service, Firestore Native mode mutable indexes/profile summaries, Cloud Storage versioned blobs, and Secret Manager credentials.
- Add Cloud KMS keying notes for managed game-world storage: key ring/key per world or shard, narrow encrypt/decrypt IAM authority, rotation, retired world revocation, and audit logging.
- Keep provider credentials outside ordinary capOS clients. Only the bridge service receives cloud credentials; game/storage clients receive narrow capabilities.
- Add lifecycle/retention/cost controls before writing real snapshots or evidence blobs to Cloud Storage.
- Treat local disk-backed
Storeas the offline/QEMU baseline even when cloud persistence is available.
User-Owned Browser Save Transport
Visible outcome: private user data can be backed up through the user’s browser to Google Drive or Firebase as encrypted capsules while capOS never receives provider tokens.
Landed (policy + host-test gates): the provider-neutral browser transport
policy for opaque encrypted save capsules / opaque provider handles / capsule +
wrapped-DEK metadata; fake Drive and fake Firebase host-test adapters modeling
deletion / duplicate writes / stale versions / rollback / missing network /
non-opaque handles / authenticated-user mismatch / Firebase auth UID path
injection; the Drive appDataFolder (drive.appdata) and Firebase/Firestore
per-user-capsule notes; and the KMS / token / key-capability boundary records
(browser transports ciphertext + handles only).
Open (future real-provider integration):
- Implement real Google Drive and Firebase browser-companion adapters after the provider-token boundary is exercised outside ordinary capOS clients.
- Reuse the existing save-capsule restore rejection tests as the acceptance gate for real provider adapters: tampered, wrong-profile, stale, oversized, unknown-content, and unsigned capsules must still fail before provider bytes can mutate save state.
- Add real-provider failure-mode coverage for deletion, duplicate writes, stale versions, rollback attempts, offline cache/sync replay, and missing network using the same semantics as the fake adapters.
Boot Binary ISO Layout
Move ELF payloads out of the Cap’n Proto manifest blob and into explicit boot
package sources. The CD-ROM path uses ISO 9660 files read on demand through a
minimal kernel ISO driver; the raw disk and cloudboot paths use Limine-loaded
modules staged on the FAT ESP. Both keep the manifest as topology and decouple
ordinary service binary bytes from NamedBlob.data. capOS remains Limine-backed
for the current boot line; Limine supports FAT and ISO9660/CD-ROM media, so
CD-ROM/ISO is a planned boot/install variant rather than a path to delete.
Landed (docs/tasks/done/2026-05-24/; make run-boot-iso-read,
make run-boot-iso; producer guard added 2026-06-06): the minimal ATA PIO
CD-ROM read_sectors reader (boot_iso_read), the read-only ISO 9660 driver
(open_file(name) -> (lba, size), fail-closed bounds), mkmanifest --copy-bins (names-only manifest, empty NamedBlob.data) with producer-side
rejection for names whose ISO 9660 d-character form exceeds the level-3
31-character limit or collides after normalization, the opt-in make capos-name-only.iso, the kernel
run_init() on-demand-read switch + BootBinary registry behind the boot_iso
feature (make run-boot-iso), and the BOOT_MANIFEST_MAX_BYTES doc + the
-iso-level 3 name-only ISO build recipe. Landed 2026-06-07 21:36 UTC in
commits 22320411 and f0695442: the default make image raw disk and
make capos-cloudboot-image cloudboot targets now use a name-only manifest plus
Limine module payloads staged under /boot/bins/; see
boot-limine-disk-boot-binary-source-local-proof.
Landed 2026-06-07 21:59 UTC: the default make, make run, and
make run-smoke ISO paths now use name-only manifests plus boot_iso
on-demand reads from /boot/bins/, so ordinary service ELF bytes no longer ride
in NamedBlob.data for the default bootable ISO paths. The generic embedded ISO
rule remains available for focused fixtures that have not moved to a name-only
boot source.
Closed:
- After that source is proven,
boot-embedded-data-retirement-and-atapi-userspace-servingretires the embedded-data branch for ordinary service binaries. The retained ATAPI/ISODirectory/Filecap is explicitly a QEMU install-source fixture over the early boot reader, not a general kernel filesystem service; broader post-bootstrap package browsing remains a userspace-service concern outside this fixture.
Cloud Device Tracks
These are portability notes, not implementation evidence. The first cloud
milestone is imported-image serial-console boot; provider NIC/storage drivers
are later usable-instance work and remain blocked by cloud-provider binding,
DMA/IOMMU or explicitly accepted bounce-buffer policy, interrupt, teardown, and
network/storage evidence gates above. Local implementation and *-local-proof
records in this track run under host tests, QEMU, or local cloudboot-image QEMU
unless their acceptance explicitly says otherwise; they must not be blocked on
cloud access. The local bounded provider-consumer closeout does not implement a
cloud-ready userspace virtio-net, virtio-blk, virtio-scsi, NVMe, gVNIC, ENA, or
cloud storage/NIC driver. The GCP-first usable-instance provider rollup is
closed by
cloud-usable-instance-provider-nic-storage;
future public ingress, AWS, Azure, broader storage, high-throughput NIC, and
direct-remapping lanes remain separate work.
Access correction (2026-05-27, updated 2026-06-06). The GCP cloud tracks are
NOT prefix-blocked on cloud access. Local implementation and *-local-proof
tasks stay dispatchable once their local prerequisites are satisfied, including
tasks named cloud-prod-* that only boot the production cloudboot kernel under
QEMU. Only live/billable proof tasks that cross a provider API, provider
hardware, public ingress, public CA/DNS, or explicit make cloudboot-test
acceptance require access authorization. GCE access is provisioned for the
configured cloud sandbox project: tools/cloudboot/run-test.sh is hardcoded to
it (no public IP,
no service account/scopes), the 2026-05-24 GCE live probes recorded
n1-standard-1, e2-small, c3-standard-4, and n2d-standard-2
Confidential shapes (IOMMU disabled → SWIOTLB → labeled bounce-buffer) in
Cloud DMA Provider Evidence Inventory,
and Cloud Build runs Kani proofs (tools/cloudbuild-kani.yaml). The local QEMU
virtio-net/NVMe foundations and the local production cloudboot bind markers
exist. The GCP live NVMe Persistent Disk read proof is now closed by
cloud-gcp-storage-driver; remaining live driver slices are blocked only by
their own local authority, product-scope, and real-provider evidence gates.
Slices that only need structured serial evidence from already-production code
(for example cloud-network-terminal-access path 1) are runnable on real GCE.
Cloud-Leg Decomposition Track (2026-05-24)
The cloud-usable-instance-provider-nic-storage umbrella was decomposed into
discrete slices and is now closed as the GCP-first provider rollup.
Landed foundation (docs/tasks/done/2026-05-24/ .. done/2026-05-30/):
the cloud DMA provider-evidence inventory, the runtime fail-closed DMA backend
selection mechanism (cloud-dma-backend-selection: probe → fail-closed select →
manifest override; authoritative contract in the “Cloud DMA Backend” section of
docs/dma-isolation-design.md), the local-QEMU GCP virtio-net binding precursor
- cloud-shape classification, the production (non-
qemu)cloudboot-evidence: dma-backend/device-class/device-inventorymarkers, the minimal read-only production PCI enumeration surface, and the production DDF/PCI bind-stack decomposition.
Landed production bind-stack children (terminal local-bind markers settled with
a kernel-side dispatch-slot proxy where the userspace driver authority surface
is cfg(feature = "qemu")-gated out of the non-qemu build):
cloud-prod-pci-claim-inventory, the DeviceMmio BAR-readback grant
(make run-cloud-devicemmio-grant), the DMAPool bounce-buffer grant
(make run-cloud-dmapool-grant), the interrupt route-alloc + live-delivery
proofs, the terminal provider-nic-bound / storage-bound proxy markers, the
three production userspace-provider grant-source proofs
(DeviceMmio/DMAPool/Interrupt), the aggregate grant-surface closeout, and
the real provider-cap-side Interrupt.wait/acknowledge cap-waiter proof
(cloud_provider_cap_waiter_proof, make run-cloud-provider-cap-waiter). See
docs/tasks/done/2026-05-28/ and the task-graph reconcile
cloud-live-driver-task-graph-reconcile.
Landed virtio-net userspace-provider chain (the stale parent is closed by its
child sequence, make run-cloud-provider-virtio-net*;
docs/tasks/done/2026-05-28/ .. done/2026-06-07/): the
non-qemu-buildable virtio modern-transport host surface
(kernel/src/virtio_transport.rs), the device bring-up proof, the same-BDF
DeviceMmio+DMAPool+Interrupt authority bundle, TX and RX queue
materialization, MSI-X function-enable, TX submit/doorbell + polled completion,
the userspace DMABuffer map/submit live-publish path, TX and RX MSI-X
wait/ack, RX userspace-submit, the production-IDT real-interrupt-gate dispatch
wiring, the RX polled-completion-no-inject proof, the always-built polled
provider graduated off the per-proof feature, the real-polled-driver
provider-nic-bound re-point (removing the proxy as source), the polled
teardown + driver-death/process-exit stale-authority discipline, the
legacy/transitional virtio 0.9 PIO + INTx local bind, and the real-GCE
legacy-polled provider-nic-bound run.
Landed NVMe brokered userspace-provider chain (the parent is closed by its child
sequence; make run-cloud-provider-nvme-*;
docs/tasks/done/2026-05-29/ .. done/2026-06-05/): read-only bind →
controller reset (selected CC-clear write) → admin queue materialization →
brokered controller enable (manager-op DeviceMmio.brokeredNvmeControllerEnable
@6, manager-authored AQA/ASQ/ACQ; raw CC.EN-set fails closed) → admin
IDENTIFY (@7, then split SUBMIT @8 / COMPLETE @9) with the admin-completion
Interrupt.wait/acknowledge handoff over the cap-waiter MSI-X route → I/O
queue-pair create (@10/@11) → I/O READ (@12/@13) → WRITE (@14/@15,
read-back match) → arbitrary/second LBA (@16/@17) and multiblock
(@18/@19) → single-call synchronous poll-read (@20/@21, no
Interrupt.wait on the data path) and inline read-bytes (@22) → the
BlockDevice.readBlocks-shaped fixed-LBA then arbitrary-LBA read arm
(BlockDeviceBackend::{Virtio,NvmeBrokered}) → readonly_fs over the NVMe
BlockDevice (single-file then multi-file dir-walk) → writeBlocks @1
durability + real FLUSH @3 (opcode 0x00) + clean-reboot persistence +
forced-poweroff crash-consistency → persistent_store and writable_fs (plus
recovery) over the NVMe write arm → File.sync/Store-commit routed to a real
NVMe FLUSH → the capstone read-arm graduation into always-built production
(fail-closed runtime capability probe kernel/src/nvme_storage_backend.rs) and
the always-built device_manager::nvme_sync_io_state sync-I/O state seam →
dedicated data-path completion interrupts for BlockDevice.writeBlocks @1 and
readBlocks @0 (make run-cloud-provider-nvme-io-completion-interrupt).
All brokered NVMe steps hold the no-IOMMU discipline: PRP1/queue-base addresses
are manager-owned bounce buffers, never exported; no provider-written
queue-base/PRP/SGL address, no host-physical or IOVA export, no direct-DMA
claim, no cloud/guest IOMMU assumption. QEMU caveat: “an unflushed write rolls
back” is not provable under QEMU’s -device nvme cache=writeback model
(unflushed_rollback=not-provable-under-qemu-nvme-model).
Open / blocked:
-
cloud-usable-instance-provider-nic-storage(done 2026-06-07) — closeout-only rollup over the landed GCE evidence: serial-console operator access (1779868872-2424), live legacy virtio-net raw-frameprovider-nic-bound(1780412056-e1cb), live NVMe Persistent Disk brokeredREAD(1780806087-bf69), and the separate gVNIC raw-frame / typed-Nic portability runs (1780794927-1aa9,1780796615-decc). This closes the GCP-first provider NIC/storage bar without claiming public L4 ingress, AWS/Azure, broader storage variants, direct DMA/remapping, or high-throughput NIC readiness. -
cloud-gcp-nic-enumeration-evidence(blocked/decomposed 2026-05-27) — coupled honest production-path enumeration markers to aprovider-nic-bound+--require-provider-nic-proofgate the harness reserves for the driver slice, plus a billable real-GCE run an autonomous worker cannot self-authorize. The honest production-marker slice landed; theprovider-nic-bound+ real-GCE proof folds intocloud-gcp-virtio-net-nic-driver. -
cloud-prod-virtio-net-userspace-provider-local-proof(done/closed 2026-06-07 02:54 UTC) — this stale parent is closed by the landed child chain above. The local non-qemucloudboot/QEMU path has the modern TX/RX provider proofs, always-built polled provider, honestprovider-nic-boundmarker sourced from real polled TX+RX progress, and clean-release plus process-exit teardown. The GCE-compatible legacy-polled path also passed real GCE through the billablecloud-prod-gce-billable-boot-real-polled-nic-boundrun. Remaining future lanes are L4 socket/smoltcp relocation, literalsystem.cueprovider fold, reusable full-NIC/multiqueue readiness, and live-provider device-autonomous MSI-X evidence. -
cloud-prod-nvme-brokered-userspace-provider-local-proof(done/closed 2026-06-07 02:08 UTC) — this stale parent is closed by the landed child chain above. The local non-qemucloudboot/QEMU path has the brokered controller/admin/I/O provider proof,BlockDeviceread/write/flush and filesystem consumers, dedicated data-path completion interrupts, and NLB > 8 multi-PRP windows with manager-authored PRP lists. Remaining future lanes are a second namespace, FUA/DSM, live GCP evidence, device-autonomous MSI-X completion delivery, and any direct-remapping/vIOMMU/provider-written-address model.
Production Bind-Stack Port (qemu-gate dissolution)
The cloud-prod-*-local-proof chain proved each behavior behind a focused
per-proof Cargo feature (cloud_*_proof) that compiles a kernel-side
cap::*_proof module into the non-qemu build only when its feature is on.
Those proofs are correct but do not graduate the underlying device surface to
always-built production code. The qemu feature conflates three jobs: (1)
test-harness affordances (isa-debug-exit shutdown, self-tests,
diagnostics/measure/debug_tap/boot_iso/storage_writable_recovery,
the VT-d smoke) that must stay compile-gated; (2) unproven-on-hardware device
surface kept dormant; (3) genuine host capabilities that should be
runtime-probed, not compile-gated. The unlock is not removing the cfg and
not an “am-I-QEMU” runtime branch (it links unproven MMIO/DMA into
production = fail-open against the brokered-DMA discipline, and forfeits
dead-code elimination as a TCB property). The unlock is to dissolve the gate
per-piece: port each dormant capability into always-built production code
as it is proven, fronting hardware-dependent behavior with a fail-closed
runtime capability probe (the kernel/src/dma_backend.rs
probe → fail-closed → manifest-override pattern). Hard caveat: the no-IOMMU
bounce-buffer discipline is preserved (host_physical_user_visible=0,
direct_dma=blocked, iova_export=disabled-future-only), and
kernel/src/iommu.rs stays cfg(feature = "qemu")-gated as a separate future
verified-remapping lane.
Umbrella: cloud-prod-ddf-bindstack-qemu-gate-dissolution
(done 2026-05-30).
Landed children (docs/tasks/done/2026-05-29/ / done/2026-05-30/): the RX
MSI-X waiter-determinism fix (the provider-consumer flake was a
synthetic-RX-dispatch delivery-ordering race; gating injection on the waiter
thread being parked in cap_enter, 28/28 green), grant-source despecialization
(stage_with_class + ProdGrantClass), ECAM/MCFG enumeration graduation
(fail-closed runtime MCFG probe), MSI-X programming graduation
(cap::interrupt_programmed::program_attach_arm_unmask +
device_interrupt::wait_kernel_injected_dispatch now always-built), the
device-manager backend port (always-built ProductionDeviceTable device-record
/ bounce-DMA / interrupt-route backend replacing the device_manager::stub
slot), and the qemu/test_harness feature split.
Open:
- [~]
ddf-provider-consumer-dmabuffer-page-fault-baseline(blocked/premise-refuted) — the reported deterministic DDF/QEMUDMAPool/DMABufferPAGE FAULT did not reproduce (0/28 ond2a342d2, byte-identical kernel to45c4beb9). Keep historical unless new evidence re-establishes the original fault.
The local virtio-net and NVMe userspace-provider parents are both closed by their child chains, so the live provider tasks now sit behind their own real-cloud evidence and product-scope gates rather than stale local-parent blockers. The cloud/GCP track stays brokered bounce-buffer authority; this does not reopen direct DMA, guest IOMMU, or direct-remapping assumptions.
-
cloud-gcp-virtio-net-nic-driver(DONE/superseded 2026-06-02 by the slice-6 billable run, see the GCE Polling Path track below) — the live legacy virtio 0.9 NIC was bound through the kernel-brokered legacy polled path, passing--require-provider-nic-proof. Honest scope:userspace_driver_authority=kernel-brokered-legacy-polled, so this closes the real-GCE bind bar without claiming L4 socket reachability, reusable multiqueue/full NIC readiness, or live-provider device-autonomous MSI-X delivery. -
cloud-gcp-storage-driver(done 2026-06-07) — the live GCE NVMe Persistent Disk path passedmake cloudboot-gcp-storage-nvme-io-read-teston run1780806087-bf69at source commit28518165518c29a48633682f4a6d9b5844c43335. Evidence identifiedstorage_interface=nvme,vendor.1ae0,device.001f,c3-standard-4,europe-west3-a, one brokered 512-byteREAD, no public IP, no service account, and complete teardown. The selected GCP path remains brokered-bounce queue-base/PRP materialization; provider-written Model B is reserved for a direct-remapping/vIOMMU or synthetic-address lane. This does not claim the older virtio-scsi PD path, Local SSD, a gVNIC datapath, or full filesystem integration. -
cloud-network-terminal-access(done 2026-05-27; path 1, serial-console shell, needs no NIC driver) — proved a reviewed cloud operator access path beyondcapos kernel startingover the GCE serial console (cloudboot-evidence: access-path serial-console-shell; real-GCE run1779868872-2424, no public IP, no service account). Paths 2/3 (TCP/Telnet) depend oncloud-gcp-virtio-net-nic-driver; path 4 (SSH) is a separate milestone. -
cloud-launch-teardown-policy-hardening(done) — hardened the cloudboot harness into the usable-instance gate:--require-provider-nic-proof, structuredprovider.jsonevidence, fail-closed launch-policy read-back, and nonzero exit on teardown failure or incomplete evidence.
Future provider slices (not required for the initial GCP usable-instance gate).
The AWS and Azure tracks are split by proof surface: standard storage
controllers (NVMe / virtio-scsi) are QEMU-emulable now, while the vendor-custom
NICs (ENA, MANA) get host-conformance gates plus a deferred live proof because
QEMU does not emulate them. The NVMe path’s shared GCP storage-provider
foundation has landed via nvme-io-queue-and-read, so the NVMe-only AWS (Nitro
EBS) and Azure (managed-disk) tracks re-scoped to a small cloud-shape
classification delta and landed (both done 2026-05-28). The virtio-scsi
alternative is not a shortcut: capOS has no userspace virtio-scsi provider
driver, and make run-virtio-blk proves the kernel-owned virtio-blk driver,
which leaves the hidden kernel DMA ownership the provider-authority acceptance
forbids — so the older-family SCSI path stays out of scope.
AWS:
-
cloud-aws-nvme-storage-driver(done) — the AWS Nitro EBS NVMe cloud-shape classification delta on the shared NVMe foundation (make run-pci-nvme;docs/devices/aws-nvme.md). Live AWS EBS evidence is the deferredcloud-aws-storage-live-proof. -
cloud-aws-ena-nic-protocol-conformance(done) — ENA protocol encode/decode incapos-lib/src/ena.rswith a host conformance suite vetted against the ENA spec / Linux driver headers. Gate:cargo test-lib(deliberate QEMU-exception; QEMU has no ENA device). -
cloud-aws-ena-nic-live-proof(blocked on conformance +cloud-gcp-virtio-net-nic-driver; deferred until AWS access) — end-to-end ENA bind/send/receive/teardown on real AWS hardware.
Azure:
-
cloud-azure-disk-storage-driver(done) — the Azure Boost managed-disk NVMe cloud-shape classification delta on the shared NVMe foundation (make run-pci-nvme;docs/devices/azure-disk.md). The older-family Hyper-V/virtio-scsi path is out of scope (azure_scsi_path=no-userspace-provider-driver-out-of-scope). Live Azure evidence is the deferredcloud-azure-storage-live-proof. -
cloud-azure-mana-nic-protocol-conformance(done) — MANA/GDMA protocol encode/decode incapos-lib/src/mana.rswith a host conformance suite vetted against the MANA Linux driver headers; provenancedocs/devices/azure-mana.md. Gate:cargo test-lib(QEMU has no MANA device). -
cloud-azure-mana-nic-live-proof(blocked on conformance +cloud-gcp-virtio-net-nic-driver; deferred until Azure access) — end-to-end MANA bind/send/receive/teardown on real Azure hardware, including SR-IOV VF revocation with fallback-to-synthetic.
Superseded umbrella records (do not dispatch):
-
cloud-aws-ena-nvme-driver— umbrella pointer to the three AWS slices above. -
cloud-azure-mana-driver— umbrella pointer to the three Azure slices above.
Cloud milestones and per-provider paths:
- First cloud milestone: imported-image serial-console boot. Closed for GCP
by run
1778230874-715a(2026-05-08) against source commit3951e275:make cloudboot-testimported thecapos-cloudboot-imagetarball, started ane2-smallwith no public IP and no service account, observedcapos kernel startingon serial, and tore down cleanly. Does not require or prove cloud NIC/block-device drivers beyond the boot path. - Second cloud milestone: GCP-first usable instance provider rollup. The
selected operator path, provider storage, and provider NIC data path are
closed by
cloud-usable-instance-provider-nic-storage: serial-console shell access on real GCE, live legacy virtio-net raw-frameprovider-nic-bound, live NVMe Persistent Disk brokeredREAD, and separate live gVNIC raw-frame / typed-Nic portability evidence. Scope split (decided 2026-06-02,network-reachable-datapath-scope-decision): the network data-path reachability sub-requirement is raw-frame TX/RX over the live NIC (GCE polling-path slices 1-4 + slice 6); the SSH/WebShell / network terminal access sub-requirement is L4 and is deferred to networking-proposal Phase C. - Add NVMe controller init (brokered admin queue pair + identify on no-IOMMU). Closed by the brokered enable / admin / IDENTIFY / interrupt-wake child chain ending 2026-05-28.
- Add NVMe I/O queue pair (submission/completion rings + doorbell writes).
Closed by
nvme-io-queue-and-readon 2026-05-28. - [~] Add NVMe read/write commands with PRP-based DMA transfers; no-IOMMU PRPs
are manager-materialized from live buffer authority. READ and WRITE are
done (see the NVMe chain above); multi-block PRP-list (
count > 8) remains. - Implement
BlockDevicefor NVMe. Done via theBlockDeviceBackend:: NvmeBrokeredread/write/flush arms (still per-proof-feature-gated for activation pending the capstone graduation). - Add QEMU NVMe metadata-only PCI testing via
-device nvme. - [~] Extend QEMU NVMe testing to cover controller init, queues, PRP DMA, and
BlockDevicebehavior. Controller/admin, I/O queue, READ/WRITE/FLUSH, andBlockDeviceread/write/flush plus dedicated data-completion interrupts over-device nvmeare covered; NLB>1 PRP-list and always-built graduation remain. - [~] GCP storage path: NVMe Persistent Disk on a third-generation GCE shape has
one live brokered READ proof (
cloud-gcp-storage-driver, run1780806087-bf69). The older virtio-scsi Persistent Disk path, Local SSD, and reusable filesystem-backed storage provider remain future work. Keep virtio-blk as a local/QEMU block-driver proof only unless a provider target explicitly exposes it. - GCP NIC path: virtio-net first where supported, then gVNIC for newer
machine families, Confidential VM paths, generation-3-or-later shapes, and
higher network performance tiers. The virtio-net raw-frame provider gate
passed on live GCE, and the gVNIC portability lane below now has live
raw-frame and typed
Nicevidence. High-throughput, multiqueue, public ingress, and first-public-Web-UI productization remain future tasks. - AWS storage path: NVMe on Nitro-backed EBS instances. Treat AWS Nitro as an NVMe storage dependency rather than a virtio-blk path.
- AWS NIC path: ENA driver, including ENA queue setup, MSI-X routing, and Nitro generation/version expectations. Do not claim AWS network support from QEMU virtio-net evidence.
- Azure NIC path: MANA driver and Mellanox mlx4/mlx5 accelerated-networking fallback awareness where Azure exposes SR-IOV VFs. Driver lifecycle must tolerate dynamic VF binding and revocation by falling back to the synthetic interface rather than assuming the VF is permanent.
Cloud Benchmark Reruns
Visible outcome: once capOS reaches a first real cloud-VM boot, rerun the current benchmark profiles on that boot path and separate cloud evidence from local QEMU/KVM evidence.
Open gates:
- Define the first supported cloud benchmark profile after the booted cloud
hardware surface is known. At minimum, rerun boot/session smokes and any
CPU-only benchmark such as
run-smp-process-scale, and laterrun-thread-scale, that does not depend on missing cloud NIC or block drivers. A GCEn2-highcpu-8-class nested-KVM host is a reasonable first CPU-only benchmark target if/dev/kvmis usable by the benchmark user. - Record provider, region, instance type, CPU topology, cloud image id, firmware/device model, nested-KVM state, QEMU CPU pinning/isolation policy, and serial-console collection method in the benchmark artifact.
- Retain provenance for the exact disk/cloud image, kernel, manifest, embedded binaries, host toolchain, and cloud image import path.
- Compare cloud-VM results with local QEMU/KVM results only as separate environments; do not replace the selected local proof gate with a cloud result unless the milestone explicitly changes.
Cloud Device Tracks – Real GCE Polling Path (decoupled from MSI-X)
Decision (2026-06-01): the real-GCE-boot milestone (userspace virtio-net driver
binding a real GCE NIC plus a reachable network data path) is decoupled from
device-autonomous MSI-X interrupt delivery. The production data path uses
polling the used ring, which already works on the non-qemu cloud kernel:
the landed cloud_virtio_net_rx_userspace_submit_proof does a real device->host
RX DMA (used_len=76) with zero interrupts, via the always-built
virtio_transport + poll_used_idx. Every TX/RX data movement and completion
in the repo is already polled; device-autonomous MSI-X remains a parallel
efficiency follow-up, not a boot blocker. The local MSI-X track is now closed:
the missing precondition was explicit PCI COMMAND memory-space/bus-master
enablement in the proof path. With pci_command=0x0107, local QEMU/KVM delivers
virtio-net RX MSI-X vector 0x50 through the guest IDT path with
int_injected=0, idt_handler_observed=true, and one deferred-EOI
acknowledgement. Live-GCE interrupt evidence remains outside the polling-path
critical path.
Production-kernel ground truth (verified): PCI/ECAM enumeration, device_manager,
the bounce-buffer DMA backend, MSI-X programming, and all three DDF grant
sources are already always-built. Still cfg(feature = "qemu")-stubbed in
production (the real gap): kernel/src/virtio.rs (legacy driver + smoltcp +
cap/network.rs TCP/UDP socket caps) → virtio_stub.rs returns
DeviceUnavailable.
Ordered slices (only the last is billable; none require interrupt delivery). Slices 1-5d are done; the legacy real-GCE blockers found in flight are all closed locally:
- RX polled-completion-no-inject local proof (done 2026-06-01) — flipped the
RX-submit proof’s completion observation from the kernel-injected dispatch
proxy to the already-latched polled used-ring state
(
make run-cloud-provider-virtio-net-rx-polled-completion). - Polled provider default manifest (done 2026-06-01) — graduated the polled
RX+TX provider off the per-proof feature into always-built
cap::virtio_net_polled_provider, staged by a manifest-observable condition (make run-cloud-provider-virtio-net-polled-provider-default). - Real-polled-driver
provider-nic-bound(done 2026-06-02) — re-pointedcap::provider_nic_bind_proof::reportso the marker fires only after the real polled provider completes a TX+RX over the live function, removing the kernel-side dispatch-slot proxy as the source. The literalsystem.cuefold remains the open remainder (make run-cloud-provider-nic-bound-real-polled-driver). - Polled teardown / stale-authority (done 2026-06-02) — ported the S.11.2 hostile-smoke discipline (DMA/MMIO/IRQ stale-authority rejection, release/reset/driver-death teardown, no host-physical export) to the real polled production provider.
- Network-reachable-datapath scope decision (done 2026-06-02) — Option A:
the milestone’s “reachable network stack” bar means raw-frame TX/RX
reachability over the live NIC, because the billable
make cloudboot-testgate checks no L4 socket round-trip. Slices 1-4 + slice 6 close that bar. L4 sockets (smoltcp +cap/network.rssocket caps offcfg(qemu)virtio.rs) are a separate future track (networking-proposal Phase C). Decision doc:network-reachable-datapath-scope-decision. 5b. [x] Legacy/transitional virtio 0.9 bind (decomposed 2026-06-02) — the real GCE NIC is a legacy/transitional virtio 0.9 device (PIO config BAR, INTx, no MMIO BAR, no MSI-X); the modern-only production polled provider returned no candidate on real GCE. Both decomposition slices landed 2026-06-02, so the local-proof acceptance is closed; the later billable slice-6 re-run also passed.cloud-prod-virtio-net-legacy-transitional-bind-local-proof.- 5b.1 [x] Legacy PIO select (done 2026-06-02) — kernel-brokered legacy PIO
config access (
pci::LegacyIoBar/pci::io_bar, scoped to the claimed I/O BAR, no ambient port authority) + legacy candidate selection with no MSI-X precondition (make run-cloud-provider-virtio-net-legacy-select,virtio-net-pci,disable-modern=on,vectors=0). - 5b.2 [x] Legacy datapath bind (done 2026-06-02) — single-PFN contiguous
virtqueue materialization (
frame::alloc_contiguous, reusing the modern ring helpers), legacy PIO notify, 10-byte legacy net header, polled TX (ARP) + RX over the legacy device with no MSI-X route (make run-cloud-provider-nic-bound-legacy). Sources exactly oneprovider-nic-boundfromreport_real_completion_legacy. 5c. [x] Legacy GCE-viable RX stimulus (done 2026-06-02) — the landed legacy proof’s RX stimulus was QEMU-SLIRP-only (spoofed ARP to10.0.2.2); replaced by a broadcast DHCP DISCOVER from the device’s real MAC (legacy config0x14), an accept-any inbound frame completion model, and a wall-clock (monotonic_ns) RX budget with an iteration-ceiling backstop. Marker carriesrx_stimulus=dhcp-discover-broadcast,eth_src=device-mac,-srcmac.<12hex>(make run-cloud-provider-nic-bound-legacy). 5d. [x] Legacy large-queue-size (landed 2026-06-02) — live GCE legacy virtio-net advertises a 4096-entry virtqueue, exceeding the proof’s defensiveMAX_LEGACY_QUEUE_SIZE = 1024. Raised to the virtio spec max 32768 (power-of-two enforced; non-power-of-two / over-bound / zero reject cleanly;alloc_contiguousfails closed without panic). QEMU caps queue size at 1024 and lockstx_queue_sizeat 256 for the non-vhost SLIRP legacy device, so the largest local shape isrx_queue_size=1024(8-page RX single-PFN vring); the full 4096-entry materialization is a real-GCE attestation (make run-cloud-provider-nic-bound-legacy-large-queue).
- 5b.1 [x] Legacy PIO select (done 2026-06-02) — kernel-brokered legacy PIO
config access (
-
cloud-gcp-virtio-net-nic-driver(reopen) — DONE 2026-06-02 (run1780412056-e1cb,e2-small,europe-west3-a, source commit1fb65683): the real GCE boot bound the live legacy virtio 0.9 NIC (00:04.0,1af4:1000) through the kernel-brokered legacy polled path and passed--require-provider-nic-proof. The full 4096-entry vring materialized on real hardware for the first time (rx_vring_pages=28contiguous), the real GCE device MAC was read (src_mac=42:01:0a:c8:00:12), a broadcast DHCP DISCOVER was transmitted, and a real device->host RX DMA completed within the TSC-governed wall-clock budget (rx_used_len=532 ethertype=0x0800). Closes the GCE Polling Path track and retires thecloud-gcp-virtio-net-nic-driverblocker. The billable run was authorized on 2026-05-27 and recorded at commit2aaeaa53; durable evidence is summarized in the completed task entry below. Dispatched ascloud-prod-gce-billable-boot-real-polled-nic-bound. To re-run the billable bind: build the cloudboot image from the legacy manifestsystem-cloud-provider-virtio-net-legacy-datapath.cue(not the modernsystem-cloud-provider-nic-bound-real-polled-driver.cue; the literalsystem.cuestages no provider), confirmmake run-cloud-provider-nic-bound-legacygreen on the build commit, thentools/cloudboot/run-test.sh --require-provider-nic-proof.
Real-Filesystem Track (2026-06-02)
The real-filesystem direction is decided in
Real-Filesystem Decision:
a role-split, not one on-disk format. capOS-managed state stays capnp-native
(CAPOSWF1/CAPOSST1, evolved not replaced; crash-consistency already proven by
make run-storage-writable-recovery); host-populated/interop images gain
read-only FAT32 via the fatfs no_std crate; a single host capnp image tool
retires the per-format tools/mkstorage-*.py byte-offset hazard. ext4-read is
deferred behind an explicit trigger (“must read a disk capOS did not format”);
FAT write is rejected (no crash-consistency story).
Landed: read-only FAT32 over virtio-blk (kernel/src/cap/fat_fs.rs, vendored
vendor/fatfs-no_std/, make run-storage-fat-read, storage_fat_read feature
on the existing read_only_fs_root source; provenance docs/devices/fat32.md),
and read-only FAT32 over the graduated NVMe read arm (the Nvme BlockSource
arm + deferred FatMount, cloud_fat_read_over_nvme_proof,
make run-cloud-provider-fat-read-over-nvme). See docs/tasks/done/2026-06-02/
and done/2026-06-03/.
Open (next): the real-FS slice chain continues with FAT-over-NVMe follow-ups
and timestamps/provenance on CAPOSST1/CAPOSRO1 where those layouts expose
time metadata. FAT32 now surfaces valid host-authored directory-entry timestamps
over both virtio-blk and NVMe through schema-stable File.stat values, with
proof logs labeling the source as FAT metadata rather than trusted wall-clock
custody. The capnp-native storage smokes and installable-system seeded
variants now use the Rust host capnp image tool as the maintained fixture path;
the retired Python capnp-layout fixture scripts are no longer referenced by the
local proofs. The FAT image path stays on real mkfs.fat / mcopy tooling.
ext4-read stays deferred behind its explicit trigger.
Phase C / L4 Track Opened (relocation, post raw-frame GCE proof) (2026-06-02; refreshed 2026-06-07)
The L4 socket reachability track — relocating the virtio-net driver and
smoltcp into userspace processes (networking-proposal Phase C), sequenced after
the cloud milestone per the
network-reachable-datapath scope decision
(Option A) — is designed in
Phase C Userspace NIC Driver Relocation.
It is no longer waiting on a new security ruling: the selected-write
common-config and DMA-address export pieces landed through the bounded Phase C
slices, reusing the accepted notify-doorbell discipline and the landed
bounce/IOVA-export DMA isolation posture. The lower-layer blocker for Web UI on
a GCE instance is production L4 plus live IPv4 configuration. The full
boot-resource UI bundle is separate parallel work: it is ready and should close
before claiming a useful public Web UI, but it is not the raw NIC/L4 blocker.
Current task chain:
cloud-prod-nic-driver-userspace-clean-tx-rx-split-local-proofis Phase C slice 6 (DONE 2026-06-03). It removed the last coupled raw-frameNic.receiveself-stimulus.cloud-prod-userspace-network-stack-smoltcp-local-proofis Phase C slice 7c-ii(b) (DONE 2026-06-07). It locally proves the selected serve-from-userspace architecture: the non-qemucloudboot manifest starts a userspace smoltcp network-stack service, the service spawns an application client with onlyConsoleplus a servedTcpListenAuthority, and the client completes one hostfwd TCP request/response through a servedTcpListenerandTcpSocket. The armed path now receives socket authority from the userspace smoltcp service for this proof rather than extending the legacy kernelcap/network.rs/virtio_stub.rssocket owner. The selected design is recorded in the Phase C proposal’s 7c-ii Mechanism and Decomposition section.cloud-prod-legacy-kernel-network-socket-path-retirementis done. Non-qemuproduction manifests now reject legacy kernelnetwork_manager/tcp_listen_authoritygrants, so the armed socket route stays behind the userspace network-stack service; remaining kernel socket grants are qemu-only fixtures.cloud-prod-phase-c-kernel-smoltcp-virtio-net-removalis done. It removes the kernelsmoltcpdependency, retires the qemu-only kernel TCP/UDP runtime behind fail-closed socket entry points, and leaves the remaining virtio-net code as lower-layer QEMU fixture evidence rather than production cloud socket ownership.cloud-prod-network-stack-dhcp-ipv4-config-local-proofis done. It follows the served-socket proof and locally proves DHCP/IPv4 lease acquisition, default-route installation, ARP/neighbor resolution, and userspace-servedNetworkManager.getConfigstatus needed by a GCE-hosted listener.- Network Usability and Post-smoltcp
decomposes the follow-on usability lanes: operator status tooling, DHCPv4
renewal/rebind/expiry/status beyond the first config proof, system
DnsResolver, POSIXgetaddrinfo, ping/ping6 diagnostics, socket readiness/cancel/backpressure, packet trace authority, and transport policy/status. These are not first public Web UI blockers except for the already-listed DHCP/IPv4 config proof. remote-session-self-served-full-ui-bundleis done and provides the reviewed fixed-name boot-resource operator bundle for follow-on Web UI proofs.cloud-prod-remote-session-web-ui-l4-local-proofnow consumes the done userspace L4 and DHCP/IPv4 config proofs; it provesremote-session-web-uilocally on the non-qemucloudboot socket path.cloud-gce-legacy-virtio-webui-serving-local-proofis done (2026-06-11 04:26 UTC), proved bymake run-cloud-gce-legacy-virtio-webui-serving. It closes the local legacy-datapath serving gap: a persistent kernel-brokered legacy virtio 0.9 polled runtime (cap::virtio_net_legacy_datapath_proof::legacy_nic_runtime, kernel featurecloud_gce_legacy_virtio_webui_serving_proof) backs the same typedNiccap the modern path serves, and the Phase C userspace network stack plusremote-session-web-uiserve the fixed UI bundle to a host HTTP peer over the GCE NIC shape (disable-modern=on, no MSI-X), byte-verified against the committed bundle pin with a singlecloudboot-evidence: legacy-virtio-webui-servingmarker. PIO/vring ownership stays kernel-side; no host-physical, IOVA, queue, or port-I/O authority crosses the cap boundary. This closes only the LOCAL serving story – it does not claim private GCE reachability.cloud-gce-private-self-hosted-webui-proofis on hold (2026-06-09). Its local prerequisites are done, and the legacy-datapath Web UI serving story is now locally proven (2026-06-11 04:26 UTC, above), but it still shares the missing firewall IAM / default-deny ingress blocker recorded oncloud-gce-private-icmp-echo-proof: the cloudtest credential cannot create firewall rules, so a private probe cannot reach the instance. It keeps the current no-public-IP cloudboot posture and requires a private probe that crosses the live GCE NIC under an explicit billable-run authorization.cloud-gce-public-webui-ingress-tls-policy-designis done and records the selected ingress, TLS/certificate, firewall/source, browser session, and teardown policy for public exposure work.cloud-gce-public-self-hosted-webui-ingress-tlsis blocked on the private proof; public operator access is a separate exposure slice that implements the recorded ingress/TLS policy.cloud-prod-phase-c-kernel-smoltcp-virtio-net-removalis the done Phase C exit cleanup after userspace L4 was proven. It is not the first GCE Web UI proof, and it does not claim private GCE reachability, public ingress, or TLS.
Networking diagnostics and stack-completeness follow-ups:
cloud-prod-icmp-echo-reply-local-proofis done (2026-06-08). It consumes the done userspace L4 and DHCP/IPv4 config proofs, acquires a local DHCP lease, proves a same-subnet ARP plus ICMP Echo Request / Echo Reply exchange that preserves identifier, sequence, and payload, and rejects malformed or oversized requests with a bounded per-poll budget. This is diagnostics, not Web UI readiness.cloud-prod-icmp-echo-reply-real-nic-datapath-local-proofis done (2026-06-08), proved bymake run-cloud-prod-icmp-echo-reply-real-nic-datapath. The done local responder proof above runs smoltcp over an in-processQueuePhyDevice: it injects the inbound Echo Request in-process and uses the realNiccap only for the DHCP lease and ARP probe, so no inbound ICMP traversesNic.receivePoll/Nic.transmit. The live GCE NIC is legacy virtio 0.9 (no userspace driver authority), so an inbound Echo Reply over the real NIC needs a kernel-owned responder on the legacy datapath. This task built that responder (cap::virtio_net_legacy_datapath_proof::run_icmp_echo_reply_real_nic_datapath) and locally proved it: a host peer over a QEMUsocketnetdev (not SLIRP, which drops inbound host->guest ICMP Echo) drives DHCP, ARP, multiple malformed Echo Requests (rejected;icmp_malformed_drops>=1), then a valid one, and the kernel answers an RFC 792 Echo Reply over the real RX/TX vrings, emittingcloudboot-evidence: icmp-echo-reply-real-nic-datapath <token>withrx_inbound_provenance=real-nic-rx-vring/in_process_queuephydevice=absent.cloud-gce-private-icmp-echo-proofis blocked (2026-06-09) on GCP firewall IAM. Its harness, GCE-importable image (make capos-gce-private-icmp-echo-cloudboot-image), and probe orchestration (tools/cloudboot/run-test.sh --require-private-icmp-proof) are implemented and pre-spend-validated locally, and a real billable run (1780962265-4a2e) proved the GCE datapath: capOS DHCP-leased the exact GCE-assigned IP10.200.0.38over the live legacy virtio 0.9 NIC and emittedcloudboot-evidence: icmp-echo-reply-real-nic-datapath-ready, with the probe pinging that IP during capOS’s responder window. The pings showed 100% loss because GCE default-denies ingress and the cloudtest service-account credential lackscompute.firewalls.create/.delete/.list, so no temporary ICMP rule could be created; all resources tore down cleanly. Unblock by granting those firewall permissions to the cloudtest credential or pre-provisioning a persistent allow-ICMP rule in the cloudtest VPC network, then re-runningmake cloudboot-gce-private-icmp-echo-test. It proves private same-VPC ping over the live NIC with no public ICMP exposure and should not become a public HTTPS Web UI closeout condition unless a later ingress policy explicitly chooses ICMP health checks.
IPv6 Support Lane, Non-Blocking For First Public Web UI
The current Web UI cloud path is deliberately IPv4-first: Phase C userspace L4,
DHCP/IPv4 configuration, ARP, private GCE reachability, and reviewed public
HTTPS ingress remain the required blockers for the first public proof. IPv6 is
not a reason to hold that path. It is a separate network-stack capability lane
because the old qemu-only runtime remains IPv4-only and the legacy
kernel-owned non-qemu socket fallback is retired; the Phase C userspace
service path now carries the explicit address-family ABI. Private GCE IPv6
reachability and public IPv6 ingress policy remain unproven. Local ICMPv6 Echo
Reply, GCE-style DHCPv6 configuration, and IPv6 TCP listener/connect behavior
now have bounded local proofs.
The task chain is:
cloud-prod-ipv6-architecture-status-groundingis done (2026-06-03). It recorded the explicit current-state audit and the non-blocking decision, then unblocked the address-ABI task.cloud-prod-network-address-abi-ipv6is done (2026-06-03, the lane’s entry point). The socket/interface address ABI now represents IPv4 and IPv6 explicitly throughIpAddressFamilyand a documented address-length contract:getConfigreports the family plus anipv6Supportedflag, and the IPv4-only stack rejects IPv6 with a distinctipv6Unsupportedclass and malformed lengths withmalformedAddress, source- compatible for existing 4-byte IPv4 callers. Proofmake run-cloud-prod-network-address-abi-ipv6.cloud-prod-ipv6-link-local-nd-local-proofis done (2026-06-08). It enables the local smoltcp IPv6 feature set, installs a link-local address, verifies all-nodes plus solicited-node multicast joins, and proves a bounded Neighbor Solicitation / Neighbor Advertisement exchange plus cached-peer UDP egress locally. Proofmake run-cloud-prod-ipv6-link-local-nd.cloud-prod-ipv6-ra-slaac-local-proofis done (2026-06-08). It proves Router Solicitation, Router Advertisement acceptance, SLAAC address configuration, default-route installation, invalid-RA rejection, and prefix/default-route expiry locally. Proofmake run-cloud-prod-ipv6-ra-slaac.cloud-prod-ipv6-dhcpv6-gce-config-local-proofis done (2026-06-08). It proves a local GCE-shaped DHCPv6 Solicit / Advertise / Request / Reply exchange, installs the assigned/128, keeps default-route provenance tied to Router Advertisement, and rejects wrong source, wrong port, transaction-id, identifier, oversized-option, lease-lifetime/timer, and timeout cases. Proofmake run-cloud-prod-ipv6-dhcpv6-gce-config.cloud-prod-icmpv6-echo-reply-local-proofis done (2026-06-08). It proves bounded local ICMPv6 Echo Request / Echo Reply handling through the Phase C userspace smoltcp substrate, including identifier, sequence, payload preservation and checksum, type/code, address-family, and oversized-input rejection. It is diagnostics and stack completeness, not Web UI readiness.network-ping6-diagnostics-tool-local-proofis done (2026-06-08). It proves a bounded local ping6-style diagnostic over the smoltcp ICMP socket path, including link-local scope reporting, configured global address status, malformed-reply drop, timeout/unreachable classification, one bounded retry after the neighbor-discovery timer, payload bounds, and one-outstanding-request enforcement. It remains diagnostics only and does not change the IPv4-first Web UI critical path or authorize public IPv6 ingress.cloud-prod-ipv6-tcp-l4-local-proofis done (2026-06-08). It proves TCP listener and connect behavior through the production socket contract with IPv6 endpoints. Proofmake run-cloud-prod-ipv6-tcp-l4.cloud-prod-ipv6-real-nic-datapath-local-proofis ready. The done IPv6 proofs above run smoltcp over an in-processHarnessPhyDevicepeer (markers self-declaremetadata_only=true/public_ingress=not-attempted) and use the realNiccap only for MAC/link status; the real-NIC TX/RX datapath exists today only for IPv4 (cloud-prod-network-stack-dhcp-ipv4-config-smoke). This task builds the IPv6 DHCPv6/RA + probe datapath over the real bound NIC and proves it locally, emittingcloudboot-evidence: ipv6-real-nic-datapath <token>.cloud-gce-private-ipv6-reachability-proofis on-hold on missing GCP IAM access. The real-NIC IPv6 datapath proof above is now done, but its live-GCE acceptance fundamentally requires a dual-stack subnet (so the GCE NIC receives an IPv6 assignment at all) plus an IPv6 ingress firewall rule for the same-VPC probe. The cloudtest service-account credential lackscompute.networks.create/compute.subnetworks.*(the only existing cloudtest subnet is IPv4-only) andcompute.firewalls.create/.delete/.list, so neither can be provisioned. Unblock by granting those permissions, or by pre-provisioning a dual-stack subnet plus an IPv6 ingress rule in the cloudtest VPC scoped to the probe. See the on-hold record for the consolidated blocker analysis and the parkedcodex/cloud-gce-private-ipv6-reachability-proofharness checkpoint.cloud-gce-public-ipv6-ingress-tls-policy-updateis blocked on the private IPv6 proof, then updates the selected public Web UI ingress/TLS policy for DNS/AAAA, IPv6 firewall, TLS coverage, and teardown before any public IPv6 exposure.
Non-blocking GCE gVNIC portability lane:
cloud-gce-gvnic-protocol-grounding-device-mapis done. It landed the GCE gVNIC provenance map from the Google Cloud gVNIC docs and the Google/Linux GVE driver documentation: PCI identity (0x1ae0:0x0042), BAR/admin-queue/MSI-X wire subset, GQI/DQO formats, QPL/RDA addressing, and the planned DDF (DeviceMmio/DMAPool/DMABuffer/Interrupt) authority mapping. No capOS gVNIC driver or QEMU model exists yet.cloud-gce-gvnic-image-launch-inventory-proofis done. It requestsGVNICimage/instance launch posture, reads the GCE image/instance policy back, proves serial PCI inventory for the1ae0:0042function with BAR and MSI-X metadata, and records that no gVNIC driver bind was claimed. The live run used a private no-public-IP/no-service-account VM and completed teardown.cloud-gce-gvnic-adminq-register-proofis done. It builds a proof-only cloudboot image, maps the live GCE gVNIC BAR0 throughDeviceMmio, allocates manager-owned bounce-buffer DMA pages for the admin queue and descriptor, issues oneDESCRIBE_DEVICEcommand, releases the admin queue, and checks staleDeviceMmio/DMAPool/DMABufferhandles. The live privateGVNICrun completed teardown and recorded no userspace host-physical/IOVA export and no provider NIC bind.cloud-gce-gvnic-raw-frame-tx-rx-proofis done. It builds a proof-only cloudboot image, configures one GQI/QPL TX queue and one RX queue over the live GCE gVNIC, sends one DHCP DISCOVER raw Ethernet frame from the device MAC, receives one inbound IPv4 frame, destroys queues, unregisters QPLs, deconfigures resources, releases/resets the admin queue, and records no providerNicbind claim.cloud-gce-gvnic-nic-cap-adaptation-proofis done. It adapts the proven GQI/QPL queue path behind the existing typedNicsemantics and emitsgvnic-nic-cap-adaptationevidence with inline-frame TX/RX, MAC/link metadata, hidden queue addresses, no host-physical or IOVA export, and noprovider-nic-boundclaim. It remains a portability/future-machine-family lane; the first public Web UI proof can stay on the already-proven GCE virtio-net path.
Certificates / TLS Backlog
Bounded implementation slice chain for the certificates/TLS track. It
decomposes Certificates and TLS
into dispatchable slices and is owned by the Certificates / TLS track in
docs/tasks/README.md. The dispatchable records live under
docs/tasks/; this file is the long-form decomposition and sequencing rationale.
Grounding
- Certificates and TLS
– the schema surface and Phase 1-9 ordering. Phases 1-2 are the near-term
target; Phase 1 is
Certificate/CertificateChain/TrustStore/CertVerifierover a RAM-onlywebpki-rootsstore and arustls-webpkiverifier. The Phase 2 client-only local proof now completes a TLS 1.3 handshake over a userspace-servedTcpSocketcap withembedded-tls; the server/config service surface remains future Phase 2 work. - Cryptography and Key Management
– partial implementation. The minimal
SymmetricKey,PrivateKey, andPublicKeyABI, RAM-only XChaCha20+HMAC/P-256 key cores, RAM-onlyKeyVaulthandle custody, and development-only softwareKeySourcebootstrap exist for local proofs. There is still no persistence or production custody source, so production/public TLS and ACME remain blocked on a reviewed source that can mint key handles without exposing raw private-key material. - Time and Clock Authority
–
WallClockPhase 1 landed (88cf4b5d): a read cap withwallTimeand aClockProvenancelabel, but the fixed-boot-base source reportsUntrusted. Cert-validity (notBefore/notAfter) and OIDCexp/iatcompare against it. Host-tested verify logic passes an explicitatEpochSecondsand needs no live clock; security-grade validity against an adversarial clock wants the trusted-provenance upgrade (WallClock Phase 2). Recorded as a sequencing dependency on the live consumer slices, not on the host verifier slice. - Phase C Userspace NIC Driver Relocation
– the userspace
TcpSocketcap the TLS stack wraps arrives via Phase C slice-7 (cloud-prod-userspace-network-stack-smoltcp-local-proof). The TLS stack is a userspace consumer of that cap and must not move into the kernel.
Sequencing Rationale
The proposal’s suggested shape (library -> handshake -> cert caps -> consumer) is reordered to land the lowest-risk real logic first, grounded in what exists:
- The verifier path (TrustStore + CertVerifier over webpki-roots) needs
no socket and no private key – it is pure
no_std + allochost-testable logic. It lands before the handshake. - A TLS client handshake needs a
TcpSocketcap but no private key. - A TLS server (the Web UI consumer) needs a
KeyVault-issuedPrivateKeyhandle and a server cert source, so it remains the most-blocked terminal slice.
Slice Chain
- Vendor the Phase-1 verifier crates.
rustls-webpki+webpki-rootsas static-pinned no_std+alloc snapshots withVENDORED_FROM.mdprovenance, recorded underdocs/trusted-build-inputs.md, proved to build for the bare-metal target. Slice 3 later selectedembedded-tlsfor the local client proof’s no_std TLS state machine; the broader server/config service stack remains future work. Certificate/TrustStore/CertVerifier(Phase 1). Schema additions plus host-tested verify logic overrustls-webpkiseeded bywebpki-roots, with chain verification proved against committed good/bad vectors and explicitatEpochSeconds. No running cap service, no socket, no key – the lowest-risk real cert logic.- Client TLS handshake over
TcpSocket(Phase 2, client-only). Done 2026-06-08. A userspace process completes one TLS 1.3 client handshake over the Phase C userspaceTcpSocketcap, validating the peer chain with the slice-2 verifier, with an observable local QEMU proof. The no_std determination selected a vendoredembedded-tls 0.19.0client state machine for this local proof rather than fullrustls. - capOS-terminated Web UI endpoint (terminal consumer). Serves the Web UI
over capOS-held TLS as a direct-termination successor after the first GCE
public Web UI proof closes through provider-terminated HTTPS. Deeply blocked:
needs a
KeyVault-issuedPrivateKeycap and a server cert source (ACME / provisioned). - Minimal key-custody decomposition. Done. It decomposes the missing
PrivateKey/KeyVault/KeySourcesubset into the three implementation records below, keeping production hardware/cloud custody out of the local TLS/ACME bootstrap. PrivateKey/PublicKeyRAM signing proof. Done 2026-06-04. Adds the minimal asymmetric-key ABI and host-tested RAM signing core: sign/public/info, public verify/export/info, purpose metadata, and no raw private-key export.- RAM
KeyVaultcustody. Done 2026-06-05. Adds handle-based key generation/open/list/destroy and a local QEMU proof for TLS and ACME account key handles, still RAM-only and not production custody. - Development-only
KeySourcebootstrap. Done 2026-06-05. Grants local proofs a development software key source that mints key handles without putting raw private keys in manifests, images, logs, task records, or evidence, and is rejected for production/public profiles. - ACME account/order local proof. Done 2026-06-08.
capos-tlsnow has a no_std+alloc ACME account/new-order/CSR-finalize/certificate-retrieval core, with ES256 JWS signing throughAcmeAccountPrivateKeycaps and CSR signing through a separate TLS-purposePrivateKeycap. Challenge validation stays fake or pre-authorized here; the proof does not call Let’s Encrypt staging or production. - Scoped
http-01solver. Done 2026-06-09.capos-tlsadds a bounded, token-scopedHttp01ChallengeSolverand the RFC 8555http-01authorization flow (pending order, authorization fetch, key-authorization derivation via the RFC 7638 account-key thumbprint, challenge response, out-of-band validation, and cleanup).remote-session-web-uiserves only/.well-known/acme-challenge/<token>for currently-published tokens through that same solver; retired, unknown, sub-path, and traversal tokens fail closed (404). The host ACMEhttp-01test proves the protocol and cleanup; the Web UI L4 QEMU proof fetches the challenge through the served origin. It grants no generic route, static-file, DNS, or Web UI authority and adds no public CA call. CertificateStore.watchrenewal and rotation. Proves local renewal with short-lived test certificates, storing the fresh chain under a stable handle and rotating the Web UI TLS server without restart.- Public GCE Let’s Encrypt direct-termination proof. A separately reviewed successor after the provider-managed first public proof. It requires a public DNS name controlled for the run, explicit billable/public-ingress authorization, and explicit authorization before any Let’s Encrypt production call; staging remains the default external CA target.
Let’s Encrypt / ACME Public TLS Decomposition
Let’s Encrypt support is implementable for the public TLS milestone only as the capability-native, capOS-terminated successor path. It is not the already selected closeout path for the first public GCE Web UI proof. That first proof continues to terminate HTTPS at the GCP external load balancer with a provider-managed certificate, no capOS private-key custody, and no raw public HTTP closeout.
The missing prerequisites are represented as named task records:
- Minimal
PrivateKey/PublicKeyABI and RAM signing proof for TLS server keys and ACME account keys (crypto-privatekey-publickey-ram-signing-local-proof). - RAM-only
KeyVaultcustody for TLS and ACME private-key handles (crypto-keyvault-ram-privatekey-custody-local-proof). - Development-only software
KeySourcebootstrap for local TLS/ACME proofs, rejected for production/public profiles (crypto-development-keysource-tls-acme-bootstrap-local-proof). - A TLS client over the userspace
TcpSocketcap (cloud-tls-client-handshake-over-tcpsocket-local-proof). - Server-side TLS and
TlsServerConfigforremote-session-web-ui(cloud-tls-self-hosted-webui-terminated-endpoint). - An RFC 8555
AcmeClientaccount/order/finalize path against a local Let’s Encrypt-compatible directory (cloud-tls-acme-account-order-local-proof). - A scoped
http-01challenge solver under the Web UI service boundary (cloud-tls-acme-http01-challenge-solver-local-proof). CertificateStore.watchrenewal and TLS-chain rotation without a Web UI restart (cloud-tls-acme-renewal-certstore-rotation-local-proof).- Public DNS/name control plus explicit billable/public-ingress authorization
before a real GCE run, and explicit CA authorization before any Let’s Encrypt
staging or production request
(
cloud-gce-public-webui-letsencrypt-direct-termination-proof).
Local proofs and public CA/cloud proofs stay distinct. The ACME account/order, challenge, and renewal slices use a local RFC 8555-compatible directory and local QEMU/cloudboot paths. A public GCE/Let’s Encrypt run requires a separately authorized harness mode, a controlled public DNS name, public-ingress teardown evidence, and no private key material in manifests, images, logs, task records, or evidence directories.
Next Gap
Slices 1 and 2 landed on 2026-06-03: rustls-webpki and webpki-roots are
vendored as static-pinned no_std+alloc snapshots, and capos-tls contains the
Phase 1 Certificate / TrustStore / CertVerifier host verifier proof over
those crates. K1 landed on 2026-06-04: capos-tls also contains the minimal
RAM-only P-256 PrivateKey / PublicKey signing core. K2 landed on
2026-06-05: RAM-only KeyVault generation/open/list/destroy handle custody for
those keys. K3 landed on 2026-06-05: local development software KeySource
bootstrap now mints TLS and ACME account key handles without raw private-key
material in manifests or evidence and rejects production/public profiles.
Capability-infrastructure key-cap reconciliation landed on 2026-06-06: the
minimal RAM-only SymmetricKey ABI and local AEAD/MAC proof now exist. Slice 3
landed on 2026-06-08: the local QEMU proof now completes one TLS 1.3 client
handshake over a userspace-served TcpSocket cap and validates the peer chain
with capos-tls. ACME slice 5 landed on 2026-06-08: capos-tls now proves
account registration, order creation, CSR finalize, and returned-chain parsing
against a local RFC 8555-style directory using purpose-scoped key caps. ACME
slice 6 (proposal item 10) landed on 2026-06-09: the scoped http-01 solver now serves bounded
/.well-known/acme-challenge/<token> responses through remote-session-web-ui,
with the http-01 authorization/validation/cleanup flow proven host-side and the
served route proven in the Web UI L4 QEMU proof. The next ACME gap is renewal and
certificate-store rotation (slice 11). The next server-side TLS behavior gap
remains the Web UI consumer, still blocked on reviewed server key custody and a
certificate source. The behavior chain then
advances slice-by-slice – each kernel/lib-first with a local proof – until the
Web UI consumer slice can add a separately reviewed direct-termination successor after
cloud-gce-public-self-hosted-webui-ingress-tls
closes with provider-terminated HTTPS. The key-custody local-proof precursor is
now complete for PrivateKey / PublicKey, RAM KeyVault, and development
KeySource; production custody remains future.
Installable System Backlog
Detailed decomposition of Installable System: an installed, persistent capOS that boots from disk and keeps mutable system configuration across reboots, composed with the immutable boot manifest.
docs/tasks/README.md links here. Installable System became the selected
milestone after the Device Driver Foundation closeout and is now closed for the
bounded local/QEMU installable-system contract. The behavior track below landed
through item 8, and item 9 reconciled the proposal/body wording to the landed
install, provision, update, and rollback contracts. This milestone does not pull
public L4 ingress, AWS/Azure live support, direct-remapping production hardware,
userspace smoltcp/L4 readiness, secure boot/signing, or production release
authority into selected scope.
Landed Foundations (What This Builds On)
These contracts exist today and the track decomposes against them, not against the proposal’s projected shapes. Present tense is landed behavior.
| Building block | Landed contract | Source |
|---|---|---|
BlockDevice | readBlocks/writeBlocks/info/flush over a real cfg(qemu) virtio-blk device; blockDevice grant source | kernel/src/cap/block_device.rs; proof make run-virtio-blk |
| Read-only filesystem | CAPOSRO1 fixed superblock at LBA 0; Directory.list/open/sub + File.read/stat; mutating methods fail closed; readOnlyFsRoot grant source | kernel/src/cap/readonly_fs.rs; proof make run-storage-fs |
Persistent content-addressed Store | CAPOSST1 superblock at LBA 0; put/get/has/delete keyed by SHA-256 hash; superblock rewrite is the durable commit point; survives reboot; persistentStore grant source | kernel/src/cap/persistent_store.rs; reboot proof make run-storage-persist |
| Writable filesystem | CAPOSWF1 superblock at LBA 256; full Directory mutation set + File.write/truncate/sync/close; fail-closed single-writer policy; writableFsRoot grant source | kernel/src/cap/writable_fs.rs; reboot proof make run-storage-writable |
| Co-located storage image | One disk co-locates CAPOSST1 Store (LBA 0) and CAPOSWF1 filesystem (LBA 256) so both survive reboot together | tools/mkstore-image --writable |
| Read-only packaged-image source fixture | QEMU-gated read-only Directory/File over the booted CD-ROM ISO 9660 /boot/bins/ tree (boot_iso ATAPI reader); installable_image_source grant source; physically scoped to the ATAPI medium, cannot reach the writable target disk; not a general post-bootstrap filesystem service | kernel/src/cap/installable_image.rs; proof make run-installable-image-source |
| Default bootable disk image | Single hybrid BIOS+UEFI raw image with one GPT ESP (FAT32) carrying Limine, kernel, a name-only manifest.bin, and Limine module payloads under /boot/bins/; make image, run-disk, run-disk-bios; GCP/AWS provider packaging. | tools/mkdiskimage.sh, tools/package-cloud-image.sh; proof make run-limine-disk-boot-modules |
Namespace | RAM-backed resolve/bind/list/sub name-to-hash bindings; not persistent; namespace grant source | kernel/src/cap/namespace.rs |
| Boot manifest / init | Baseline boot loads the manifest module. Default raw disk and cloudboot images resolve ordinary service binaries from checked Limine modules; default ISO build/run/smoke paths resolve ordinary service binaries from the name-only ISO /boot/bins/ tree through boot_iso; the reserved init selector still uses the kernel-embedded init ELF. run_init parses SystemManifest, builds init’s bootstrap caps, and enters initConfig.init. The installable data-region path additionally reads and validates system/config/overlay.bin when the data region mounts and the base manifest declares matching extension points | kernel/src/main.rs, kernel/src/boot.rs, proof make run-installable-overlay; make run-smoke |
Divergences From The Proposal (Structural Reconcile Closed)
The proposal was written before the storage prerequisites landed and projects shapes that differ from the landed contracts. The initial reconcile task recorded the landed storage contracts and placement decisions before the behavior track ran; the behavior track then landed through item 8 below. The structural docs reconcile task updated the proposal’s structure and body wording to the landed install, provision, update, and rollback contracts without broadening selected scope.
- On-disk layout. The proposal projected three partitions (boot / system /
data) on the installed disk. The base
make imageraw boot image still produces a single hybrid BIOS+UEFI image with one GPT ESP, and the co-locatedCAPOSST1Store+CAPOSWF1writable filesystem remains available as a separate data-region image for the focused storage and early data-region smokes. The landed installable-system disk path no longer stops there: task 5 (installable-bootable-disk-system-data-regions) folds that co-located data-region image into the bootable disk as GPT partition 2 at the fixed data-region LBA, somake run-installable-diskboots from one disk carrying both the ESP and the persistent data region. Tasks 2-4 describe the separate auto-mounted data-disk model they originally built on; item 5 is the landed integrated single-disk packaging. - Persistent naming. The proposal stores the overlay “under a well-known
system
Namespaceroot (system/config/<generation>)”. The landedNamespaceis RAM-only and does not survive reboot. Persistent naming and theactive/known-good pointers are therefore grounded in the landed writable filesystem (CAPOSWF1paths and small marker files) plus the content-addressed persistentStore(CAPOSST1) for immutable generation objects. No new persistent-Namespacekernel cap is assumed. - Generations and epochs. No landed
Store,Namespace, orSystemManifestschema carries a system-generation/epoch field today (other caps such asAccountRecordand the DDF revocation generations do, but not the installable-system path). The monotonic epoch + content hash the proposal relies on for stale-write rejection and rollback are carried inside the overlay’s own capnp object and the writable-filesystem marker files, not by extendingStoreorNamespace. - Composition machinery. Landed in track item 3 (2026-05-26): the
SystemConfigOverlaycapnp object plusSystemManifest.extensionPoints, and init’s read/decode/validate/compose with base-pins-win / overlay-adds-within- declared-extension-points / no-new-authority precedence and fail-closed base-floor fallback (make run-installable-overlay). Generation/rollback selection of which overlay is active landed in task 4, and the install/provision/update/rollback flows landed in tasks 6-8.
Ordered Track
Each item names its acceptance and the landed APIs it builds on. Dispatchable
records live under docs/tasks/ with the same ids.
-
installable-system-proposal-reconcile(docs). Done 2026-05-26. Reconciledinstallable-system-proposal.mdto the landed contracts above: single hybrid ESP boot image (no three projected partitions), persistent config region grounded in the writable filesystem + content-addressed persistentStore, RAM-onlyNamespace(naming/pointers are writable-filesystem files), and no system-generation field on theStore/Namespace/SystemManifestpath. Recorded the data/system region placement decision. The structural proposal/body wording update is closed by item 9. Builds on: all five storage/boot prerequisites. -
installable-data-region-boot-mount(behavior). Done 2026-05-26 02:02 UTC. Wires the persistent data region (the co-locatedCAPOSST1Store+CAPOSWF1writable filesystem) into the boot path: under theinstallable_data_regionkernel feature,run_initbest-effort callscap::grant_data_region, which mounts the auto-attached data disk, scopes a writableDirectoryto thesystem/configsubtree (writable_fs::mount_config_root), and grants init thatDirectory(data-config) plus the persistentStore(data-store) under well-known CapSet names – granted together or not at all. It fails closed wholesale to the base manifest (caps unchanged, “no data region; base floor” diagnostic) when the disk is absent, has a malformed superblock, or is missing thesystem/configroot; the disk caps stay out of the manifest so the fail-closed boots do not abort on a mandatory cap. No new kernel cap type or schema change. Proof:make run-installable-data-regionboots the same ISO three times – seeded disk (init prints the resolvedsystem/configcontents), no disk (base floor), and zeroed-superblock disk (base floor). Builds on:writable_fs.rs,persistent_store.rs, the co-located image tool (--seed-config), andrun_init. -
installable-config-overlay-schema-and-merge(behavior, schema). Done 2026-05-26 02:55 UTC. Added the persistentSystemConfigOverlaycapnp object (overlay version, monotonic epoch, SHA-256 content hash, additional services, network/runtime settings, account-store location) and theSystemManifest.extensionPointsdeclared extension points (ManifestExtensionPoints: additional-services allowance/count, allowed service caps,minOverlayEpoch, settings allowances – closed by default). Init now readssystem/config/overlay.binfrom the granted config region, decodes it (SystemConfigOverlay::from_capnp_bytesre-validates the overlay version and content hash), and composes it over the base plan (compose_onto): base pins win (no base-service-name orinitcollision), overlay services name only base-shipped binaries and request onlyallowedServiceCaps(no new authority classes),epoch >= minOverlayEpoch, count<= maxAdditionalServices, and carried settings require their allowance. Any schema-invalid, version-mismatched, content-hash-mismatched, stale-epoch, extension-point-violating, or missing overlay is rejected whole; init boots the base floor and surfaces[init] overlay rejected: <reason>. Host encodertools/mkmanifestmkoverlaybin emits the overlay bytes (filling the canonical hash);tools/mkstore-image --writable --seed-overlayseedssystem/config/overlay.bin. Proof:make run-installable-overlayboots the same featured ISO three times – valid overlay (theoverlay-extraservice runs), base-pin collision (rejected, base floor preserved), corrupt overlay (rejected, base floor). Did NOT add generations/rollback (task 4). Builds on: task 2,capos-configmanifest validation,schema/capos.capnpSystemManifest. -
installable-system-generation-rollback(behavior). Done 2026-05-26 03:41 UTC. Userspace-only over the already-granted persistentStore+ writablesystem/configDirectory; no schema or kernel change. Represents system-config generations as content-addressedStoreobjects keyed by SHA-256 (immutable, deduped), tracks the known-goodactivepointer and a staged/attemptingcandidatepointer as monotonic-pointer-epoch marker files (gen-active/gen-candidate) in the writable config region, records a boot attempt durably before applying a candidate, and auto-falls-back to the known-good generation when a candidate is left unconfirmed (a boot that does not reach the health checkpoint) – the brick-proofing guarantee. Also promotes a confirmed candidate, rolls config back to a retained prior generation (monotonic pointer advance pointing at older content), and rejects a stale/replayed (lower-or-equal-epoch) pointer.init/src/main.rsrun_generation_rollback_checks, gated by a base service namedgeneration-proof, exercises all of this end-to-end against the real durable primitives with observable[gen] ...assertions. Did NOT add the installer (task 6) or update flow (task 8). Proof:make run-installable-generationboots a--seed-configdisk twice – boot 1 exercises the full mechanism in one boot and durably leaves an unconfirmedattemptingcandidate; boot 2 re-reads the committed markers from a fresh mount and proves across-reboot auto-fallback to the known-good generation. Builds on: task 3 and the persistentStore/writable filesystem durability. -
installable-bootable-disk-system-data-regions(behavior). Done 2026-05-26 04:31 UTC. One integrated bootable disk now carries the boot ESP (GPT partition 1) and the co-locatedCAPOSST1Store+CAPOSWF1writable data region (GPT partition 2), and boots through the landed task 2-4 path reading the data region from the same disk it booted from – not a separate smoke-only drive.tools/mkdiskimage.shgained--data-image/--data-offset-bytes(it folds thetools/mkstore-image --writableimage into a second GPT partition) and derives the ESP size from--esp-sectors(the integrated disk uses the same 128 MiB ESP as the raw disk-image targets so a debug kernel fits). The kernelinstallable_diskfeature (impliesinstallable_data_region) adds a fixed data-region base LBA (cap::data_region_base_lba= 264192) applied at the singlepersistent_store/writable_fsread_range/write_rangechoke points; the kernel trusts that fixed tool/kernel layout contract rather than parsing the GPT, exactly as the superblock LBAs already are. Proof:make run-installable-diskbuilds one disk (boot ESP + seeded data region carrying a valid config overlay) and boots it as a single virtio-blk device; the gate isdata region: mountedfrom the boot disk plus[overlay-extra] started via overlay– a service only the data region supplies – not a clean boot alone. Did NOT add the installer (task 6). Builds on:mkdiskimage.sh,tools/mkstore-image --writable, task 4.
5a. ddf-multi-virtio-blk-device-support (behavior, DDF milestone). Lift the
single-virtio-blk limit (per-device driver instance, DMA pool key, interrupt
route, PCI claim) and add a target-disk BlockDevice/Store grant source
scoping a cap to a specific device. Owned by the Device Driver Foundation
milestone (docs/backlog/hardware-boot-storage.md “Reusable Block-Device
Path”); the install flow depends_on it. Builds on: the landed
device-agnostic transport seam and per-queue-keyed DMA ledger.
5b. installable-userspace-image-source (behavior, DONE 2026-05-26
08:15 UTC). Expose a userspace-readable read-only packaged-image source so a
userspace installer can read the packaged boot/system bytes. Chosen shape:
a QEMU-gated read-only Directory/File cap over the existing boot_iso
ISO 9660 reader (kernel/src/iso/), not the boot-package payload
alternative and not a general post-bootstrap filesystem service. The
installable_image_source grant source (KernelCapSource @45) mounts the
booted CD-ROM ISO 9660 /boot/bins/ tree and serves Directory.list/open
File.read/stat; every mutating method fails closed. It is physically scoped to the ATAPI CD-ROM medium and cannot reach or mutate the writable virtio-blk target disk (blockDeviceTarget/writableFsRoot) – that write authority belongs to the install flow (task 6). Offsets/lengths are validated against the file extent before any device access, reusing the driver’s in-bounds checks; a past-EOF read clamps to empty and an absent name is rejected. Broader package browsing remains userspace-service work rather than an expansion of this fixture. Proof:make run-installable-image-source(kernel cap modulekernel/src/cap/installable_image.rs; consumer demodemos/installable-image-source/; manifestsystem-installable-image-source.cue; harnesstools/qemu-installable-image-source-smoke.sh). Builds on: task 5, theboot_isoISO 9660 reader.
-
installable-system-install-flow(behavior). Done 2026-05-26 10:12 UTC. Thecapos-system-installuserspace service (demos/installable-system-install/) installs a bootable capOS onto a blank target disk using only two granted caps: the read-onlyinstallable_image_sourceDirectoryover the booted CD-ROM/boot/bins/and the target-scopedblock_device_targetBlockDeviceselected by manifest PCI identity, never the boot disk. It copies the packaged bootable boot-region head (BOOTHEAD.BIN: protective MBR + primary GPT + the FAT ESP with Limine + release kernel + base manifest) to LBA 0, writes the backup GPT (BOOTGPT.BIN) at the LBA read from the primary GPT header (Limine validates it), and initializes an empty data region (DATAIMG.BIN: emptyCAPOSST1Store +CAPOSWF1filesystem with just thesystem/configdirectory) at the fixedcap::data_region_base_lba. It validates every sector range before writing and verifies the read-back. The empty data region is the install floor; the operator’s first non-empty config generation is provisioning (task 7), not install. Proof:make run-installable-install– pass 1 installs into the manifest-selected virtio-blk target disk; pass 2 boots that disk standalone (no CD-ROM) and reaches the base service with its data region mounted (data region: mounted+[init] data-region mounted: system/config entries=0+[console-paths] Console paths ok.), not a clean boot alone. The build packages the boot region split into head + backup GPT (tools/split-boot-region.py) so the installer reads only the populated ~15 MiB over the slow ATAPI PIO path rather than the whole FAT32 ESP. Did NOT add provisioning (task 7) or update/rollback (task 8). Builds on: task 5, theBlockDevicesector path, content-addressedStore, and the precursors 5addf-multi-virtio-blk-device-supportand 5binstallable-userspace-image-source. -
installable-system-provision-flow(behavior). Done 2026-05-26 11:09 UTC. Thecapos-system-provisionuserspace service (demos/installable-system-provision/) runs as PID 1 over an installed system’s persistent data region and performs the proposal’s “Provision” flow, holding only three caps: aConsole, the writable filesystem root (writable_fs_root, navigated tosystem/config), and the content-addressed persistentStore(persistent_store). On a disk whosesystem/configcarries no active generation yet (the empty install floor task 6 leaves), it writes the operator’s first non-emptySystemConfigOverlaygeneration (epoch 1: an operatorAccountRecordstored as a content-addressedStorerecord named from the overlay’saccountStoreLocation, a hostname, a log level, and one additional service), commits the generation object to theStore, writessystem/config/overlay.bin(the shape init’sapply_config_overlayconsumes, proven by task 3), and advances thegen-activepointer. It dispatches on durable state: a second boot of the same disk re-reads thegen-activepointer, resolves the generation object and operator account from theStore, and verifies the provisioned account/settings are the active durable config that survived the reboot. Did NOT add the update/rollback flow (task 8); reuses the overlay object and the existingAccountRecordschema with no schema change. Proof:make run-installable-provisionboots the same--empty-configdisk twice – pass 1 provisions and commits, pass 2 verifies the active generation + operator account + settings survived; a clean boot alone is not the gate. Builds on: task 6 (the empty install floor), task 3 (the overlay object and merge), task 4 (the generation/gen-activerepresentation), and the writable-filesystem + persistent-Storedurability. -
installable-system-update-rollback-flow(behavior). Done 2026-05-26 11:35 UTC. Thecapos-system-updateuserspace service (demos/installable-system-update/) performs the proposal’s “Update” flow on top of the landed generation/rollback mechanism (task 4), userspace-only over the same three caps provision holds (Console,writable_fs_rootnavigated tosystem/config, persistentStore); no schema or kernel change. It writes a newSystemConfigOverlaygeneration into the content-addressedStoreas a new root hash (old generation objects remain; the shared operatorAccountRecorddedups), stages it as anattemptinggen-candidatepointer without advancing the known-goodgen-activepointer, and on the next boot commits by advancingactiveonly when the candidate reaches its health checkpoint – otherwise the boot-attempt-vs-confirmed auto-fallback keeps the prior known-good. The overlay re-validation against the new base reuses the productionSystemConfigOverlay::compose_ontoagainst a base plan whose extension points revoked the overlay’s authority, so an update whose new base no longer admits the overlay falls back to the base floor with a surfaced error rather than applying. The data region (operator account + active config) is carried across every transition. It dispatches on durable state (update-phasemarker) so commit-on-success and auto-fallback are both proven across a REAL reboot, not one process. Proof:make run-installable-updateboots the same--empty-configdisk THREE times – boot 1 provisions known-good gen1, rejects an overlay against a revoked-cap new base (kept base floor), and stages a healthy candidate gen2; boot 2 commits gen2 across the reboot and stages a failing candidate gen3; boot 3 auto-falls-back from gen3 across the reboot to known-good gen2 – distinct per-generation content hashes and a stable account hash on every line, the staged/commit/fallback/ marker-survival/data-region-carried assertions, not a clean boot alone. Builds on: task 6 (the install floor), task 7 (the provision/overlay shape), task 4 (the generation/gen-active/gen-candidaterepresentation), and task 3 (overlay compose/validation). -
installable-system-structural-doc-reconcile(docs-status). Done 2026-06-07 18:20 UTC through commit12b8334a(committed 2026-06-07 18:19 UTC). Reconciled Installable System structural and body wording to the landed local/QEMU data-region, overlay, generation, install, provision, and update/rollback contracts. Preserved the RAM-onlyNamespacecaveat and kept secure boot/signing, production release authority, public ingress, AWS/Azure live support, direct-remapping production hardware, userspace smoltcp/L4 readiness, and full durable account policy out of the closed Installable System scope.
Design Grounding
- Installable System – the design being decomposed.
- Hardware, Boot, and Storage – the Local Disk Storage, Writable Local Storage, and Bootable Disk Image milestones the landed prerequisites came from.
- Local Users, Storage, and Policy – the account store, a consumer of the persistent-config region.
- Run Targets, Init Mandate, and Default-Run Integration – the init mandate and boot-manifest policy any installed-system boot path must respect.
- Storage and Naming
– the
Store/Namespace/Directory/Filemodel, content-addressing, and attenuation the persistence layer reuses. - Manifest and Service Startup – the immutable base manifest the persistent overlay composes with.
Cloud Image Import And Serial-Console Boot
Operator notes for importing the locally-boot-proven hybrid BIOS+UEFI disk
image into GCP and AWS and reaching a serial-console boot. This is packaging and
documentation only: tools/package-cloud-image.sh operates on a local artifact,
adds no provider credentials, and performs no live cloud calls. Cloud NIC and
storage driver readiness remain separate, blocked tracks
(docs/backlog/hardware-boot-storage.md “Cloud Device Tracks”); the first cloud
milestone is an imported-image serial-console boot, not a driver claim.
Local Artifacts
make image builds target/capos-image.raw (default 256 MiB, GPT, 128 MiB
hybrid ESP + Limine MBR) and make run-disk / make run-disk-bios prove it
boots under OVMF (UEFI) and SeaBIOS (legacy BIOS). Only run the import steps
below once those local boot proofs pass; provider import only makes sense for a
known-good image.
make package-cloud-image (or package-gcp-image / package-aws-image)
repackages that artifact into target/cloud-image/:
| Provider | Output | Shape |
|---|---|---|
| GCP | disk.raw.tar.gz | disk.raw grown to a whole multiple of 1 GiB, GPT backup relocated, inside a gzip tar --format=oldgnu archive |
| AWS | capos-aws.raw | RAW (exact image size) |
| AWS | capos-aws.vhd | fixed VHD (conectix footer, disk-type fixed) |
| AWS | capos-aws.vmdk | stream-optimized VMDK |
The helper self-verifies each shape (gzip + oldgnu tar member for GCP; the VHD
fixed-disk footer and VMDK create-type for AWS) and fails if a conversion is
wrong. It differs from make capos-cloudboot-image, which builds a from-scratch
10-GiB GCE disk for the tools/cloudboot/ end-to-end harness; the packaging
helper repackages the small, already-boot-proven make image artifact instead.
GCP Custom-Image Import
GCP custom-image import requires a single file named exactly disk.raw, sized
to a whole multiple of 1 GiB, in a gzip tar --format=oldgnu archive – exactly
the disk.raw.tar.gz the helper emits.
make package-gcp-image
gsutil cp target/cloud-image/disk.raw.tar.gz gs://<your-bucket>/capos-disk.tar.gz
gcloud compute images create capos-hybrid \
--project=<your-project> \
--source-uri=gs://<your-bucket>/capos-disk.tar.gz \
--guest-os-features=UEFI_COMPATIBLE
UEFI_COMPATIBLE lets the image boot through the GCE UEFI path
(/EFI/BOOT/BOOTX64.EFI); the same image still carries the Limine MBR for the
legacy boot path. After creating an instance from the image, read the boot
landmark on the serial console:
gcloud compute instances create capos-test \
--project=<your-project> --image=capos-hybrid
gcloud compute instances get-serial-port-output capos-test \
--project=<your-project> | grep 'capos kernel starting'
The reference end-to-end GCE serial-console-boot flow (build, upload, import,
boot, evidence capture, teardown) is the tools/cloudboot/ harness; see
tools/cloudboot/README.md.
AWS VM Import
AWS VM Import/Export accepts RAW, fixed VHD, and stream-optimized VMDK. Upload one shape to S3 and import it as a snapshot, then register an AMI:
make package-aws-image
aws s3 cp target/cloud-image/capos-aws.vhd s3://<your-bucket>/capos-aws.vhd
aws ec2 import-snapshot --disk-container \
Format=VHD,UserBucket="{S3Bucket=<your-bucket>,S3Key=capos-aws.vhd}"
# after the snapshot import task completes, register an AMI from the snapshot:
aws ec2 register-image --name capos-hybrid \
--architecture x86_64 --root-device-name /dev/xvda \
--boot-mode uefi \
--block-device-mappings \
'[{"DeviceName":"/dev/xvda","Ebs":{"SnapshotId":"<snap-id>"}}]'
Boot-mode notes
The hybrid image boots either firmware path, so the AWS boot mode is a deployment choice, not an image rebuild:
--boot-mode uefi– Nitro-based instance types boot the ESP/EFI/BOOT/BOOTX64.EFI(Limine UEFI). Recommended on modern Nitro instances.--boot-mode legacy-bios– older/legacy instance types boot the Limine MBR path. Use this only if the target instance type does not support UEFI.
uefi-preferred is also valid and lets the instance type decide. RAW
(Format=RAW, file capos-aws.raw) and stream-optimized VMDK
(Format=VMDK, file capos-aws.vmdk) import the same way; choose the container
your upload path prefers. AWS rounds the EBS volume size up to whole GiB on
import, so the RAW shape is not pre-rounded.
Scope Boundary
These notes cover import + serial-console boot only. They do not enable cloud
NIC or storage drivers, do not automate live cloud runs, and add no new trusted
build inputs beyond format conversion of the already-pinned-Limine image. The
provider-NIC/storage and cloud usable-instance tracks remain blocked in
docs/backlog/hardware-boot-storage.md.
Local Users, Storage, And Policy Backlog
Design and task decomposition for manifest-seeded and disk-backed local user
management. This work belongs to the User Identity, Sessions, And Policy track
and depends on capability-native storage reaching at least a RAM-backed
Store/Namespace proof before durable account mutation is meaningful.
Grounding
This decomposition is grounded in the current capability, identity, manifest, storage, and authority-broker documents:
docs/capability-model.mddocs/architecture/manifest-startup.mddocs/proposals/user-identity-and-policy-proposal.mddocs/proposals/userspace-authority-broker-proposal.mddocs/proposals/storage-and-naming-proposal.mddocs/proposals/oidc-and-oauth2-proposal.mddocs/proposals/cryptography-and-key-management-proposal.mddocs/security/trust-boundaries.mddocs/roadmap.mddocs/tasks/README.md
Relevant prior-art research files:
docs/research/eros-capros-coyotos.mddocs/research/genode.mddocs/research/plan9-inferno.mddocs/research/sel4.mddocs/research/zircon.md
Design Position
user remains a human-facing policy term, not a kernel subject. The kernel
should not learn uid, role, group, tenant, or external-claim semantics.
Account records, roles, attributes, labels, profiles, and federation claims
decide which capabilities a trusted broker may mint or delegate; they are
never independent authorization tokens.
Terms
The identity vocabulary should be precise enough that later schemas do not accidentally recreate Unix users.
- Principal: the stable identity key used by auth, policy, audit, and ownership metadata. A principal can represent a human, service, guest, anonymous caller, deployment, or pseudonymous external subject.
- User: a user-facing category for a principal/session that represents a
human or human-adjacent actor. It is not a kernel object, not a UID, and not
an authority source. In design text, prefer
principal,session, oraccountwhen one of those is meant. - Account: a durable local record for a principal. It binds credential references, status, roles, attributes, storage roots, quotas, and default policy/resource profile names. Some principals have no account: anonymous callers, some guests, and some one-shot external sessions.
- Profile: a named policy template selected by account data, manifest seed
data, external admission rules, or service configuration. A profile contains
no authority by itself. It selects bundle fragments, quotas, ABAC defaults,
labels, and approval eligibility that the broker may use when minting actual
capabilities. Use
policy profileorresource profilewhen the narrower meaning is intended; use plainprofileonly for prose that intentionally covers both. - Policy profile: the authorization template: roles, ABAC defaults, allowed bundle fragments, approval paths, label defaults, and external admission constraints.
- Resource profile: the quota and default-resource template: storage, memory, CPU share, process/thread/cap limits, IPC limits, log volume, network posture, and launcher posture.
- Session: a live authenticated, guest, anonymous, or external context. It has freshness, expiry, source, auth strength, audit identity, and a selected policy profile plus resource profile. A session receives capabilities; an account does not run.
- Session liveness cell: mutable trusted session-manager state behind the
immutable process
SessionContext. It records whether the session islive,logged_out,revoked,expired, orrecovery_only, plus session and policy epochs used by renewal and grant decisions. - Role: an RBAC label attached to accounts or sessions. It is used by a broker to decide eligibility for bundle fragments or leased grants. It is not authority after the corresponding cap is absent.
- Workload: a process or supervised subtree launched with a concrete CapSet. It may carry session/account metadata for audit and policy, but it runs with capabilities, not as a user.
There are three account and admission sources:
- Manifest seed accounts: immutable or append-only bootstrap records in the boot package. These create first local operators, recovery identities, service identities, emergency guest policy, and initial policy bundles.
- Local account store: mutable disk-backed account, credential, role, attribute, quota, and resource-profile records. This is the normal source for durable local accounts after storage is available.
- External identity admission and bindings: OIDC, passkey, cloud, deployment, or certificate-backed principals mapped to system policy profiles or existing local accounts. External claims are inputs to ABAC and account binding; they do not grant local authority by themselves.
Manifest seed data should be sufficient to boot, recover, unlock storage, and create or repair the local account store. It should not become a permanent mutable account database. Disk state should be authoritative for ordinary accounts after the account store is initialized, with explicit versioning, rollback detection, and recovery import/export.
Account Model
The first durable data model should be small and cap-shaped:
struct AccountRecord {
recordId @0 :Data;
principalId @1 :Data;
kind @2 :PrincipalKind;
displayName @3 :Text;
status @4 :AccountStatus;
credentialRefs @5 :List(Data);
roles @6 :List(Text);
attributes @7 :List(Attribute);
resourceProfile @8 :ProfileRef;
policyProfile @9 :ProfileRef;
homeRoot @10 :StorageRootRef;
createdAtMs @11 :UInt64;
updatedAtMs @12 :UInt64;
schemaVersion @13 :UInt32;
storeEpoch @14 :UInt64;
recordVersion @15 :UInt64;
policyEpoch @16 :UInt64;
previousHash @17 :Data;
contentHash @18 :Data;
}
struct ProfileRef {
profileId @0 :Data;
versionId @1 :Data;
epoch @2 :UInt64;
}
struct StorageRootRef {
storageServiceId @0 :Data;
rootObjectId @1 :Data;
rootKind @2 :StorageRootKind;
schemaVersion @3 :UInt32;
rootVersion @4 :Data;
}
enum StorageRootKind {
namespace @0;
}
enum AccountStatus {
active @0;
disabled @1;
locked @2;
recoveryOnly @3;
}
struct ResourceProfile {
profileId @0 :Data;
versionId @1 :Data;
epoch @2 :UInt64;
homeQuotaBytes @3 :UInt64;
tempQuotaBytes @4 :UInt64;
processLimit @5 :UInt32;
threadLimit @6 :UInt32;
capLimit @7 :UInt32;
memoryCommitLimitBytes @8 :UInt64;
frameGrantLimitPages @9 :UInt64;
endpointQueueLimit @10 :UInt32;
inFlightCallLimit @11 :UInt32;
retired12 @12 :UInt32; # was pending IPC submission quota; do not reuse
ringScratchLimitBytes @13 :UInt64;
logQuotaBytesPerWindow @14 :UInt64;
networkProfile @15 :Text;
cpuBudgetUsPerWindow @16 :UInt64;
cpuWindowUs @17 :UInt64;
timerWaiterLimit @18 :UInt32;
launcherProfile @19 :Text;
}
homeRoot is a persistent reference that the account/storage broker resolves
into a live Namespace capability at session-bundle time. It is not itself a
capability, not a path, and not a raw Directory. The first implementation
should use capability-native Namespace as the account home source of truth;
Directory is a compatibility projection returned by a filesystem or POSIX
adapter when a workload needs file-like APIs. storageServiceId names the
trusted storage service instance, rootObjectId names the stored namespace
root within that service, rootKind keeps the record extensible while v1 only
accepts namespace, and schemaVersion lets future storage-root encodings
fail closed.
External identities should bind to accounts through explicit records:
struct ExternalIdentityBinding {
bindingId @0 :Data;
provider @1 :Text; # oidc issuer, cloud provider, cert authority
subjectHash @2 :Data; # hash(provider kind, issuer, tenant, subject)
principalId @3 :Data; # local or pseudonymous principal
tenant @4 :Text;
acceptedClaims @5 :List(Text);
expiresAtMs @6 :UInt64;
policyProfile @7 :ProfileRef;
resourceProfile @8 :ProfileRef;
schemaVersion @9 :UInt32;
storeEpoch @10 :UInt64;
recordVersion @11 :UInt64;
policyEpoch @12 :UInt64;
previousHash @13 :Data;
contentHash @14 :Data;
}
Claims such as OIDC groups, acr, amr, tenant IDs, device posture, source
network, and token age are ABAC inputs. They must be normalized before use and
discarded or refreshed when stale.
Gate 0 schema-plan decisions are recorded in
docs/proposals/user-identity-and-policy-proposal.md: durable account records
belong in a separate account-store schema/service slice, while UserSession
keeps only session/profile summaries and opaque broker result handles. Durable
joins use fixed opaque binary IDs rather than display names. Disk-backed
records require schema versions, monotonic store and record versions, policy
epochs, previous hashes, content hashes, and compare-and-set mutation
preconditions. Recovery import from manifest seed data is additive and
conservative: preserve validated IDs, disable stale bindings, avoid automatic
authority widening, and emit audit records or stay in bounded emergency mode.
Default Session Resources
The default resource bundle for a session backed by a local account should be useful but narrow:
terminal: the foregroundTerminalSessionfor this login.session: read-onlyUserSessionorSessionContextfor audit identity, auth freshness, and display.home: read-writeNamespaceorDirectoryscoped to the user’s home root.config: read-write user config namespace, separated from application data.cache: bounded user cache namespace with eviction policy.tmp: bounded per-session temporary namespace deleted at logout or expiry.logs: read-only view of this user’s own session logs plus a write-only application log sink.launcher: restricted launcher for approved applications and demos.approval: client for requesting broker-reviewed grants.credentials: self-service credential update interface that never exposes verifier material.keyring: scoped secret unwrap/use interface for this user’s data classes, not raw global key export.status: read-only system status with sensitive device and security state redacted unless a role grants more.
No entity should receive implicitly unbounded consumption of limited system
resources. Every default bundle needs an associated ResourceProfile covering
at least memory, CPU share, storage bytes, process/thread/cap counts, endpoint
queue state, in-flight calls, network posture, and log volume. Ring
submissions remain fixed-bound by ring depth and dispatch budget instead of a
profile quota. This backlog can name the requirement, but the general
resource-accounting model should be a separate design proposal because it
applies to users, services, guests, anonymous callers, drivers, storage,
network stacks, and test workloads.
Default guest resources should be explicitly weaker: terminal, session, ephemeral tmp/home, restricted launcher, self-contained logs, tight memory and CPU quotas, and low process/thread/cap limits. Guests should not receive durable home storage, persistent credentials, network listeners, service management, or administrative approval paths unless policy names that exception.
Anonymous remote sessions should receive almost nothing: a login/account- creation path, optional read-only documentation/help caps, the minimum auxiliary state needed for the protocol, tight memory quota, low CPU share, short expiry, and no default shell, home, launcher, network listener, durable namespace, or broad service cap. Authentication or explicit account creation is the normal path from anonymous to durable authority.
External sessions should be admitted only by explicit configuration. The
configuration either maps the external subject to an existing local account, or
permits auto-creation of a pseudonymous/tenant-scoped account with a named
policy profile and resource profile. A federated login may receive a durable
namespace only when an ExternalIdentityBinding or auto-creation rule maps it
to a local principal and the provider assertion is fresh enough for that
profile.
Service accounts should receive no terminal and no interactive bundle. Their default resources are measured-binary launch authority, service-specific state namespace, log writer, bounded network or IPC caps, and supervisor-approved credential/keyring usage.
Roles
Roles are bundle selectors and approval eligibility, not authority by themselves. The first role set should be conservative:
guest: interactive temporary session with no durable storage.local-user: normal local account; owns a home/config/cache profile.developer: may launch development tools, read own build logs, and request scoped test network/client caps.storage-admin: may inspect and repair selected storage services and quota records, but cannot read user homes or unwrap user keys by default.net-operator: may request leased network-stack and listener management caps for named services.service-operator: may restart or inspect named services through init-owned supervisor caps.security-auditor: may read selected audit/security logs but not user private content.account-admin: may create, disable, lock, and bind accounts; cannot read credential verifier material or user homes.policy-admin: may update role, ABAC, and label policy after fresh strong authentication; cannot directly mint end-resource caps.recovery-operator: manifest-seeded break-glass identity with local-console and storage-recovery constraints.system-updater: may update trusted boot packages, policy schema, and service packages through measured update workflows.service-account: non-interactive role profile constrained by measured binary, supervisor, and service name.
External groups should not be imported as roles automatically. A binding rule may map a provider group to a local role only for a named tenant/provider, with expiry, audit, and conflict handling.
Permission Rules
Initial rules should be expressed in terms of cap bundles and wrappers:
- No session receives raw
ProcessSpawner, rawFrameAllocator, broadDeviceManager, or unrestrictedStoreAdminby default. homegrants are owner-scoped. Sharing returns attenuated sub-namespace or file capabilities through a broker and records the grant in audit state.configwrites are allowed for user-owned preferences. Security-relevant changes such as credential policy, role bindings, and external identity bindings require broker approval and fresh authentication.- Credential services expose verify, enroll, rotate, disable, and recovery operations. They never return password hashes, PHC verifier blobs, private passkey material, or raw MFA secrets to ordinary sessions.
- Keyring caps expose use or unwrap operations scoped to a data class and session. Exportable key material requires a separate explicit backup grant.
- Storage-admin repair caps should operate on volume metadata, namespace integrity, quota ledgers, and snapshots. They should not imply decrypt/read authority for user content.
- Network listener authority is opt-in. Normal users may receive client network
caps by profile; listeners require
net-operator, service policy, or an application-specific grant. - Service management is named and leased.
service-operatorgrants must name the service or service group and should not include arbitrary spawn authority. - External identity sessions are denied local administrative roles unless a local binding explicitly allows that provider, tenant, principal, role, auth-strength, and expiry.
- Disabled or locked accounts may authenticate only to recovery flows that are explicitly allowed by account state.
- Role changes, external binding changes, policy changes, and recovery actions emit audit events with principal, session, source, previous value, new value, policy version, and approval grant.
RBAC And ABAC Split
RBAC should answer coarse questions:
- Which default bundle profile does this session receive?
- Which approval requests is this session eligible to make?
- Which service, storage, or account-management roles can appear in a grant?
ABAC should narrow or deny based on context:
- auth strength and authentication age,
- local console vs remote terminal vs browser companion,
- external provider, tenant, normalized claims, and token freshness,
- session age, account state, recovery mode, and boot mode,
- requested capability interface, method class, target object owner, target sensitivity/integrity label, quota impact, and lease duration,
- service package measurement and supervisor identity for service accounts.
The broker should return capabilities, wrapper caps, leases, or denials. A
plain PolicyDecision.allowed = true is not authority and must not be usable
outside the broker/minting path.
Username-Aware Local Password Login
The current shell-led login command is username-aware for the local password
path as of 2026-04-30 02:18 UTC: it prompts username> before hidden
password>, sends an account/principal selector plus proof/source metadata to
SessionManager.login, and lets SessionManager choose the account-owned
credential reference before minting a session. This is still a bootstrap
implementation over one console verifier; disk-backed credential records and
multi-verifier account storage remain future work.
Status 2026-05-01 08:47 UTC: default password-authenticated local operator
sessions mint with expiresAtMs = u64::MAX; the shell renders that as
expires_at_ms=never. Manifests that set a non-default operatorMs still
exercise wall-clock expiry for focused stale-session proofs.
The target console UX is:
loginprints generic login text and promptsusername>.- The shell reads the account name with visible echo and bounded line length.
- The shell prompts hidden
password>only after a submitted username. - All denials print the same
authentication denied.text, regardless of whether the account name is missing, disabled, locked, recovery-only, profile-incompatible, or the password proof is wrong. - Setup remains explicit. A fresh-image
setuppath must either create the first local operator-kind account name or clearly state which volatile compatibility account owns the credential.
The implementation should change the request shape before adding user-visible multi-account behavior:
SessionManager.loginshould carrymethod, an account/principal selector, proof bytes, and source metadata. For the password path, the selector is a normalized local account name or opaque account ID; proof bytes remain the submitted password until a challenge/response verifier exists.SessionManagerverifies the bootstrap console password only after the selected manifest/default account owns credential referenceconsole-password. FutureCredentialStorerecord APIs should preserve the same account-owned reference rule without exposing whether the selector, account, credential reference, or verifier failed.SessionManageruses the account store as the source of principal ID, display name, principal kind, policy profile, resource profile, account status, and credential references when seed account data exists. The no-store fallback accepts the normalized manifest operator seed account name when one exists, and retainsoperatoronly as the bare compatibility default when no seed account exists.- Default password-authenticated local operator sessions should not use fixed wall-clock expiry as their normal lifecycle. They should end through explicit logout, terminal/connection/process-tree close, or administrator revocation; configured hard maxima remain opt-in policy for proofs or deployments that require them.
AuthorityBroker.shellBundlecontinues to derive the shell bundle from the minted session’s policy and resource profiles; it must not trust the typed username as authority after session minting.
Audit and redaction rules are part of the contract:
- Failed pre-auth attempts record only a terminal-local event ID, source class, generic password-denied or password-unavailable reason, auth method, and volatile flag. They leave principal, account, profile, and session fields blank so account enumeration is not possible through logs.
- Successful login records stable principal/session/profile metadata from the minted session, not from raw username text. The password proof, verifier, credential reference secret, and full terminal line never appear in audit, kernel logs, QEMU transcripts, or panic text.
- Wrong username and wrong password should have indistinguishable terminal text and audit shape except for terminal-local event IDs and timing/backoff.
Migration from the existing seeded operator password is explicit:
kernelParams.consolePasswordVerifierPhcmaps to a manifest seed account namedoperatorwith a stable credential reference such asconsole-passwordwhen no richer seed account owns that verifier.- If the manifest already declares a seed account with that credential reference, the verifier belongs to that account and no synthetic account is created.
- The shell accepts
operatoras the username for migrated manifests. A wrong or unknown username follows the same denial path as a wrong password. - Setup-created credentials remain volatile until disk-backed account storage lands; the prompt and audit record must keep saying so.
- Documentation and smoke transcripts should stop treating a bare password as sufficient identity once the username-aware flow lands.
Required proof coverage for the first implementation slice:
make run-loginandmake run-smokeprompt forusername>before hiddenpassword>, acceptoperatorplus the existing demo password, and reject a wrong username and wrong password with identical terminal denial text.make run-login-setupcovers first-credential setup and then username-aware login for the resulting account or the default migration account.make run-local-usersproves manifest-backed operator account lookup, resource/profile inheritance, and account-status denial without exposing account existence in failed audit records.- Host tests cover manifest migration from
consolePasswordVerifierPhcto the default seed account, duplicate credential-reference rejection, and normalized account-name lookup.
Ordered Backlog
Gate 0: Grounding And Schema Plan
- Update the identity proposal with this manifest/disk/external account model once the current Telnet milestone no longer owns serial focus.
- Update identity docs to use
principal,account,session, andprofileconsistently, reservinguserfor human-facing prose. - Publish the terminology in user-facing mdBook pages, not only this
backlog. At minimum update
docs/overview.md,docs/capability-model.md,docs/proposals/user-identity-and-policy-proposal.md, anddocs/proposals/oidc-and-oauth2-proposal.md, anddocs/security/trust-boundaries.mdso readers encounter the same terms from normal documentation entry points. - Decide whether account records live in the existing user-identity schema slice or a separate account-store schema slice.
- Define stable IDs for local principals, external bindings, resource profiles, storage root references, and policy versions.
- Define rollback and version checks for local account-store records.
- Add design notes for how recovery imports a damaged or missing local account store from manifest seed data.
- Write the cross-cutting limited-resource and quota proposal before treating any guest, anonymous, local-account, service, or external profile as complete.
Gate 1: Manifest Seed Accounts
- Extend boot/init config with manifest seed accounts, service accounts, resource profiles, and initial role bindings.
- Validate that seed account names, principal IDs, roles, resource profiles, and credential references are unique and resolvable.
- Reject manifests that grant ordinary users privileged kernel caps directly instead of broker-mediated policy inputs.
- Add host tests for duplicate principals, missing resource profiles, invalid bootstrap roles, and service-account/binary mismatches.
- Add a QEMU smoke that boots a manifest-seeded local operator and proves the session receives only the expected default bundle.
Gate 2: AccountStore And ResourceProfile Services
- Add
AccountStoreReaderandAccountStoreManageruserspace interfaces for lookup, create, disable, lock, role binding, external binding, and profile updates while keeping read and mutation authority separate. - Add
ResourceProfileReaderandResourceProfileManageruserspace interfaces, keeping mutation authority separate from session reads. - Implement a RAM-backed prototype for account records and resource profiles before durable storage.
- Add broker integration that assembles default local-account, guest,
anonymous, external, and service-account bundles from account/profile
records.
- [x] Add the config-side default bundle planner over account/profile
records for a follow-up
AuthorityBrokerCapwiring slice. - [x] Wire the bootstrapAuthorityBrokerCapshell-bundle path to the config-side planner for manifest-backed local/operator sessions. - [x] Add manifest-backed guest identity/planner wiring for shell bundles and QEMU proof coverage without preserving a bootstrap guest fallback. Guest sessions now require an explicit manifest seed, guest shell bundles receive no default service endpoints, and guest launchers are empty unless a resource-profile launcher posture names a narrow proof binary. - [x] Add a local-users QEMU proof that the initial anonymous shell bundle is minimal before password login. - [x] WireSessionManager.sshPublicKeythroughRamAccountStoreso SSH-minted sessions inherit account-status enforcement. SSH denial causes are exposed as stableauth=audit codes (ssh-account-missing,ssh-account-disabled,ssh-account-locked,ssh-account-recovery-only,ssh-account-lookup-failed, plus the existing key/signature/ profile codes); failed records keepprincipal/profileblank by policy. End-to-end QEMU proof of the account-status denial paths waits forAccountStoreManagerCapas a kernel cap source (Gate 2 follow-up below). - [x] Migrate local password login planning into schema and proof work: add a username-awareSessionManager.loginselector, move password verification to account-owned credential references, preserve anti-enumeration audit/terminal behavior, and keep the existing single seeded operator password as an explicitoperatoraccount migration path. Implemented 2026-04-30 02:18 UTC as prioritized ad-hoc work; durable multi-verifier credential storage remains future Gate 2 account-store work. - [ ] AddAccountStoreManagerCapandResourceProfileManagerCapas kernel cap sources so a focused QEMU demo can disable an account and proveSessionManager.sshPublicKeyrejects withauth=ssh-account-disabled. This is also the prerequisite for external-binding admission tests below. - Complete mutable session lifecycle methods before treating short session
expiry as production shell UX. The first
live/logged_outliveness cell andUserSession.logoutpath is implemented forSessionManager-minted sessions, including explicit remote DTO gateway logout and owned-session connection-close propagation. Remaining work includes owner-shell exit, terminal close, administrator revocation, renewal/recovery, full audit reason separation, and in-flight endpoint result cancellation after logout.SessionManager.renewshould extend or rotate a session only after account status, auth freshness, policy/resource profile epochs, requested duration, and revocation state pass. Renewal must mint fresh grant leases or wrappers when policy needs a new decision and must not silently revive stale ordinary grants. - Add host tests proving account-admin cannot read homes, credential verifier material, or key material through account-management caps.
Gate 3: Disk-Backed Local Account Store
- Store account records, credential references, resource profiles, role
bindings, external bindings, and policy-version metadata in
capability-native
Store/Namespacerecords. - Add atomic update or compare-and-set semantics for account mutations.
- Add monotonic version/epoch checks to reject stale or replayed account records after reboot.
- Add local snapshot/export records for recovery and rollback inspection.
- Add QEMU reboot proof that a created local account, role binding, disabled state, and home namespace survive restart.
Gate 4: Default Resource Bundles
- Implement bundle construction for
guest,local-user,developer,service-account, andanonymousprofiles. - Allocate per-account
home,config,cache, and per-sessiontmpnamespaces through storage caps instead of synthetic path strings. - Add quota checks for home bytes, temp bytes, processes, threads, caps, memory, CPU share, endpoint queue state, in-flight calls, and log volume.
- Add QEMU proof that two local accounts receive different home/config namespaces for the same application binary.
- Add QEMU proof that guest and anonymous sessions cannot persist data or request durable home caps.
Gate 5: RBAC Runtime
- Implement a
RoleDirectorybacked by account-store role bindings. - Map roles to named bundle fragments and approval eligibility.
- Add policy tests for
account-admin,policy-admin,storage-admin,net-operator,service-operator,security-auditor, andrecovery-operator. - Add deny tests showing roles alone do not authorize capability calls after the relevant cap is absent or revoked.
- Add audit records for role grant, role removal, and role-derived bundle issuance.
Gate 6: ABAC Runtime
- Define the first
PolicyRequestcontext fields for auth freshness, source, external provider/tenant, object owner, object label, requested interface, method class, quota impact, and lease duration. - Prototype the
PolicyEngineboundary with a small in-repo evaluator or Cedar-backed host-side prototype hidden behind the same interface. - Add ABAC tests for fresh-auth requirements, remote-vs-local denial, provider/tenant scoping, maintenance windows, service measurement, and storage label constraints.
- Ensure
PolicyDecisioncannot be used directly by callers; only a broker may turn it into capabilities or leases. - Add QEMU proof that a stale authenticated session can keep ordinary home access only through policy-explicit recovery/renewal state and cannot obtain a privileged leased cap until renewal or reauth mints fresh grant leases.
Gate 7: External Users
- Implement external identity binding records keyed by provider and subject hash, with tenant and expiry.
- Normalize OIDC/passkey/certificate/cloud claims before they enter policy requests.
- Add explicit external admission configuration. It must either bind an external subject to an existing account or permit auto-creation with a named policy profile and resource profile.
- Add an external pseudonymous account profile with bounded temp storage, bounded durable storage only when configured, and no local administrative roles.
- Add explicit local-account binding flow for external users that need durable local home storage.
- Add tests rejecting stale tokens, wrong tenants, unmapped provider groups, disabled bindings, and external attempts to assume local admin roles without a binding rule.
- Add default-deny admission tests for absent external admission config, auto-creation disabled, and unknown policy/resource profile names.
Gate 8: MAC/MIC And Labels
- Attach confidentiality and integrity labels to account profiles, session profiles, namespaces, logs, secrets, and service accounts.
- Implement wrapper caps for read-like, write-like, control-like, and transfer-like method classes where labels affect the grant.
- Add tests for no-read-up, no-write-down, integrity write/control, and trusted-subject exceptions.
- Decide whether any label/hold-edge metadata must become kernel-visible for mandatory transfer rules, or whether broker and wrapper enforcement is sufficient for the first implementation.
Gate 9: POSIX Profile Adapter
- Add POSIX profile metadata for uid/gid/user name/group name/home path as compatibility data derived from account records.
- Ensure
setuid,chmod, and ownership metadata cannot grant caps outside the compatibility filesystem service. - Add tests proving POSIX metadata changes do not widen cap bundles.
Verification Gates
- Host tests for manifest validation, account-store mutation policy, role mapping, ABAC request construction, external binding normalization, and audit emission.
- QEMU smokes for manifest-seeded operator login, two-account namespace separation, guest/anonymous persistence denial, disk-backed account survival across reboot, external pseudonymous login, and stale-session privileged-grant denial.
- Documentation updates to
docs/security/trust-boundaries.md,docs/proposals/user-identity-and-policy-proposal.md, and storage docs before any implementation is treated as selected milestone work.
Shared-Service Demo Backlog
Detailed decompositions and design notes for chat, adventure, and federated
service demos. docs/tasks/README.md links here but should not inline these subtasks.
Design Notes
Multi-process userspace applications exercise the resident-server plus
shell-spawned-client pattern on top of the completed boot-to-shell, Endpoint,
ProcessSpawner, and session-bound invocation substrate. Chat has migrated to
service-scoped caller-session identity, and Aurelian ordinary player state is
also keyed by live caller-session metadata. The focused text adventure manifest
uses session-bound service grants for player, NPC, Adventure, and chat paths.
Reuse is extracted after the second service lands, not speculatively.
Federation is blocked on future
network-transparency proof work.
The authoritative migration gates for removing caller-selected shared-service
identity now live in docs/backlog/session-bound-invocation-context.md under
Gate 4. The older service-object migration backlog is historical background
only unless the selected milestone changes again.
The first slice keeps chat and adventure usable as ordinary spawned commands
over generic Endpoint grants plus explicit StdIO for terminal I/O. The
shared demos/capos-chat crate owns typed request/response DTOs for the
prototype bridge, while the top-level shell/ crate owns generic process
commands such as spawn, blocking run, wait, and grant parsing. The
StdIO clients are a smoke harness and compatibility path, not the target
capOS-native command boundary; native interactive apps should later expose
command surfaces as described in
docs/proposals/interactive-command-surface-proposal.md.
Room-scoped MUD speech (say, tell) maps naturally onto chat channels, so
adventure should consume the chat service rather than reimplementing pub/sub.
Keep the Adventure schema for world state and verbs that are not speech;
route say/tell/NPC dialog through Chat subscriptions scoped to room
channels.
Chat Follow-Ups
Completed context:
- MVP
Chatendpoint interface and event variants. The original receiver-selector identity MVP has since been replaced by service-scoped caller-session keys for normal chat membership. - Public chat lobby stays
#lobby; adventure room speech uses hierarchical#room/<world>/<room>channels, with the demo world under#room/demo/<room>. - Client
poll()is used for MVP event delivery; foreground client drains queued events before prompting again. -
demos/chat-server/scaffold with capos-rt entry and bounded per-channel history ring. -
join/leave/send/whoplus fan-out for the legacy endpoint-metadata MVP. -
demos/chat-client/over explicit capos-rtStdIOplus chat endpoint client. - Chat stays out of native shell builtins and runs as a spawned command
with omitted-badge
stdio: client @stdioandchat: client @chatgrants. -
make run-chatsmoke: shell-spawned client sends a line through the resident service, resident bot observes it and replies, foreground client prints the reply.
Remaining:
- Migrate chat from legacy endpoint receiver-metadata identity to
service-scoped caller-session keys.
ChatRootpossession authorizes join attempts; membership, channels, sends, leaves, and polls key off the live opaque caller-session reference instead of a caller-selected selector. - Add per-principal state keyed by
UserSession.info, admin-only verbs, typed denial results, and redacted audit records per join/leave/send. - Defer a distinct
Subscriptioninterface until federation or native command surfaces need a separate event authority object.
Client-server interface sync audit (2026-05-03 19:02 UTC):
- Walk the recent chat-server interface and behavior changes against the
demo chat clients to confirm no drift. Audit covered six commits
(
5dc0e8ca/exitbanner alignment,e7d0e00d[history]text label plusEventKind::History,45384fa0server-assignedmember-Nsender labels in place of the caller-supplied join handle,7bb90528idempotent re-join,dc7ece49membership keyed by caller session, andf5eab276EndpointUserDatarename) and confirmeddemos/chat-client/,demos/capos-chat/,demos/chat-bot/, anddemos/chat-observer/already track each one. The chat-client banner lists/exitand accepts both/exitand/quit;ChatEventKinddecodes all five schema variants and renders the server’s[history] <text>prefix verbatim while the chat-observer reportskind=historyvskind=liveseparately;event.senderis always taken from the server response so server-assigned member labels are shown without modification; re-join goes throughleave_waitfollowed byjoin_waitwith no stale “already joined” assertion; chat-server session keying and the neutral endpoint user-data name are entirely server-side. Verified live withmake run-chat(exit 0): the smoke transcript shows[chat] /join <channel>, /leave, /who, /exit, or plain text,[chat] #lobby <member-2> hello from shell, and[chat] #lobby <member-1> [chat-bot] echo-bot heard you.— i.e. the/exitbanner, server-assignedmember-Nlabels, and pass-through of message text all behave as expected. The[history]label path is not exercised bymake run-chat(chat smoke joins fresh) but is covered bymake run-adventureviaassert_adventure_npc_chat_history_actorintools/qemu-shell-smoke.sh. - Latent
demos/chat-bot/self-echo filter resolved in7b9c5993by skippingChatEventKind::Historyevents so replayed[history] [chat-bot] ...messages no longer slip past the prefix-only check.
Adventure Follow-Ups
Completed context:
- MVP
Adventureinterface for non-speech verbs and room views. - Legacy receiver-selector layout distinguishes player from NPC authority
on both a future
PlayerSessionand room chat channels. MVP manifests reserve low selectors for shell players (chat=1,adventure=2) and service/NPC authority at100+. - Rooms map to chat channels as
#room/<world>/<room-id>, with the demo world underdemo. -
demos/adventure-server/scaffold with a small room graph, typed world verbs, live caller-session keyed player state, and chat channel metadata. -
demos/adventure-client/as a spawned command over explicitStdIO,adventure, andchatendpoint grants. - Adventure
StdIOparser remains prototype-scoped and should later be replaced with a command surface exposing nested paths such asgo,take,drop,inventory,say, andchat join. -
make run-adventuresmoke: scripted player moves rooms, completes one state-changing world action, and exits cleanly through the shell-spawned client. - NPC-as-process fleet: one process per NPC, each holding manifest-issued legacy player/adventure receiver metadata plus chat endpoint authority for room dialog.
- At least two concrete NPCs ship with liveness asserted in the adventure smoke.
- NPC process exit surfaces as
ProcessHandlecompletion on the server side.
Game-depth follow-up:
- Decompose the Aurelian Frontier proposal through
docs/backlog/aurelian-frontier.mdrather than expanding this shared service harness backlog with content, combat, economy, and multiplayer details.
Session-bound identity follow-up:
- Finish adventure NPC and service-authority cleanup for the focused shared-service proof now that ordinary player state is keyed by live caller-session metadata. NPC service authority is broker/manifest-issued rather than caller-chosen, and the focused adventure manifest uses the already session-keyed chat service through ordinary chat authority.
Shared Harness Extraction
Completed context:
- Extract duplicated legacy endpoint receive/release/return loop used by
chat and adventure resident services into
demos/service-common/. - Defer a shared bounded event queue until chat history/inbox and
adventure/NPC event needs converge. Current evidence: only
chat-serverhas bounded history/inbox queues; adventure room state and NPC polling do not expose a matching queue abstraction. - Extract bot/NPC client scaffolding shared by chat bot and adventure NPC processes.
- Extract shared chat actor polling loop used by chat-bot, wanderer, and shopkeeper while keeping each actor’s cap validation, join/greeting, reply text, and exit logging local.
- Extract shared chat actor bootstrap for required
consoleandchatcaps plus the single-owner ring client, while preserving actor-specific failure text and behavior setup.
Federated Chat Milestone
Blocked on future network transparency.
Extend chat across hosts after a separate proof shows cap transport crossing machines. This integration test exercises networking, TLS, OIDC, key-management, and audit proposals together.
- Define cross-host addressing (
@user@host,#room@host) and record it in schema. - First cross-VM channel smoke: two QEMU instances, one message delivered across TLS.
- Federated audit: per-host records plus signed cross-host event trail.
Paperclips Terminal Demo Backlog
This backlog tracks future expansion of the clean-room Paperclips terminal demo described in Paperclips Terminal Demo. It is not the current selected milestone.
The clean-room mechanics baseline is recorded in
docs/research/paperclips-clean-room-functional-spec.md.
Use that note as the planning source for gameplay behavior. Do not copy
source-game implementation identifiers, text, assets, generated tables, exact
balance, CSS, or code when expanding content.
Current runnable status: the focused Paperclips manifest now boots an
authoritative Paperclips server and a terminal client. The server owns generated
content, resources, GameState, proof-command gating, unlock checks, and
game-rule mutation. The server owns regular timer cadence and exposes the
current command list and unlocked projects as structured data so server-mode
terminal clients render plain help and projects from server state.
Server-mode terminal clients also render plain status from the server’s
PaperclipsStatusSnapshot, while status --json remains proof-only and
server-gated. Follow-up work should move unlocked command facets behind
server-issued capabilities so later terminal and web clients do not reimplement
rules.
Design grounding for the client/server, structured command-list, structured
plain-status, and structured project-list slices:
docs/demos/paperclips.md,
docs/research/paperclips-clean-room-functional-spec.md,
docs/architecture/ipc-endpoints.md,
docs/architecture/capability-ring.md,
docs/proposals/session-bound-invocation-context-proposal.md, and
docs/proposals/system-info-proposal.md. No other research note applies
directly because this slice uses the existing endpoint/ring transport and does
not introduce a new external OS/runtime protocol.
Current Baseline
Implemented:
- Clean-room terminal implementation inspired by the public Paperclips premise without copying original game code or assets.
-
make run-paperclipsboots a focused manifest, launches Paperclips server services plus a terminal client through the shell, grants explicitStdIOplus aPaperclipsGameendpoint to the terminal client, grantsTimerto the server, drives the first production loop, proves project chains and proof gating, and exits cleanly. - In the focused manifest, game state is local to the Paperclips server
process and disappears when that server exits. Direct standalone launches
without a
gameendpoint retain the older in-process fallback. - The default
system.cuemanifest still advertises the standalone fallback launch withrun "paperclips" with { stdio: client @stdio, timer: @timer }because it does not start the Paperclips server. The structured command-list, status-snapshot, and project-list methods only change server-mode client rendering, so no MOTD/default-manifest text change is needed for this slice. - The pure rules layer lives in
demos/paperclips-contentand is host-testable separately from the terminal adapter. - Paperclips content is authored in CUE, converted through pinned
mkmanifest cue-to-capnpinto the Paperclips-specific Cap’n Proto schema, checked in as generated Rust bytes, then deserialized through typed Paperclips schema bindings at runtime. - Core game balance and content live in CUE: initial state, purchases, projects, unlock effects, production rates, millisecond intervals, currency formatting, price limits, trust thresholds, and phase transition values.
- Manual
makeproduces one clip only; counted manual make requests are rejected. - Automation advances from the
Timercapability in real time, whilerun <ms>is reserved for focused proof launches with an explicitproof_acceleratorcap. - Opening business loop has dynamic demand plus CUE-owned raw-material bundle pricing, slower market updates, purchase pressure, and generated content freshness checks.
- Business-phase explicit sales are time-aware: successful sales start a
CUE-owned cooldown, repeated immediate
sell <n>commands are refused without mutating state, andTimer/proof time advancement clears the cooldown. - Focused QEMU transcript now demonstrates manual work and explicit sales
funding Autoclipper License, one repeatable economic choice, one wire
purchase, and completion of
precision-rollerswith a visible autoclipper-count effect. - Focused QEMU transcript now also demonstrates representative Stage 1
refusal output: an early locked
buy autoclipper, an insufficient-fundsbuy wire 1000, pending manual work, bulk manual rejection, and lockedproject survey-drones, plus a high-pricesell 1demand refusal and asell 2requested-count sale capped by one available clip, plus a no-wire manual production refusal after automation drains the available wire. - Focused QEMU transcript now proves a Stage 2 project chain after
repeatable marketing investment and scaled business-phase production:
autoclipper-license,precision-rollers,design-search,forecast-engine, andsurvey-drones, ending at== autonomous phase ==. - Stage 3 autonomous rules now use CUE-owned millisecond intervals for drone local-matter conversion, factory wire consumption, probe cosmic matter conversion, and probe replication caps. Host tests cover the resource caps, scaling projects, cosmic replication, completion gating, and validation for the new rule fields.
- Focused QEMU transcript now continues after
== autonomous phase ==to completematerial-harvestersandfoundry-lines, run milliseconds, and assert visible drone/factory counts plus local-matter conversion and additional clip production. - Focused QEMU transcript now closes the representative late-game proof:
after the autonomous/factory proof it completes
mesh-coordination, transitions throughseed-probesinto== cosmic phase ==, asserts visible probe replication plus cosmic-matter conversion and clip production, then leavesfinal-conversionlocked.make run-paperclipsis a representative transcript, not an exhaustive playthrough. - Public player launches no longer expose fast-forward.
run <ms>is hidden from normal help output and refused unless the launched process receives the explicitproof_acceleratorcapability used by the focused QEMU proof manifest. The shell rejects attempts to mint that authority by renaming an ordinary@timergrant. - Player-facing project ids, labels, title text, completion text, and Strategy resource wording have been renamed away from distinctive source-game terms.
- Active schema, CUE content, Rust rules, generated-content guardrails, and focused smoke assertions use clean-room Strategy internals rather than source-game resource identifiers.
- Purchase parsing treats omitted counts as one and rejects explicit zero counts without mutating game state.
Known limits:
- There is no save/load path; process exit discards game state.
- The focused QEMU proof stops at the cosmic production milestone. It does not prove a compact full win; host coverage checks that the final conversion cost exceeds a generous one-hour normal-play creativity upper bound.
Clean-Room Gameplay Stages
Stage 1, opening business loop:
- Add host rules tests for manual production pacing, wire depletion, explicit sales, price-sensitive demand, marketing/demand investment, and automation intervals.
- Extend the focused transcript so existing early progression shows manual work, one economic decision, one automation purchase, and one project unlock without copying external balance values.
- Keep representative refusal output legible in the focused QEMU transcript for missing funds, pending manual work, bulk manual production, locked purchases, and locked projects.
- Add focused transcript cases for missing wire and demand/sale refusals.
The QEMU proof now asserts
No wire available.andNo demand at current price.without changing game balance. - Add any remaining Stage 1 sale-limit cases that need end-to-end
transcript proof.
The QEMU proof now asserts the unique
sell 2window starts with one available clip, ends with zero available clips, and incrementsSoldfrom 1 to 2.
Stage 2, data-driven project chain:
- Expand original CUE project content around generic effects: production multiplier/resource grant, demand policy change, compute-resource generation, strategy resource unlock, capacity grant, and stage transition. Generated CUE content now covers production/resource grants, public-demand grants, operations grants, Strategy unlock/resource grants, processor/memory capacity grants, and stage transitions. Direct trust grants remain unsupported because available trust is recomputed from clip milestones minus spent trust.
- Replace per-project Rust effect variants with one generic CUE-backed loader/evaluator. Project completion now applies generic production and resource grants, public-demand grants, compute and strategy-resource grants, design/Strategy unlock flags, capacity grants through processors/memory, and one-step stage transitions without matching gameplay on project ids or effect kinds. Direct trust grants remain unsupported because available trust is recomputed from clip-count milestones and trust spent; adding a trust field would be misleading without changing that invariant. Generated-content tests now verify the checked-in CUE payload exercises every currently supported generic category that fits the model.
- Add current-model validation for project graph bounds before adding more content. Host tests now reject invalid/empty ids, duplicate ids, too many projects for the completion bitset, zero-cost projects, no-op or zero grant effects, out-of-stage effects, stage-transition regressions, and missing transition paths from business to autonomous, cosmic, and complete.
- Add explicit prerequisite and cyclic unlock-chain validation after the
project schema grows named prerequisite/unlock edges.
Projectnow carries data-only named prerequisites from CUE through the typed schema; generated content records the intended business, autonomous, cosmic, and completion unlock chain; runtime availability requires completed prerequisites in addition to stage and cost gates; and host validation rejects missing, malformed, self-referential, duplicate, and cyclic prerequisite edges. - Add focused smoke coverage for at least one project unlock chain after repeatable demand investment, including a phase transition out of the business phase.
Stage 3, autonomous and completion mechanics:
- Model later-stage autonomous production with independently authored labels and bounded rules for resource conversion, factory/drone-style scaling, exploration or replication capacity, and completion progress.
- Add host tests for stage transition predicates, autonomous production cadence, trust/capacity limits, and the completion condition.
- Keep QEMU coverage representative rather than exhaustive: prove one transition and one timer-driven later-stage action, then rely on host rules tests for full playthrough cases. The transcript now covers one autonomous timer-driven conversion, one cosmic transition/probe interval, and locked completion-stage availability without scripting every late-game purchase.
- Split proof acceleration from player gameplay. Normal interactive
Paperclips sessions should advance only from the granted
Timercapability;run <ms>should either be removed from the player-visible command set or gated behind an explicit harness-only authority that normal shell users cannot mint accidentally.helpand docs should stop presenting fast-forward as a regular player command. Implemented by requiring theproof_acceleratorcap for terminal fast-forward and by proving the normal launch refusal before the accelerated QEMU proof path. - Rebalance the completion path after fast-forward is no longer public.
A normal player should not be able to reach
== complete phase ==within one real-time hour. Keep the smoke proof representative by stopping at selected milestones or by using a clearly test-only acceleration path, rather than shrinking late-game matter and project costs until a full win fits in the QEMU transcript. Implemented by increasingseed-probescosmic matter/wire scale, raisingfinal-conversionclip and creativity costs, adding host coverage for the one-hour creativity bound, and changing the focused QEMU proof to stop after cosmic probe replication/production withfinal-conversionstill locked. - Add coverage for the gameplay/test-mode split: host rules should still
test bounded millisecond advancement directly, but the terminal adapter
should prove that normal player input cannot invoke fast-forward. If a
harness-only accelerator remains, the focused QEMU proof must demonstrate
that it is tied to an explicit proof capability or proof manifest, not
ambient player authority. The focused QEMU proof first launches
Paperclips with
StdIOplus the normalPaperclipsGameendpoint, assertsrun <ms>is refused, asserts a forgedproof_accelerator: @timergrant is rejected, then relaunches against the proof server endpoint withproof_acceleratorfor the accelerated transcript. - Make business-phase sales time-aware. Repeated immediate
sell <n>commands should not bypass demand cadence; model a replenishing demand budget, outstanding orders, or a sell cooldown backed byTimeradvancement, and keep host/QEMU coverage for sale-limit refusals. Implemented with a CUE-owned sale cooldown and QEMU coverage for the immediate repeat-sale refusal. - Tighten clean-room naming. Replace player-facing names and text that mirror the source game’s title or distinctive project labels with independently authored names while preserving the generic paperclip maximizer premise.
Stage 4, persistence and assertions:
- Add a compact
status --jsonor equivalent machine-readable command only if future smoke tests need stronger assertions than the human transcript.status --jsonnow emits one deterministic compact JSON object with numeric game-state fields, and the focused QEMU proof asserts a late-game machine-readable status line without dropping the human transcript checks.
Blocked on platform persistence:
- Add save/load or restart-resume behavior after capOS has a durable user storage path appropriate for spawned demos.
- Keep saved state scoped to this child process or an explicitly designed storage capability; do not introduce ambient filesystem or service state.
Schema-Aware Content Migration
Completed:
- Defined
schema/paperclips-content.capnpas a bounded data-only schema for initial state, rules, purchases, trust milestones, projects, costs, and project effects. It contains no live capOS capabilities or interface objects. - Kept
demos/paperclips-content/content/paperclips.cueas the authoring source, now matching thePaperclipsContentschema root directly. - Converted generated content with pinned tools through
mkmanifest cue-to-capnp, then rendered checked-in aligned Rust bytes from the schema-validated binary. - Updated
paperclips-contentruntime loading to deserialize the typed Paperclips Cap’n Proto message instead ofcapos_config::CueValue. - Wired the freshness check into
make generated-code-checkthroughgenerated-paperclips-content-check.
Remaining guardrails:
- Keep generated content as schema-validated binary data; do not add runtime CUE parsing to the demo.
- Keep the focused QEMU transcript representative: one launch, one production loop, one automation purchase, one early project unlock/effect, and clean exit. Cover larger rule validation with host tests.
- Continue using the Rust validator for semantic bounds that Cap’n Proto cannot encode directly, such as project count, id shape, graph reachability, and nonzero costs/effects.
Client/Server Architecture Backlog
Goal: migrate Paperclips from one terminal process into an authoritative server
plus thin clients. The terminal client should render output, parse player
commands, and invoke server capabilities; it should not own GameState, timer
advancement, proof acceleration, unlock checks, or game-rule mutation.
Staged tasks:
- Define the first coarse Paperclips server/client schema. The initial
PaperclipsGameendpoint covers initial text, command text, command results, proof-only explicit time advancement, and the first structured command-list and plain-status queries. Regular automation is driven by the server’s own timer capability. Session creation, broader structured state/events, project/purchase listings, and capability transfer points remain future protocol work. - Add the first Paperclips server process. It owns
GameState, generated content, proof-command gating, unlock checks, and mutation rules while preserving current clean-room mechanics and host rules coverage. - Convert the terminal Paperclips process into a client when a
gameendpoint is present. The client keeps stdio, blank-command repeat, and transcript handling, then routes commands to the server. It still accepts server-rendered text in this first slice. - Move regular game timer cadence into the Paperclips server. Server-mode terminal clients still receive a timer grant so they can poll and display server-generated status messages while the player is idle at the prompt.
- Add a structured command-list protocol method. The Paperclips server
reports the commands available for the current state/session, including
proof-only and later-stage commands only when they are actually available,
and server-mode terminal clients render
helpfrom those command specs. Command execution remains the existing text request path in this slice. - Add a structured status snapshot protocol method.
PaperclipsGame.statusreturnsPaperclipsStatusSnapshotwith the fields needed to render the existing plainstatustranscript, and server-mode terminal clients format that snapshot locally instead of relying on server-formatted status text.status --jsonremains proof-only instrumentation decided by server-side authority and is not exposed through normal structured status. - Add a structured project-list protocol method.
PaperclipsGame.projectsreturns unlocked project entries with id, label, description, rendered cost, and status markers so server-mode terminal clients render plainprojectslocally from server-provided state.project <id>execution remains the existing text request path and still mutates server-owned game state. - Split command parsing and presentation more cleanly. The terminal client can now render help from structured server command specs, plain status from structured server snapshots, and plain projects from structured server project lists, but it should eventually parse player command syntax and render broader structured server state/events instead of sending raw command strings and displaying server-formatted command results.
- Model unlocks as server-issued facets or command capabilities. Early
stages may expose coarse
play,project,purchase, andprooffacets; later stages should narrow toward facet-per-command authorities once capability transfer ergonomics are ready. - Keep proof acceleration explicit. The server, not the client, should decide whether proof-only commands such as millisecond advancement and machine-readable status are available for a session.
- Update
make run-paperclipsto prove the first split in QEMU: shell launches server services plus terminal clients, normal and proof sessions use different server endpoints, proof-only commands remain gated by server authority, and the existing representative transcript still exits cleanly. - Extend
make run-paperclipsafter command facets land: prove the client cannot mutate state locally beyond server-granted facets and that unlock/facet changes are visible. - Add the later web shell/client path after the server protocol is stable. A browser-facing client or gateway should share the same Paperclips game capabilities as the terminal client instead of reimplementing game logic.
Deferred:
- Add durable save/load only after capOS has a durable user storage capability appropriate for spawned demos.
- Split every gameplay command into a distinct transferred capability only when the platform has ergonomic capability transfer and revocation patterns for short-lived command facets.
Run Targets, Init Mandate, And Default-Run Integration
This backlog captures three intertwined make-target and manifest-policy
requirements raised against the current Makefile and system-*.cue set. They
share manifests, harness scripts, and review surface, so they should land as
one mainline track rather than scattered fixes.
Policy Statements
make run-*targets only start QEMU. Any scripted input driving, transcript assertion, timeout-based pass/fail, log greps, or harness script wrapping must live outside therun-*recipe – either in a siblingtest-*target or in a host harness invoked by the user directly.initusage is MANDATORY in every boot manifest. The boot init binary must beinit(thecapos-initELF). Service or demo binaries such ascapos-shell,credential-store,terminal-session,network-client,revocable-read,memoryobject-shared-parent, and per-demo entrypoints must be declared as services and launched byinit, never as the top-level init binary.make runstays the default user-facing target demonstrating a sane, safe, full-featured (as of the current state) capOS instance. When a milestone introduces a user-visible common service or binary, it must be integrated intomake run– either auto-started or advertised through MOTD instructions describing how the operator reaches it – as part of the milestone’s doc-update gate.
Current State
run-* recipes that contain test logic
Snapshot from Makefile at branch base. All targets in this list embed
input drivers, asserts, or harness invocations and therefore violate
policy 1:
run-smoke,run-uefi,run-netrun-spawn,run-shell,run-restricted-shell-launcherrun-chat,run-adventure,run-terminalrun-credential,run-login,run-login-setup,run-local-usersrun-tcp-listen-authorityrun-revocable-read,run-memoryobject-sharedrun-ssh-host-key,run-ssh-authorized-key,run-ssh-public-key-session,run-ssh-public-key-auth,run-ssh-feature-policyrun-ringtap-failing-callrun-measure
(run-network-client, run-telnet(-vm), and
run-ssh-gateway-terminal-host(-vm) were on this list but are now exit-2
retirement stubs with no test logic, retired with the kernel socket owner.)
Compliant run-* recipes (QEMU-only):
run– interactive, manifest-driven, terminal on stdio.run-display– interactive variant with QEMU display.
Manifests violating the init mandate
Init binary is something other than init:
system-smoke.cue– init binarycapos-shellsystem-shell.cue– init binarycapos-shellsystem-login.cue– init binarycapos-shellsystem-login-setup.cue– init binarycapos-shellsystem-local-users.cue– init binarycapos-shellsystem-credential.cue– init binarycredential-storesystem-terminal.cue– init binaryterminal-sessionsystem-revocable-read.cue– init binaryrevocable-readsystem-memoryobject-shared.cue– init binarymemoryobject-shared-parent
Manifests already compliant: system.cue, system-adventure.cue,
system-chat.cue, system-spawn.cue, system-measure.cue,
system-restricted-shell-launcher.cue,
system-tcp-listen-authority.cue, all remaining system-ssh-*.cue
(system-telnet.cue, system-network-client.cue, and
system-ssh-gateway-terminal-host.cue are removed with the kernel socket
owner).
Default-run feature integration gap
make run boots system.cue, which already wires the anonymous shell,
the login flow with the seeded password verifier in MOTD, the
chat/adventure demos, chat/adventure spawn instructions, the host-local
remote-session CapSet gateway, and (as of 2026-05-14 09:07 UTC) the
self-served remote-session-web-ui service. The Telnet research demo is
retired (the focused make run-telnet / system-telnet.cue path and its
gateway demo are removed with the kernel socket owner).
Milestones still absent from the default path
or its MOTD are local-user setup, terminal-session focused proofs, SSH gateway
terminal host, and any future SSH shell milestone.
The default make run recipe now attaches virtio-net with
host-local remote CapSet forwarding to guest port 2327 and host-local web UI
forwarding to guest port 8080. Both use the same ?=-overridable host port
with fallback-to-free-port behavior implemented in
tools/qemu-run-hostfwd.py. Other network-backed milestones, such as the SSH
gateway terminal host and future SSH shell, still require their own safe
default forwarding or an explicit deferral before they can be called integrated
into make run.
Open Gates
Gate A: Naming and contract
- Decide the rename split. Pick one of the two consistent options
and apply it uniformly:
- Strict:
runandrun-displayare the onlyrun-*entrypoints; every other currentrun-*recipe (includingrun-uefi,run-net,run-measure) becomestest-*regardless of whether its body is reduced to a plain QEMU start, because the policy is enforced by name, not by current contents. - Permissive: any QEMU-only recipe against the default manifest with documented firmware/device flag variations may keep arun-*name, withtest-*reserved for recipes that script input or assert output. Pick this only if the policy text inCLAUDE.md/REVIEW.mdcan spell out the boundary unambiguously so reviewers do not have to relitigate the split per target. - Document the chosen policy in
CLAUDE.md“Build and Test” section andREVIEW.mdso future targets are added under the right prefix without case-by-case judgement.
Gate B: Init mandate enforcement
- For every non-compliant manifest above, restructure so the init
binary is
initand the previous top-level binary becomes a service. Preserve the focused-proof intent: the service receives the same scoped caps it had as init, init holds only the bootstrap authority needed to spawn and supervise it, and the smoke/proof transcript continues to assert the same boundary properties. - Add a manifest-loader validation rule (or
mkmanifestcheck) that rejects any manifest whoseinitConfig.init.binaryis notinit. The rule should also reject the field being missing. Update host tests to cover the negative case. - Update every doc that currently describes shell-led or
service-led manifests as having the service as init. A 2026-04-28
12:48 UTC docs pass reconciled the current default
system.cuepath as standalone-init-owned while preserving focused shell-led smoke descriptions wheresystem-smoke.cueandsystem-shell.cuestill bootcapos-shelldirectly. Gate B remains open until the focused manifests themselves are migrated or documented as explicit exceptions, the loader/manifest validation rule lands, and a final re-grep confirms no stale default-boot wording remains.
Gate C: Test split
- Move every scripted input driver, transcript assertion, timeout
wrapper, harness invocation, and log grep currently embedded in a
run-*recipe into a newtest-*recipe. Therun-*side, where retained, becomes a one-lineqemu-system-x86_64 ... $(QEMU_COMMON) $$serial_argsinvocation against the same ISO. - Keep
tools/qemu-*-smoke.sh,tools/qemu-*-harness.sh, and the ringtap viewer assertion out ofrun-*recipes. They are acceptable insidetest-*recipes or as standalone host scripts. - Update CI hooks, developer docs, and
docs/tasks/README.mdcheckpoints that referencemake run-<x>for verification to callmake test-<x>instead. Audit the migrated review-finding task records, theREVIEW_FINDINGS.mdtombstone history, and the recent changelog updates so historical entries stay accurate while new gates use the renamed targets.
Gate D: Default-run feature integration
- Define an integration checklist that every milestone’s doc-update
step must satisfy before close: either auto-start the new
user-visible service from
system.cuewith safe defaults, or extend the MOTD with a clear, copy-pasteable instruction block describing how to reach the feature from the default boot. - Backfill the integration for already-shipped milestones whose
user-visible services are still absent from
make run: local-user setup, terminal-session, and the SSH gateway terminal host slice. The Telnet gateway remains a focused research fixture undermake run-telnetand is deliberately absent from the default operator path. For each remaining milestone, either wire the service intosystem.cue(preserving the default-safe posture) or add a MOTD section with the exact command. Network-backed milestones must also record the QEMU device and forwarding posture. SSH gateway terminal-host integration remains deferred until its production/non-loopback gates pass or a separate host-local development forwarding rule is reviewed. A MOTD-only addition is not sufficient for a network-backed milestone. - Add the integration checklist to the “Stage Implementation
Workflow” section of
CLAUDE.mdso future milestones cannot land without it.
Interaction With Paused SSH Shell Gateway Milestone
docs/tasks/README.md currently pauses the SSH Shell Gateway behind Service Object
Identity Migration. When SSH work resumes, it will still have a visible goal of
make run-ssh-shell and additional make run-ssh-* proofs. Without an
explicit checkpoint, that milestone can land more non-compliant run-* recipes
(scripted host harnesses, transcript asserts, network-only smokes) before this
backlog is applied.
- Before the SSH Shell Gateway milestone closes, add Gate A’s
naming decision and Gate C’s test split as a milestone-level
prerequisite: the user-visible target name (
run-ssh-shellvstest-ssh-shell) and the location of any host harness must conform to the chosen rename split, andmake runintegration must be addressed under Gate D rather than left as a separaterun-ssh-*recipe. Record the decision in the SSH milestone checkpoint or block its closeout.
Sequencing
Gate A is purely policy and naming and unblocks the others. Gate B
(init mandate) and Gate C (test split) can proceed in parallel on
separate branches per affected manifest area, because they touch
different files: B rewrites system-*.cue and may add services to
init/src/main.rs, while C touches Makefile and the tools/qemu-*
harnesses. Gate D follows once the test split lands so MOTD updates
land alongside system.cue changes without competing with make run’s
recipe.
Out Of Scope
- Renaming or relocating
tools/qemu-*-smoke.shandtools/qemu-*-harness.shscripts. They stay where they are; only their callers change. - Producing a new test runner that aggregates all
test-*targets. That is a separate CI ergonomics task. - Reworking the focused-proof transcript content. The intent is to preserve current proof coverage, not extend it.
Aurelian Frontier Backlog
Detailed decomposition for growing the current deterministic mission slice
into the Aurelian Frontier game described in
docs/proposals/aurelian-frontier-proposal.md.
This track is low priority and currently dormant (deprioritized in
docs/roadmap.md under “Game/demo plans … are deprioritized”). It is a
forward decomposition reservoir, not a landed-history log: completed-phase
milestone chronology lives in docs/roadmap.md (the dated Aurelian Phase 9-12
entries) and in git history. This file keeps the forward-looking plan, the
unstarted gates/themes, and one-line orientation for why current shapes exist.
Promote it into docs/tasks/state.toml and root task records only when the
selected visible outcome changes to a game-depth milestone.
Current Baseline
The deterministic Aurelian expedition slice is landed: a shell-spawned
adventure-client with explicit StdIO/Adventure/Chat grants drives a
session-keyed adventure-server that owns room, inventory, combat, writ,
evidence, and effect state. Typed Adventure methods cover look, movement,
inventory, inspect, use, status, combat, authority verbs, delegation,
order, seal, leave, and the market/repair verbs. The expedition mission
proves ward-writ, route evidence, ward-wraith combat, delegation, effects,
eagle-standard recovery, witness-certified custody, evacuation, gate sealing,
downed-state refusal, and leave cleanup. adventure-content owns the pure
deterministic combat/zone/profile foundation and bounded construction-job state.
Inventory/status splits into Items, Writs, Relics, Marks, Evidence.
Phase-level done/not-done state is encoded in the checkbox lists below; the
dated landed milestones for each phase are in docs/roadmap.md and git.
Known limitations (still open):
- Most state-transition/failure text still lives in Rust handlers. Authored item, spell, and use text has moved into generated content for the named slices; broader text migration is open.
- NPCs that matter to world state are mostly server text, not separate actors holding scoped game authority. Aurelian chat-only boot NPCs share init’s system session under session-bound chat membership, so the smoke proof treats them as one session-keyed chat member (all greetings visible, Centurion Varro the single deterministic polling reply actor). Distinct concurrent NPC chat memberships need distinct spawned session contexts.
- Combat profiles are generated and proven for the current mobs, but broad weapon parsing, durable alert groups, pending interruption state, generalized stealth openings beyond the imp-scout route slice, and broader authority-combat verbs remain open.
- Rank, faction, debrief, market, party, and item-transfer logic are bounded proof slices, not durable profile/ledger subsystems. PvP consent and two-client multiplayer proofs are not present.
- Construction jobs are bounded to one service-owned field-repair proof. They do not yet persist durable stock ledgers, replenish from outposts, update output/currency inventories, advance job time, persist crash-recovery state, or expose a general crafting API.
Implementation Posture
The kernel capability model remains the authority boundary. Game code should not be trusted because it is written in Rust or Lua; it should be trusted only to the extent that it holds narrow caps and correctly uses typed capOS interfaces. A useful game demo should eventually show both Rust and Lua code using the capability model properly.
Rust remains the right implementation language for bounded state, no-std userspace services, typed Cap’n Proto calls, deterministic QEMU proofs, and resource validation.
Do not let Rust become the long-term content authoring language. Larger room graphs, mission beats, item descriptions, dialogue hints, aliases, shop catalogs, and debrief text should move into a bounded data-driven mission format before the Aurelian content grows materially.
Keep this split:
- The kernel owns authority enforcement through capabilities, while Rust services own simulation rules, combat resolution, object limits, schema encoding, and failure behavior.
- Mission content owns room/site data, visible descriptions, actor dialogue, aliases, lead text, deterministic encounter placement, and debrief records.
- Lua can later own deterministic scenario glue and NPC behavior when the
capos-luarunner exists: mission beats, state-machine dialogue, debrief variants, quest-board text, and scripted reactions that still call typed capOS/game interfaces through granted caps. - Runtime loading may stay compile-time embedded at first, but the content must pass the same validator used by host tests and QEMU smoke setup.
Candidate content formats:
- CUE plus
mkmanifest cue-to-capnp: preferred for new schema-rooted data messages now that host-side CUE evaluation can feed a caller-specified Cap’n Proto struct through the pinnedcapnp convertpath. - RON: compact Rust-native authoring, but adds another format and tooling convention.
- TOML: familiar for simple data, weaker for graph validation and nested mission rules.
Prefer CUE if the implementation can reuse existing host-side validation and generate a bounded Rust data blob. Avoid runtime parsing in the game service until there is a concrete reason.
New Aurelian content migrations should use the cue-to-capnp flow when the
data has, or needs, a stable schema boundary:
-
Define a bounded Cap’n Proto root struct for the content slice rather than extending
SystemManifestor encoding ad hoc JSON. -
Author the source as CUE in package mode, with the same id/text/list bounds documented here and with build-time variation supplied through
--tagorCAPOS_CUE_TAGSonly when the generated output is intentionally tagged. -
Convert with the pinned tools:
make cue-ensure capnp-ensure CAPOS_CUE="$(make -s cue-path)" \ CAPOS_CAPNP="$(make -s capnp-path)" \ cargo run --manifest-path tools/mkmanifest/Cargo.toml --target "$(rustc -vV | awk '/^host:/ {print $2}')" -- \ cue-to-capnp --package adventure_content --import-path schema \ demos/adventure-content/content/prototype.cue schema/adventure-content.capnp \ AdventureContent target/generated-adventure-content.bin -
Feed the converted data into the existing host validator/generator or a reviewed no-std decode path, then check in only deterministic generated artifacts required by the current build.
-
Keep live capOS authority out of content files. Writs, grants, NPC roles, and future service references may be represented as ids or policy records, but actual capability transfer stays in runtime IPC and service logic.
The existing tools/adventure-content-gen JSON-to-Rust path may remain for
already implemented slices. When a new content family needs a schema or a
larger migration touches generator boundaries, prefer moving that family to
cue-to-capnp instead of growing bespoke JSON parsing.
Near-Phase Gates
The first game-depth milestone must produce a player-visible improvement. A branch that only moves the existing hardcoded room data into a generated blob is technical prep, not completion of the near phase. The first complete near-phase slice must keep the current Aurelian expedition mechanically stable while also making the path discoverable through canonical ids, aliases, lead text, and specific failure messages.
Legacy endpoint badges are not part of the Aurelian authority model. New
Aurelian phases must keep player, party, NPC, and chat participation keyed by
session-bound invocation context or by future broker-granted service facets,
not by manifest-assigned or user-selected receiver selectors. The focused
run-adventure gate rejects system-adventure.cue if badge: fields are
reintroduced.
Input and content bounds for the near phase:
- command lines accepted through the current
StdIOadapter: 256 bytes; - typed object ids, actor ids, mob ids, writ ids, directions, spell names, and
skill names: 64 bytes, ASCII alphanumeric plus
_and-; - chat
saytext and future free-form command text: 256 bytes after trimming, with no semantic parsing beyond the declared text field; - generated content ids and aliases: same 64-byte id rule unless a reviewed schema/runtime change raises it;
- room/site titles: 80 bytes; descriptions: 320 bytes; lead and failure hint lines: 160 bytes; actor dialogue and debrief lines: 320 bytes;
- content lists must use the explicit per-player, per-site, and per-room caps in this file, not unbounded vectors.
If generated mission content is checked in, every branch that changes content or
the generator must provide a freshness check equivalent to
make generated-code-check; stale generated Rust blobs are a review finding.
Authority-RPG Direction
The next design target is a compact expedition RPG where rare authority is RPG power fantasy, not paperwork. The core loop is:
accept mission
choose writs / companions / relics
enter dangerous site
discover authority conflicts
fight / negotiate / delegate / revoke
extract with loot, survivors, evidence, or consequences
upgrade rank, base, companions, and future authority
Design rules for subsequent backlog slices:
- Writs are loot: gear, skill tree, access key, social status, and sometimes curse. A good writ changes what the player can do, carries inspectable issuer/scope/expiry/delegation/revocation rules, and may have bounded affixes or drawbacks under the mission seed.
- Classes are authority archetypes: Warden, Marshal, Archivist, Custodian, Factor, and Heretic/Renegade. Differences come from legal, social, and supernatural verbs, not generic damage numbers.
- Delegation is buildcraft. Companion loyalty, ambition, competence, reputation, fear, and doctrine should affect how delegated authority behaves under pressure.
- Combat attacks authority as well as HP. Forgers, null-priests, bandit captains, corrupt magistrates, spies, oathbreakers, and wraiths should threaten writs, custody, witnesses, route grants, and legal control.
- Denial should reward with leads: a missing witness, hidden jurisdiction, forged seal, rival claim, corrupt actor, unsafe state, rank gate, or alternate route.
- Progression unlocks reach: new jurisdictions, deputy appointment, remote revocation, relic custody capacity, hostile negotiation, disputed shrine access, and operating without a local witness in constrained cases.
- Base modules unlock verbs. Archive, Temple vault, Barracks, Court, Market hall, Signal tower, and Sanctuary should affect future expeditions through explicit actions, not passive percentage bonuses.
- Controlled randomness covers mission complications, route hazards, faction demands, companion behavior, relic side effects, enemy authority tricks, optional objectives, and loot/writ modifiers. The legal model remains deterministic and auditable under a seed.
- Multiplayer stays scoped to cooperative expedition pressure first. Defer MMO scale, open economies, broad construction seasons, LLM-critical NPCs, federation, and worldlines until the compact expedition loop is excellent.
The pure combat-targeting foundation, generated combat profiles, server integration, and the bounded authority-challenge / writ-affix / delegation / Archive-reach proofs have landed (see the Phase 8/9 checkboxes). Remaining forward sequence:
- Extend the first authority-attacking enemy behavior beyond the bounded forged route/custody claim into broader authority-bearing enemy variants.
- Generalize writ-affix and delegation-buildcraft proofs beyond the single
bounded
ward-writ/Livia cases into more writs and companions. - Extend base/rank reach unlocks beyond the bounded Archive evidence unlock without starting a general construction/base-management system.
- Extend construction jobs only after a visible gameplay need appears: durable stock ledgers, job-time advancement, artifact custody outputs, and facility slot capacity remain future work.
- Keep proofs deterministic: pure Rust tests for new rules and one
adventure-scenario-testpath per new cross-service behavior; keep the shell transcript to representative parser coverage.
Phase 1: Player-Visible Mission Substrate
Visible outcome: a first-time player can complete the current Aurelian expedition
without reading source or memorizing hidden ids, and the read-only mission
content comes from a validated generated blob instead of hardcoded room tables
and scattered text. The mission path and existing QEMU transcript outcomes stay
stable, but look, status, inspection, and failures become clearer.
- Define a bounded
AdventureContentmodel for sites, exits, visible items, actors, mobs, aliases, objectives, leads, and scripted proof-path metadata. - Add host validation for content graph integrity: unique ids, valid exits, valid aliases, referenced actor/item/mob ids, bounded text length, and deterministic ordering.
- Generate or embed a compact static Rust representation for userspace;
keep runtime parsing out of the
no_stdservice unless explicitly justified. - Add a generated-content freshness check and wire it into the relevant branch verification so checked-in content blobs cannot drift from source mission data.
- Move current
square,tavern,garden,cellar,map,coin,key,scout-marker, andward-wraithdescriptors into content data. - Keep all state-changing behavior in Rust handlers; content may select text and ids but must not bypass authority checks.
- Extend
AdventureRoomViewor status text solookpresents objective, visible interactables, actors, active mobs, exits, and one lead line. - Add canonical-id display for objects, actors, mobs, writs, and exits.
- Add alias resolution for common casing and titles, with responses that name the resolved canonical id.
- Add near-miss suggestions for known ids, starting with common failures
such as
ward->ward-writ,wraith->ward-wraith, andliviacasing. - Improve invalid
orderresults so they name plausible next actions when player knowledge allows it. - Split status text into survival state, mission state, held/delegated authority, evidence/effects, and lead.
- Add host tests for rejecting malformed content graphs.
- Keep
make run-adventuretranscript stable after the migration and add assertions for at least one canonical-id suggestion and one improved actor-task hint.
Implementation notes:
- Start with read-only content fields. Do not introduce a general scripting engine for mission logic in this phase.
- Keep object ids ASCII, stable, and bounded by the near-phase limits above unless a reviewed schema/runtime change raises those limits.
- Lua scripting belongs after the data model exists. Do not use Lua to bypass the content validator or make transcript-critical behavior depend on an unbounded script.
Phase 1b: Deterministic Scenario Scripting
Visible outcome: once capos-lua can run scripts with exact grants, selected
scenario and NPC behaviors can move from Rust match branches into deterministic
Lua scripts without changing the authority boundary.
- Use
docs/proposals/lua-scripting-proposal.mdas the scripting design source. - Expose only narrow game host APIs to scripts, such as read current mission state, choose a dialogue branch, emit a debrief line, or request a typed game action through a granted object cap.
- Keep mission authority, inventory mutation, relic custody, combat damage, and cap transfer in kernel-enforced capability calls and Rust service handlers.
- Add deterministic script fixture tests for NPC state machines and scenario beats.
- Add QEMU transcript coverage showing one Lua-scripted NPC or scenario reaction using a granted cap and one denied ungranted path.
- Keep Rust and Lua examples side by side so the demo proves capability discipline is language-independent.
Cut scope:
- No dynamic native Lua modules, no broad
ProcessSpawner, no raw CapIds in scripts, and no script-owned authority beyond the runner’s CapSet.
Phase 1c: Non-Deterministic NPC Brains
Visible outcome: non-transcript-critical NPC flavor can later use the language-model/agent proposals without weakening deterministic proofs.
- Use
docs/proposals/llm-and-agent-proposal.mdfor any LLM-backed NPC implementation. - Keep LLM NPCs behind narrow caps and treat model outputs as suggestions or dialogue data, not authority.
- Restrict LLM use to ambient tavern chatter, optional hints, flavor summaries, or player-facing explanation when exact transcript output is not part of the proof.
- Keep main mission success paths, combat outcomes, custody decisions, policy denials, and QEMU smoke assertions deterministic.
Deferred from Phase 1:
- Dynamic completions belong with the future
CommandSessioninterface and should not duplicate full parser logic in theStdIOadapter.
Phase 2: Aurelian Expedition Map
Visible outcome: the playable mission uses the proposed frontier expedition locations rather than the four-room prototype.
- Replace prototype content with a small
Sitegraph:fort_aurelian,gate_yard,ashen_road,signal_tower, andunder_vault. - Model site metadata: region, threat level, exits, visible items, actors, active wards, and optional required route authority.
- Implement the first mission objective: recover
eagle-standardfrom the ruined signal tower. - Add complications: unstable tower gate, wounded legionary behind a ward, guild scout route information, and temple witness custody requirements.
- Provide at least two acceptable good outcomes, such as recovered standard plus sealed gate, or recovered standard plus survivor evacuation.
- Update
make run-adventureto drive the new mission path with stable assertions.
Cut scope:
- Do not add random mission variants in this phase.
- Do not split mission state into a new service until the single-server model blocks explicit authority or proof coverage.
Phase 3: Authority Inventory And Relic Custody
Visible outcome: player-facing inventory makes authority, evidence, and relic custody visible without implying every entry is a pick-up item.
- Split inventory/status output into
Items,Writs,Relics,Marks, andEvidence. - Keep
takeanddropfor physical items only. - Keep
request,accept,delegate, andrevokefor authorities. - Add
relic custodystate foreagle-standard, including a failure path when the player lacks temple or rank authority. - Add
temple-sealor equivalent witness-certified custody proof. - Ensure relic failures distinguish missing location, missing authority, unsafe state, and witness refusal.
- Add QEMU assertions for relic custody denial, successful custody, and
audit/evidence status output. Complex custody coverage runs in the capOS
adventure-scenario-testuserspace process through realAdventurecap calls; the shell-drivenadventure-clienttranscript remains representative interactive client coverage.
Phase 4: Persistent Profile And Ledger Substrate
Visible outcome: player profile data and mission evidence have bounded save/load semantics, while ordinary client launches remain fresh unless the player explicitly resumes an expedition.
- Define bounded Cap’n Proto records for
AdventureProfile,AdventureExpeditionCheckpoint, andAdventureLedgerRecord, including schema version, content hash or release id, profile id, record/checkpoint version, size limits, and migration policy. - Add host tests for save-record encode/decode, first schema-version acceptance, unknown-content rejection, over-limit rejection, stale-version rejection, and wrong-profile rejection.
- Add the
AdventureProfileServicesummary substrate for bounded create/load/save, local non-reward settings and progression updates, and validation of rank marks, warrior stars, wizard circles, faction standing, cosmetics, contributor badges, title choices, and settings. - Connect
AdventureProfileServicereward and title mutations to ledger-backed authorization onceAdventureLedgerexists, so rank marks, faction standing, cosmetics, contributor badges, and title choices are applied from auditable mission facts rather than direct summary edits. - Add
AdventureLedgeras append-only mission evidence: debrief records, relic custody, forbidden-rite use, witness certifications, reward mints, market/trade receipts, and revocations. - Add
AdventureExpeditionServicefor active expedition checkpoints: current site, objective state, player state, party state, mob state, pending events, and turn ordering. - Add
AdventureSaveStoreas the only persistence adapter used by the profile, ledger, and expedition services. It may target RAM, local disk-backedStore/Namespace, or a futureCloudGameStore, but gameplay services should not call provider-specific APIs directly. - Prove the local baseline first: save and reload a profile, append and replay one ledger record, and explicitly checkpoint/resume one expedition through RAM-backed or disk-backed store semantics.
- Keep
run adventure-clientfresh by default. Add an explicitresumecommand or profile option before loading active expedition state. - Add proof coverage for one rejected stale checkpoint write and one rejected wrong-profile load.
Cut scope:
- Do not make the kernel persist process memory or the live capability graph.
- Do not merge divergent combat checkpoints automatically; reject stale writes and require the player or service to pick a checkpoint.
- Do not require GCP to pass the local QEMU proof path.
Phase 5: Cloud Persistence Bridge
Visible outcome: the same profile, ledger, and expedition records can be stored through an optional cloud-backed capability without changing game service logic.
- Define
CloudGameStoreas a narrow bridge with save/load/append operations matching the localAdventureSaveStoresemantics. - Keep the GCP bridge outside the game authority boundary: the bridge stores
records, but
AdventureProfileService,AdventureLedger, andAdventureExpeditionServicedecide which mutations are valid. - Use Firestore Native mode only for mutable profile/index documents and transactional compare-and-set style updates.
- Use Cloud Storage for versioned snapshots and larger evidence blobs, with object versioning and lifecycle policy so old snapshots do not accumulate without bounds.
- Use Cloud Run or an equivalent narrow service endpoint for the bridge and Secret Manager for bridge-side service credentials. Do not expose those credentials inside ordinary game clients.
- Add local fake-cloud tests that enforce the same stale-write, wrong-profile, append-only-ledger, and size-bound behavior before using real GCP services.
- Add an operational note for project, region, IAM service account, retention, backup/export, and cost controls before any real deployment.
Operational note:
- The first real deployment must use a dedicated Google Cloud project per game-world environment, or an equivalently isolated folder/project split for development, staging, and production. Record the project id, numeric project number, billing account owner, support contact, and break-glass owner in the deployment runbook before enabling writes.
- Choose one primary region for the Cloud Run bridge, Firestore database, Cloud Storage buckets, Secret Manager secrets, and Cloud KMS keys unless a reviewed multi-region design exists. The runbook must name the region and the data-residency reason; cross-region replication is a separate design decision because it affects latency, cost, and recovery semantics.
- Cloud Run is the only provider-facing bridge endpoint in this phase. Ordinary
capOS game clients see only the
CloudGameStorecapability and never receive Firestore document names, bucket names, OAuth tokens, service account keys, Secret Manager secret names, or broad network/provider authority. - The bridge must not be public. Launch requires authenticated invocation,
no
allUsersor disabled-invoker-IAM setting, an explicit Cloud Run ingress mode, and a named invoker identity for the capOS bridge path. Public HTTPS exposure or unauthenticated browser calls would bypass theCloudGameStorecapability boundary even if provider credentials remain hidden. - The bridge runs as a dedicated service account. Isolate Firestore by database or project boundary, then enforce adventure collection/document path allowlists in bridge code before issuing provider calls; do not rely on Firestore security rules or collection-scoped IAM for server-side access. Grant only the database-level Firestore role needed by the isolated database, Cloud Storage object access for the configured adventure buckets, Secret Manager secret access for named bridge secrets, and KMS encrypt/decrypt authority for the configured game-world key. Do not grant project owner/editor, wildcard bucket admin, or user-browser OAuth authority to the bridge.
- Firestore Native mode holds mutable profile/index documents and version/CAS records only. Every mutable write must read the current document version and commit inside a transaction or equivalent preconditioned update; stale writes fail closed and preserve the current document.
- Cloud Storage holds immutable or versioned records: expedition snapshots, larger evidence blobs, exports, and content-addressed objects. Buckets must enable object versioning before production writes and must have a lifecycle policy bounding noncurrent versions and abandoned exports. Versioning is recovery, not immutability: create-only evidence and content-addressed writes must use generation-match preconditions, and audit evidence that must resist replacement or deletion needs an explicit retention policy or hold gate before launch.
- Retention policy belongs in the runbook before launch: profile/index documents keep only the current mutable summary plus required audit references; ledger/evidence objects retain enough noncurrent versions for recovery and audit; debug exports and test objects have a short TTL. Legal hold, public world audit, or contributor-reward evidence retention needs separate approval before becoming indefinite.
- Backup/export is explicit. Schedule Firestore exports and Cloud Storage
inventory or backup jobs to a separate restricted bucket, record restore
drills, and verify restore through
CloudGameStorevalidation rather than accepting provider bytes as authoritative. - Cost controls are launch gates: configure budgets and alerts for Cloud Run requests/egress, Firestore reads/writes/storage, Cloud Storage live and noncurrent object bytes, KMS operations, and Secret Manager access. Add lifecycle rules before enabling object versioning so stale snapshots do not grow without bounds.
- Provider credentials stay bridge-side. Prefer service account identity and Secret Manager references over static keys. If a static credential is unavoidable for a development bridge, record its rotation owner, expiry, allowed environment, and revocation procedure; never put it in manifests, game save records, browser JavaScript, or QEMU transcripts.
Cut scope:
- No direct Firestore/Cloud Storage calls from
adventure-clientoradventure-server.
Sequencing note:
- Cross-device multiplayer through GCP is on the roadmap, but it must wait
until local multiplayer authority, session-bound invocation context, and
stale-write rejection are already correct behind
AdventureSaveStoreandCloudGameStore. The cut is sequencing, not a permanent scope exclusion.
Phase 6: User-Owned Browser Save Vault
Visible outcome: private player data can be exported and imported as signed, encrypted save capsules through a browser using user-granted Google Drive or Firebase authority, without making those blobs authoritative for shared world state.
- Define
UserSaveCapsulewith schema version, capsule version, profile id, device id, content hash, migration policy, record kind/version, previous capsule hash, plaintext hash, ciphertext, AEAD algorithm, signature algorithm, signer public key id, signature, and timestamp. - Define the save-vault key-boundary policy model for local capOS-host key material, GCP game-world Cloud KMS authority, and browser transport authority.
- Use storage-domain encryption keys: local capOS-host key material for
local storage and GCP Cloud KMS envelope encryption for GCP-backed data,
with a per-world or per-shard KMS KEK wrapping service-owned DEKs. The
browser transports ciphertext and provider handles; it must not receive
DEKs,
SymmetricKeycaps,KeySourcecaps, KMS decrypt/unwrap grants, or provider-independent plaintext authority. - Prefer Google Drive
appDataFolderwith the narrowdrive.appdatascope for personal backup files that the user should not edit directly. - Allow Firebase/Firestore user documents only as a transport/cache for
encrypted capsules. Firestore/Firebase rules can bind access to the
authenticated user through an explicit
{request.auth.uid}path template, but cannot validate encrypted game semantics. - Add KMS/IAM design notes for the GCP path: one key ring/key per game-world instance or shard, narrow decrypt authority for the game-world service, key rotation policy, and revocation behavior for retired worlds.
- Add restore validation in
AdventureSaveStore: signature, content hash, schema version, profile id, previous hash, monotonic version, size bounds, and wrong-profile rejection. - Add rollback policy: importing an older private checkpoint may restore an explicit local expedition snapshot, but it must not erase append-only ledger facts, contributor rewards, market receipts, or public multiplayer outcomes.
- Add host tests for tampered ciphertext, wrong signing key, wrong profile, stale version, unknown content hash, oversized capsule, and replayed old capsule.
- Add a web-terminal or browser-companion fixture path with fake Drive and fake Firebase adapters before using real Google APIs.
Cut scope:
- No authoritative public world state from user-owned blobs.
- No direct provider SDKs inside
adventure-server. - No mandatory Google account for local QEMU adventure proof.
- No silent cloud sync; export/import or sync must be visible user action or profile setting.
- No browser-held game-world key capabilities, KMS decrypt/unwrap grants, or provider-independent plaintext authority.
Phase 7: Actors As Capability-Bounded Processes
Visible outcome: important NPCs have process identity and only the capabilities their role needs.
- Keep
adventure-serveras the authority owner until direct NPC mutation needs are explicit. - Add actor content and chat behavior for Centurion Varro, Magister Livia, Acolyte Iunia, Maro the Guild Scout, Wounded Legionary, and Gate Echo.
- Give chat-only NPC processes only
consoleand the narrowest available chat authority. The focused manifest uses selector-freechatgrants; user-selectable or manifest-assigned receiver selectors must not be part of this proof. - For any NPC that can affect world state, add a separate scoped
broker-granted
AdventureNpcfacet or equivalent session-bound service authority. Do not use receiver-selector compatibility grants as NPC mutation authority. - Route NPC offers and refusals through player-visible commands and chat events rather than hidden server side effects.
- Add focused smoke assertions proving each resident chat-only NPC process launches and contributes visible room chat history under session-bound chat membership.
- Add distinct service sessions, chat participant ids, or a scoped
AdventureNpcfacet before requiring every boot-launched NPC process to act as an independently polling chat participant.
Current shape: system-adventure.cue launches the six named actor processes
with only console plus selector-free chat grants. Because boot-launched
actors inherit init’s system session, chat membership intentionally collapses
to the service-scoped caller-session key, and make run-adventure proves each
named actor published visible room history with Centurion Varro as the single
deterministic polling reply. Independent per-NPC chat participants and direct
world-mutation NPC authority remain the open [ ] items above.
Phase 8: Tactical Combat And Mob State
Visible outcome: combat remains deterministic and bounded, but offers more
than repeating attack.
- Add a bounded mob model with hp, armor, ward, attack, morale, traits, intent, and threat level.
- Keep
ward-wraith; add at least two ofimp-scout,ash-ghoul,gate-hound, andecho-centurion. - Implement command-level turns: player action, eligible ally action, hostile action, deterministic transcript.
- Add visible intent when scout or wizard support makes it available.
- Add
retreatand at least one blocked-retreat failure. - Extend
guardto protect an ally when one is present. - Add QEMU assertions for one intent line, one ally-related combat action, and one deterministic hostile response.
Cut scope:
- No random combat outcomes until seeded mission variants land.
- No hidden dice rolls that make QEMU transcript assertions fragile.
Follow-up combat architecture, grounded by Game Mechanics Prior Art. Most of this is landed (deterministic target-zone damage, fatigue, interrupt, recognition disclosure, stealth openings, alert-source generalization, construction-fed weapon/focus/cloak combat, sustained-magic fatigue refusal, and scenario coverage); the open forward items are:
- Use Evil Islands as planning input for tactical fight shape (targeted body zones, damage-type/armor matchups, stealth openings, visibility-dependent recognition, fatigue/retreat pressure, cast interruption, equipment-derived effects). Not a clone target; Aurelian keeps command-level turns, capability-gated authority, deterministic smoke coverage, and service-owned outcomes.
- Move mob combat definitions out of hard-coded
adventure-servertemplates into validated generated content once the next combat slice needs more than the current generated profile fields (damage affinities, zone armor, alert groups, recognition thresholds, stealth-opening permissions, cast-interrupt vulnerability). - Extend
adventure-contentpure logic before server integration:CombatZone,DamageKind,CombatAttackProfile,MobCombatProfile, deterministic target-zone damage, fatigue cost, interrupt outcome, recognition level, and alert propagation helpers. - Extend CUE content and
tools/adventure-content-genbeyond the current generated mob combat profiles when alert groups, stealth openings, or richer profile references land. - Add typed Adventure surface only where the existing text target cannot stay unambiguous (e.g. structured target/zone/weapon fields for the browser client); current explicit-zone parsing already covers the proof commands.
- Update
AdventureRoomView/status output for inspected vs rough mob intel. - Keep
adventure-serveras the authoritative combat state owner. Durable alert state, broader limb persistence, and pending multi-turn interruption remain future work. - Add targeted attacks with a small fixed zone set:
head,hands,legs,core, with deterministic zone effects. - Add damage-type and mitigation metadata for weapons, spells, armor, ward state, and zone armor, with explicit result text.
- Make enemy recognition depend on scout/wizard support, distance, direct inspection, and prior codex evidence.
- Add height and route-position inputs to enemy recognition once room topology and browser-client world positioning expose those facts as structured state rather than server-local command context.
- Add stealth-opening support for ambush/backstab-style advantages.
- Add a bounded pull/alert behavior for the ward-wraith to gate-hound path.
- Add a bounded imp-scout warning path.
- Generalize alert-source resolution across ward-wraith alarm, imp-scout warning, and escaping-scout paths.
- Add bounded failed-stealth gameplay integration for route-supported imp-scout attacks lacking scout-track evidence.
- Add bounded noisy-movement gameplay for recovered relic movement.
- Add broader noisy-movement integration beyond the current relic movement, ward-wraith, and imp-scout paths.
- Tie combat output to equipment construction inputs from Phase 11c:
weapon/shield/focus/cloak object type, material, facility quality,
warrior stars, wizard circles, and remaining enchantment budget affect
bounded damage, guard, fatigue, interruption, and resistances. Bounded
slices for
shield-wallcloak,bronze-gladiusweapon, andember-dartfocus have landed; broader equipment handling, construction jobs, and durable runtime inventory semantics remain open. - Add a bounded sustained-magic fatigue refusal for the
shield-bindpath. - Generalize explicit fatigue and cast-interruption rules for heavy equipment, running, retreat, additional sustained magic, and monster fatigue, creating meaningful retreat/guard choices rather than hidden penalties, and without unfair infinite-fatigue monster behavior.
- Add QEMU scenario coverage through
adventure-scenario-testfor inspected targeted attack, damage/armor explanation, stealth/scout opening, alert/pull response, cast interruption/fatigue refusal, and retreat/blocked-retreat. - Keep rewards mission-audited. Do not add enemy grinding as a rank, warrior-star, wizard-circle, or faction-standing source.
Phase 9: Skills, Spells, Ranks, And Reputation
Visible outcome: player competence affects available actions and future grants without becoming a grind.
- Model player rank labels:
tiro,signifer,centurion, andlegate. - Keep warrior stars and wizard circles visible in status, but make them policy inputs for brokered authorities.
- Add missing skills from the proposal as needed by the first mission:
shield-wall,counter,rally, or narrowed equivalents. - Add missing spells as needed by the first mission:
mend-woundandstabilize-gatebefore higher-circle spells. - Add explicit failure text when rank, stars, or circles block an action.
- Add debrief outcomes that update rank marks, faction standing, and evidence records from auditable mission facts.
- Add QEMU assertions for one rank/circle denial and one debrief reward.
Deferred:
dome-shield,demon-brand, and high-circle gate rewriting are later campaign scope unless a focused proof needs them.rallyremains explicitly reserved for later centurion command authority.
Phase 10: Market And Logistics
Visible outcome: the shopkeeper becomes a small capability-shaped economy proof instead of flavor chat.
- Add typed verbs for
quote,buy,sell,trade, andrepairbefore accepting them as implemented gameplay. - Define bounded market roles: quartermaster, guild scout, temple annex, and field engineer.
- Implement one deterministic route purchase or favor exchange with Maro.
- Implement one authority-gated refusal, such as focus equipment requiring wizard circle 1 or temple certification requiring clean custody.
- Define trade/custody transfer as a service-mediated transaction protocol, not two save-file edits: reserve or escrow both sides, commit or release with idempotency keys, reject stale versions, record one ordered ledger receipt, and specify cancellation, retry, and crash-recovery behavior.
- Ensure prices and blocked authority are named in failure text.
- Add QEMU assertions for one quote, one successful exchange, and one rejected trade explaining the gate.
Planning input, grounded by Game Mechanics Prior Art: use external game-mechanics research as planning input only, not a clone target. Stardew Valley is useful for calendar pressure, seasonal resource tables, festivals, routine changes, quests, gifts, affection, and season-bound crops. EVE Online is useful for regional markets, market-eligible item classes, brokered buy/sell orders, immediate matching, and blueprint/material/facility manufacturing constraints. Evil Islands is useful for equipment construction and the targeted combat model. The capOS translation turns these stable mechanics into the capability-shaped tasks in Phases 11-12: seasonal cycles, regional settlements/outposts, service-owned order books, blueprint/artifact construction, targeted deterministic combat, token-budgeted agent NPCs, and a rich tilemap client.
Phase 11: Seeded Variation
Visible outcome: repeated runs vary content meaningfully across normal play, while the smoke transcript stays reproducible under a fixed seed.
Current shape: live adventure player state is keyed by endpoint caller-session scoped refs, and generated mission content carries fixed smoke seed/variant metadata printed in status and asserted by the scenario cap-call path. This is the deterministic seed-metadata foundation only; seeded gameplay variation, production per-run seeds, festivals, NPC routines, and full seasonal economy behavior remain open.
- Add generated mission content fields for a fixed smoke seed label and selected variant metadata.
- Add manifest or mission setup field for a fixed mission seed and a separate per-run seed for production play.
- Print seed and selected variant metadata in transcript/debug mode.
- Seed mob placement, optional hazards, shop inventory, rumor lines, loot cache locations, debrief complications, and ambient encounter timing.
- Seed seasonal state for normal play: season, day, weather/hazard class, seasonal resources, festival/event hooks, and NPC routine variants. The deterministic smoke seed forces a stable generated calendar state, but normal-play seed selection remains open.
- Keep season-sensitive resource tables bounded: crops, forage, fish, shop stock, route hazards, and outpost production all have explicit per-site caps and stable sorted output under a fixed seed.
- Keep combat outcomes reproducible under a fixed seed; production play may add bounded variance per turn as long as the smoke seed reproduces the recorded transcript.
- Add scenario assertion for seed and variant metadata through real
Adventurecap calls.
Phase 11a: Calendar, Seasons, And Resource Cycles
Visible outcome: the frontier feels alive across repeated sessions without making proof transcripts nondeterministic.
- Add an
AdventureCalendarmodel with four 28-day seasons as the initial default, explicit day advancement rules, and debug output for the fixed smoke seed. - Attach bounded fixed-smoke seasonal availability primitives to generated content for crops, forage, fish, shop inventory, route hazards, and repair/material production. Multi-season resources must be declared explicitly.
- Apply seasonal availability to gameplay systems. Bounded slices have
landed: quartermaster
field-rationsquotes read the fixed-smoke seasonal shop-stock table;Adventure.statusforecasts carried seasonal crops expiring and fish/forage degrading at the next season change; theseason-transitionask path applies the next-season transition to actual player inventory (crops expire, fish/forage become-degradedtokens); and thefield-rationsbuy path spends audited Aurelian standing and records per-expedition seasonal stock usage. Broader season advancement, economy, persistence, market orders, seeded normal-play calendars, and automatic world mutation remain open. - Add festival and military-event records that can temporarily expose actor-location, shop, witness, route, and rumor metadata. This is metadata/status only; actual gameplay mutation remains open.
- Give named actors bounded routine variants by season, festival, mission beat, and local emergency, visible as structured actor presence/state. This is metadata/status selection only; it does not move actors or mutate authority.
- Add simple quest/gift/affection hooks only after profile and ledger facts can record them. Daily interactions and gifts should affect actor standing through auditable records, not client-owned counters.
- Add pure Rust unit tests for calendar rollover, season/day bounds, seasonal resource eligibility, multi-season exceptions, and stable fixed-seed ordering.
- Add pure Rust unit tests for festival scheduling once festival records exist.
Phase 11b: Regional Settlements, Outposts, And Trade Routes
Visible outcome: Aurelian is one settlement in a wider frontier economy with multiple cities, outposts, production sites, and routes.
- Model more than one settlement:
fort_aurelianremains the proof settlement, while later content can add at least one civilian city, one temple-administered site, one guild waystation, and multiple resource outposts. - Define outpost roles such as mine, farm, timber camp, shrine, gate-yard, salvage yard, and repair yard. Each role produces bounded resources, consumes supplies, exposes route risks, and may require specific writs.
- Add region and route metadata: distance, hazard, faction control, route authority, cargo limits, seasonal closure, and known-safe/unknown states.
- Extend markets from actor-local deterministic handlers toward a service-owned regional market with market-eligible item classes, brokered buy orders, sell orders, price/time priority, immediate matching when price crosses, expiry, fees, and ordered ledger receipts.
- Add the bounded generated-content order-book foundation for regional markets: market book id/location/settlement, buy/sell side, item id, price, quantity, expiry day/duration, fee, owner actor/faction/outpost, receipt ledger id, pure validation, and deterministic non-mutating price-cross matching.
- Add the first bounded service-mediated transaction proof on top of the generated regional order books: reserve one crossed match, commit or release it with idempotency keys, reject stale versions, record ordered receipt facts, and keep the server as the owner of live transaction state.
- Route real player, NPC, and outpost inventory/currency transfers through
the Phase 10 service-mediated transaction protocol. The current proof is
one regional market match: on fresh commit it debits player-local Aurelian
chits once, decrements seller
ash_farmfield-rationstock once, accrues service-owned regional market fees once, credits service-ownedash_farmseller proceeds once, and delivers the committed quantity into the player inventory only when ordinary capacity can accept it. It does not yet move NPC stores, broader outpost inventories, durable currency/proceeds ledgers, profile ledger balances, or durable save records. - Add broader scenario coverage for crash-recovery state, receipt replay after restart, multi-client settlement, and player/NPC/outpost transfer effects. The current scenario path covers quote, reserve, idempotent retry, commit replay, stale-version rejection, no-cross partial release, explicit cancellation/release, fee withdrawal, bounded receipt-snapshot restore, and a bounded settlement side-effect snapshot-view replay. These are bounded recovery proofs, not durable persistence or a restart harness.
Phase 11c: Blueprint And Artifact Construction
Visible outcome: equipment and artifacts become authored constructions with traceable materials, skills, facilities, and enchantment limits.
- Add blueprint records for craftable equipment, repair jobs, gate parts, relic containers, focus items, and lawful wards. Blueprints name required materials, facility class, skill/rank/circle gates, expected duration, cost, and output bounds.
- Keep the first construction job proof service-mediated. The field-engineer gate repair job reserves materials at a generated facility, validates blueprint/facility/rank constraints, records ordered job facts, and either completes or releases the reservation. Currency escrow, job-time advancement, output inventory, and general crafting remain future work.
- Add deterministic property-derivation primitives as a bounded result of base blueprint, material, facility quality, and paid cost. Full crafting job integration remains open.
- Add artifact construction metadata for rare pieces whose authority matters: witness-sealed relic cases, warded cloaks, focus rings, route compasses, golem cores, and gate-stabilizer parts.
- Add enchantment slot metadata and validation bounds. The constrained post-process gameplay remains open until construction jobs exist.
- Add pure Rust unit tests for blueprint validation, material/property derivation, enchantment slot limits, facility/rank/circle gates, and missing or retired authority references.
- Add service-side material reservation and stale construction job rejection for the bounded field-repair proof. The server owns per-session construction material stock, mutates holds/restores only for fresh outcomes, and keeps stale/version and idempotent replay behavior in the pure job-state model. Durable stock ledgers and broad crafting remain future work.
Phase 11d: Token-Budgeted Agent NPCs
Visible outcome: optional agent-controlled NPCs can feel reactive while staying bounded, auditable, and outside transcript-critical authority.
- Use
docs/proposals/llm-and-agent-proposal.md,docs/proposals/hosted-agent-swarm-proposal.md,docs/proposals/capos-repo-harness-engineering-proposal.md, anddocs/research/hosted-agent-harnesses.mdas grounding before any implementation. - Treat model output as dialogue or proposed action data. Mission-critical authority, custody, combat, market commits, rank rewards, and policy denials stay in deterministic services.
- Add an
NpcAgentBudgetor equivalent service-owned quota: per actor, session, day, and model profile; input/output token limits; tool-call limits; cooldown; and exhaustion behavior. - Let NPCs spend quota on bounded chatter, optional hints, and outpost status summaries. Spending must be visible in logs/debug output for review.
- Extend token-budgeted NPCs to personal routines, shop negotiation flavor, and festival reactions as fake-agent dialogue/proposed-action data only.
- On quota exhaustion, fatigue, sleep schedule, or policy denial, the NPC
should refuse in-world, for example:
I'm tired. Going to sleep.The refusal must not be a hidden transport error. - Keep hosted-agent memory separate from authority. Long-lived NPC memory can record bounded facts and reflections, but only reviewed/compiled facts influence deterministic game services.
- Add tests with a deterministic fake model that proves quota decrement, quota exhaustion refusal, no authority mutation from free text, and stable transcript output when agent NPCs are disabled.
Current shape: agent NPC budget metadata is disabled-by-default for Iunia, Livia, and Maro; a deterministic fake-model turn function drives bounded chatter/hints/outpost-summaries plus routine/shop/festival flavor, decrements quota, and refuses in-world for quota/fatigue/sleep/cooldown/policy blocks. Live LLM calls, hosted-agent service execution, durable memory service, autonomous NPC actions, and any transcript-critical model gameplay remain open.
Phase 12: Multiplayer, Parties, And Lawful PvP
Visible outcome: shared multiplayer authority works correctly across local multi-client play first, with cross-device play following on the same service-mediated boundaries. Players can party up, delegate scoped authority, assist each other, and engage in lawful PvP without leaking private inventory or allowing ambient harm.
- Do not start this phase until Adventure and chat authority use session-bound caller identity, or future broker-granted service facets, rather than receiver-selector identity. The first bounded slices key local player labels from live caller-session metadata.
- Add
Expeditionor equivalent shared state only when the singleAdventureservice cannot cleanly model party authority. The first bounded slice keeps deterministic party state insideAdventurebecause no cross-service coordinator is needed for local create/invite/accept/ leave/delegate/assist records. - Add party verbs: create, invite, accept, leave, and delegate. Implemented
for service-created local labels (e.g.
player-1) derived from live caller-session keys, not caller-selected badges or global session data. - Add
assist <player> with <task>for deterministic cooperative action. Implemented for the firstdetect-wardassist record, requiring party membership plus delegatedward-writ; it records scoped service-owned state and grants no unrelated inventory authority. - Route first player-to-player physical-item transfer through a
single-owner atomic mutation path inside
Adventure. Implemented astransfer <item> to <player>for service-local player labels: both players must be in the same party,eagle-standardrelic custody is refused, and source/target inventories mutate atomically. - Add currency escrow and broader two-party trade/custody transfer protocol only after the economy model and multi-client proof harness justify it; user-owned backup capsules must not be transfer authority.
- Add a two-client QEMU proof with two service-created player objects, one shared party, one delegated writ, and one assist. The proof must use two distinct live caller-session keys for Adventure cap calls, not manifest receiver selectors or user-chosen identity. Still open: the focused Adventure manifest does not yet provide a reliable two-client launcher/session harness, so real cap-client assertions stay at one-client party surface coverage and complex transitions are covered in pure Rust.
- Keep PvP opt-in: duel, spar, contest, or bounty authority must exist before harmful verbs can target another player.
- Add denial text for unauthorized player harm that names the missing
lawful conflict authority.
attack <player-label>refuses known local player labels with text naming the missing duel/contested-yard authority. Duel/spar/contest/bounty authority remains future work.
Sequencing note:
- The first slice is local multi-client (two clients on one capOS instance)
because that is the cheapest deterministic proof. Cross-device
multiplayer is on-roadmap and lands once the local authority model is
correct and
CloudGameStorecarries shared expedition/ledger state. - Network-transparent multiplayer (full federation across capOS instances) stays separate from this phase and follows the broader networking work.
Future Phase: Parallel Universes And Worldline Federation
Visible outcome: separate capOS-hosted Aurelian worlds can expose alternate seeded worldlines and limited cross-world interaction without making remote instances trusted authorities for local inventory, relic custody, profile standing, or market settlement.
This is deliberately after local multiplayer, durable ledger/profile state, service-owned market/escrow, and basic networking. The near-term shape is not a single shared MMO world. It is a federation of sovereign worldlines, each with its own content release, worldline id, seed epoch, generated overlays, ledger head, market policy, and profile-import rules.
- Add a
WorldlineSeedmodel whose outputs are deterministic artifacts: generated regional overlays, seasonal economy tables, event schedules, market starts, outpost production, route hazards, optional encounters, loot caches, and bounded NPC routine variants. - Keep authored anchors static: factions, core law, major sites, named relics, capability interfaces, canonical proof missions, and security policy. Seeded generation may vary conditions around those anchors, but must not mint new authority classes or bypass service-owned validation.
- Store provenance for every admitted generated artifact: content release id, worldline id, seed epoch, generator version, scope label, provenance hash, and bounded output size.
- Add pure deterministic generator tests before gameplay integration: same seed produces the same artifacts, different seeds produce bounded variation, invalid generated references are rejected, and generated outputs remain sorted/stable for proof transcripts.
- Add fixed-seed QEMU proof once the generator exists. The smoke path should still use pinned selections until generator coverage is strong enough.
- Define
WorldlineDirectory,WorldlineVisit,WorldlineExpedition,WorldlineTransfer, andWorldlineAuditservice surfaces as local facade caps over remote protocol messages. Do not serialize raw cap slots, endpoint generations, global session ids, or local player labels as portable authority. - Start with echo-only federation: list a remote or second local worldline, inspect content/seed/ledger metadata, and view public state without mutation authority.
- Add a denial proof for cross-world relic transfer before implementing
successful transfer:
eagle-standardtransfer must fail until custody escrow, remote policy, dual-ledger receipts, content compatibility, and replay protection exist. - Later, add envoy visits and expedition bridges. Projected remote characters may observe, chat, or perform explicitly granted low-risk actions; spending home-world inventory or importing rewards requires a transfer/settlement receipt.
- Treat cross-world markets and migration as receipt-verified claims. Remote order views, faction standing, rank, contributor rewards, and custody history require local policy gates before they affect local authority.
Feasibility note: this is feasible if it is built as capability federation plus deterministic worldline generation. It is not feasible as “trust another capOS instance’s save file” or “transfer local caps over the network.” Cross-world state changes need the same reserve/escrow, commit/release, stale-version rejection, idempotency, and ledger receipt discipline already planned for local markets and trades.
Phase 13: Contributor Quest Mechanics
Visible outcome: after the base Aurelian game has stable profiles, evidence, debriefs, and cosmetic rewards, the game can recognize real capOS development work through maintainer-witnessed outer-world quests.
- Use
docs/proposals/contributor-quest-mechanics-proposal.mdas the design source for this phase. - Keep all rewards cosmetic, narrative, reputational, or bounded game-only perks unless a separate reviewed security design grants authority.
- Use full GitHub issue and PR URLs, commit hashes, issuer identity, and timestamps in contribution evidence records.
- Add manual quest and witness records before any read-only forge connector.
- Add QEMU proof that witnessed contribution evidence mints a badge or decoration, while an unwitnessed claim does not.
Cut scope:
- No automatic GitHub mutation, no token handling in the game client, no public leaderboard that pressures maintainers or security reviewers, and no reward that grants repository or OS authority.
Phase 14: Rich Browser Adventure Client
Visible outcome: after WebShellGateway, session-bound game authority, profiles, persistence, and the core game loop are stable, a browser-hosted adventure client presents the same game as a pixel-art interface with animated characters, location art, inventory panels, combat affordances, and chat/event feeds. The browser client should feel like a native game client, not a terminal skin.
- Treat
adventure-clientas the text/QEMU proof client and compatibility adapter. Do not route the rich browser UI throughStdIOcommand lines. - Implement the web shell or WebShellGateway side as a capability-call proxy for the authenticated session. The gateway holds the real capOS caps and invokes only the allowed adventure/chat methods for that web session.
- Keep the browser authority opaque: browser JavaScript receives web-session
handles and typed DTOs, never raw capOS
CapIds, badge selectors, provider credentials, shell spawn authority, game-world keys, or broad network capability. - Prefer narrow game-session objects such as
AdventurePlayerandChatParticipantwith methods forlook, movement, inventory, status, combat actions, orders, delegation, chat send/history, and bounded event polling. A genericCommandSessionmay coexist for terminal-style front ends, but it is not the required ABI for a purpose-built game UI. - Return structured view state and events suitable for rendering: current site, exits, actors, mobs, visible items, held/delegated authority, evidence, effects, party state, combat state, animation/event cues, and chat history cursors.
- Represent the world as a 2D tilemap data model for browser presentation: maps, tilesets, tile layers, object layers, collision/interaction zones, spawn points, actor paths, region/outpost markers, and event triggers. Tiled JSON is an acceptable authoring/export candidate if validation rejects oversized maps, missing tiles, unknown layer types, and invalid object references.
- Evaluate PixiJS plus
@pixi/tilemapfor the first rich client because it gives a WebGL-oriented 2D renderer and rectangular tilemap path with a canvas fallback. This is a client rendering choice, not game authority. - Keep all semantic validation in game services. Browser-side disabled buttons, command palettes, targeting hints, and animations are presentation only; the server still rejects missing authority, invalid location, stale state, bad custody, unsafe combat, and oversized input.
- Use explicit asset manifests for pixel art, sprite sheets, portraits, tiles, VFX, UI sounds, and animation ids. Asset lookup must not grant game authority, and missing or mismatched assets must fail as presentation errors rather than game-state mutations.
- Add a headless browser harness that authenticates through the web gateway, opens the rich client, drives one deterministic mission slice using UI actions, verifies rendered state transitions/events, and checks logout or tab-close teardown.
- Add browser rendering checks for tilemap layer order, actor placement, viewport/camera bounds, collision affordance display, event feed updates, and no browser-side mutation of authoritative adventure state.
Blocked by:
- WebShellGateway authentication, origin/TLS policy, session teardown, and bounded browser transport.
- Broker-granted Adventure/chat authority or gateway-owned live caller-session mapping so web sessions do not depend on caller-selected receiver identity.
- Persistent profile/ledger/checkpoint semantics for save/resume UX.
- Stable core gameplay phases through at least authority inventory, relic custody, actor roles, combat, and debrief rewards.
Cut scope:
- Do not make browser-rendered state authoritative.
- Do not let browser UI bypass game service methods or mutate save records directly.
- Do not require the rich browser client for QEMU proof coverage; the text client remains the deterministic low-dependency proof path.
Service Split Gates
Keep one adventure-server until there is a concrete proof value in splitting
state. Split services only at these gates:
- Mission service: when multiple clients or NPCs need shared expedition state independent of private player profiles.
- Profile service: when rank marks, cosmetics, contributor badges, or settings must persist beyond one process lifetime.
- Audit/Witness service: when relic custody, forbidden rites, and debrief evidence need a separate authority boundary.
- Save store: when profile, ledger, or expedition state needs a shared adapter over RAM, local disk, or cloud backing.
- Cloud bridge: only after local save/load semantics and stale-write rejection
are proved behind
AdventureSaveStore. - User-owned save vault: when private profile/export data should sync through a user’s browser or Google account without granting provider credentials to game services.
- Market/Trade service: when two-party exchange or shop inventory becomes more than a deterministic local handler.
- Expedition service: when parties, assists, duels, or contested sites need shared state and explicit consent capabilities.
Every new service split must include manifest grants and QEMU assertions for both allowed behavior and at least one rejected overbroad action.
Verification Gates
For each phase that changes behavior:
-
make fmt-check -
make generated-code-checkwhen schema or generated bindings change. - Generated content freshness check when mission source data or content generation changes.
- Relevant host tests for content validation or pure logic.
- Prefer pure Rust unit tests for complex deterministic game logic: calendar/season rules, resource tables, blueprint validation, market matching, escrow state machines, route constraints, and agent quota accounting.
- Use a real Rust test client process calling game caps for complex scenario tests that cross service boundaries: custody, construction, market transactions, party assists, and regional economy flows.
- Keep the current command-client transcript focused on basic command and client functionality: parsing, rendering, representative success/failure calls, and stable QEMU smoke proof. Do not make it the only coverage for complex game state machines.
- Save-record encode/decode and migration tests when profile, expedition, ledger, or cloud-bridge persistence changes.
- User-save capsule tamper/replay/wrong-profile tests when browser-mediated backup or restore changes.
-
make run-adventurewith deterministic transcript assertions for the new behavior.
For content-only changes:
- Content validator tests must pass.
- Generated content freshness check must pass when content blobs are checked in.
- If the content family uses
mkmanifest cue-to-capnp, rerun the conversion with pinnedCAPOS_CUEandCAPOS_CAPNP, decode or validate the produced Cap’n Proto message, and include the freshness check inmake generated-code-check. -
make run-adventuremust still prove the visible mission path.
Do not claim the full adventure proposal is implemented until the Aurelian mission, authority inventory, actor roles, relic custody, debrief, and deterministic proof path all land.
Full-Scope Review 2026-06-09
Findings ledger for the full-scope review cycle completed at
2026-06-09 19:01 UTC. Eight
independent subsystem reviews covered the tree at commit 50e8eaba
(2026-06-09) against the previous review base bb776326e (2026-05-23). Each
open finding below is remediated through a task record under docs/tasks/
whose source points here; severities are carried into task priority.
Documentation-status findings (stale status wording, landed-behavior drift)
were remediated directly in commit 3ac860dc and are not re-listed.
Scopes Reviewed
- Storage on-disk formats and mount validation (kernel storage caps,
tools/mkstore-image). - Storage services and installable-system flow (init generation/rollback,
storage-persist-service, NVMe-backedBlockDevice). - Kernel core and x86_64 architecture (fault handlers, TLB shootdown, ELF spawn, percpu/SMP/paging/IOAPIC/ISO reader).
- Device Driver Foundation authority (MMIO bounds, DMA-buffer release invariants, device-manager proof gating).
- Remote-session Web UI and network-facing services.
- Schema, generated bindings, and System Manual.
- Userspace runtime and POSIX adapter (
capos-rt,libcapos-posix). - Fuzzing, host-test harnesses, tooling, and CI workflows.
Findings By Scope
1. Storage on-disk formats and mount validation
- High —
kernel/src/cap/persistent_store.rs:parse_disk_store,kernel/src/cap/writable_fs.rs:mount_volume: live extents are validated only against the data region, not againstnext_free_sectoror each other. A crafted or torn image with a live extent in the bump-allocator free region mounts cleanly and is silently overwritten by the nextput_blob/persist_file;compact_reclaim’s shadow-generation copy into the data-region tail clobbers such extents mid-copy; overlapping live extents are accepted. - Low —
writable_fs.rs:mount_volume(alsoreadonly_fs,persistent_store): duplicate sibling names are silently collapsed byBTreeMapinsert instead of failing the mount. - Low —
tools/mkstore-image:write_caposwf1_dir_node/write_caposwf1_file_node: name-length assertion usesWF_NODE_RECORD_BYTES - WF_NODE_OFF_NAME(104) instead of the kernel’sMAX_DISK_NAME_BYTES(88). - Medium —
persistent_store.rs:DiskStoreCap::get_blob: returns disk bytes trusting the entry table without re-verifyingcontent_hash(bytes) == key; init fetches generation objects by hash from this store, so a disk-level edit swaps active system-config content undetected.
2. Storage services and installable-system flow
- Medium —
demos/storage-persist-service/src/bin/server.rs:commit: overwrites the single payload region in place before the superblock write; a crash mid-payload-write destroys the previously committed snapshot and wedges startup. The doc comment overclaims torn-write safety; this service is the named production storage route. - Medium —
init/src/main.rs:read_candidate_pointer/decide_boot_generation: a corrupt or truncatedgen-candidatemarker parses toErrand fails boot closed (the CREATE|TRUNCATE marker rewrite persists a durable size-0 window), contradicting the “a bad generation can never permanently brick the system” guarantee. - Medium —
kernel/src/cap/block_device.rs:NVME_ARBITRARY_NAMESPACE_BLOCKS: hardcodes the 16 MiB QEMU fixture geometry (32768 blocks) on the always-built NVMe arm;BlockDevice.infoand the filesystem/storeBlockSource::inforepeat it, so larger real namespaces are unreachable.kernel/src/nvme_storage_backend.rs“production” wording omits the bounded sync-io seam (64 ops/boot, wedges on CQ wrap).
3. Kernel core and x86_64 architecture
- High —
kernel/src/arch/x86_64/idt.rs:page_fault_handler/gp_fault_handler/invalid_opcode_handler: CPL3 faults halt the whole machine; the “no task abstraction yet” rationale is stale now thatsched::exit_current_threadand process exit cleanup exist. Any userspace null deref is a full-system denial of service. - Medium —
kernel/src/arch/x86_64/tlb.rs:kernel_tlb_shootdown_all: the remote ack uses a CR3-reload flush, which under CR4.PGE does not evict GLOBAL entries — the very kernel upper-half/MMIO mappings it exists for. Safe for the sole current caller (fresh non-present→present installs), buttlb.rsandmem/paging.rsadvertise unmap/revoke reuse. - Medium —
kernel/src/spawn.rsPT_LOAD mapping:PF_W|PF_Xsegments map PRESENT|USER|WRITABLE without NX;capos_lib::elfdoes not reject W+X. - Low (bundle) —
percpu.rs:current_cpu_idunwrap_or(0)masquerades unknown LAPIC ids as the BSP;smp.rsAP_CPUSspin-mutex IF constraint undocumented;mem/paging.rs:map_kernel_physical_rangepartial failure leaks installed PTEs and the VA window;ioapic.rs:write_destinationrestores mask from the cached record, not hardware;mem/validate.rslegacyvalidate_user_bufferis dead code;kernel/src/iso/mod.rsISO_BOOT_SOURCEmutex held across a full polled-PIO ELF transfer;capos-rt/src/panic.rsemergency console write can race a live SQ producer.
4. Device Driver Foundation authority
- Medium —
kernel/src/virtio_transport.rs:MmioRegion: volatile accessor bounds aredebug_assert!-only and the kernel ships release, so the documented “range-checks before reaching device MMIO” contract is false in shipped builds; some regions claim the full BAR length while only aMAPPED_COMMON_CFG_LIMITprefix is mapped. - Medium —
kernel/src/device_manager/stub.rs:detach_dmabuffer_record_for_cap_release_with_reason: the pinned-enabled-vring refusal, RX-DMA quarantine, and autonomous-MSI-X/NVMe handoff blocks live in per-proofcfgislands, while the invariant — never free a frame the device may still master — is production DMA-lifetime behavior.
5. Remote-session Web UI
- Medium —
demos/remote-session-web-ui/src/main.rs:do_login: no login rate limiting and noaccepted.peer_addrcheck; loopback-only is enforced solely by topology plus forgeable Host/Origin headers, weaker than the host bridge sibling./api/probe/expireand/api/probe/stale-callproof seams ship unconditionally in the production-named binary.
6. Schema, generated bindings, and System Manual
- Medium —
schema/capos.capnpSymmetricKey..CertVerifierblock uses leading-style doc comments; capnp attaches docs to the preceding declaration, so every comment shifts one method in the checked-in bindings and the System Manual ships misattributed descriptions. Themanualccoverage gate is interface-level only, so it passes.
7. Userspace runtime and POSIX adapter
- Medium —
capos-rt/src/ring.rs:pack_copy_transfers: computesparams_offsetfrom theVec’sas_ptrbeforeinto_boxed_slicemay realloc, invalidating the computed alignment; currently saved only by undocumented allocator behavior, and the existing alignment test passes vacuously under the 16-aligned host allocator. - Medium (bundle) —
libcapos-posix:dup/dup2/F_DUPFDsnapshotposper slot instead of sharing the open-file-description offset (src/fd.rs);poll/selectignore the timeout entirely so infinite timeout returns 0 and callers busy-spin (src/poll.rs);errno::clear()on shim entry violates C11 §7.5;F_SETFLaccepts-and-ignoresO_NONBLOCKthenreadblocks forever. No#[cfg(test)]host unit tests exist in the crate.
8. Fuzzing, harnesses, tooling, and CI
- Medium —
fuzz/fuzz_targets/manifest_capnp.rsfuzzes a 4096-word/16-deep envelope while productiondefault_reader_optionsallows 64 Mi words and nesting 32; the ISO 9660 record/PVD parser, the CAPOSRO1/CAPOSST1/CAPOSWF1 mount parsers, thecapos-tlsDER validity walk (capos-tls/src/cert.rs:parse_validity), andstorage-persist-service:deserialize_state/parse_superblockhave no fuzz or host coverage. - Low (bundle) — CI Miri step soft-skips when the component is missing
(
.github/workflows/ci.yml);publish-crates.ymlcargo publish --no-verifyis uncommented; thesqe_validationfuzzPARK_BENCHarm is permanently reject-only without a measure-feature fuzz build;capos-wasm/src/wasi/fs.rs:install_preopendiscards itstry_reserve_exactresult.
Spawned Task Records
All records carry source: docs/backlog/full-scope-review-2026-06-09.md:
review-storage-mount-extent-placement-validation(high)review-storage-store-get-hash-verificationreview-storage-persist-service-crash-safe-commitreview-installable-torn-candidate-fallbackreview-storage-nvme-identify-geometryreview-kernel-user-fault-containment(high)review-kernel-tlb-global-shootdown-ackreview-spawn-wx-segment-rejectionreview-ddf-mmio-region-release-boundsreview-ddf-dmabuffer-detach-invariant-hoistreview-webui-inguest-login-hardeningreview-schema-crypto-doc-attributionreview-fuzz-parser-coveragereview-capos-rt-transfer-pack-alignmentreview-posix-fd-semanticsreview-kernel-arch-hardening-lows(low bundle)review-tooling-ci-lows(low bundle)
Proposal Index
This page classifies proposal documents by current role so readers do not confuse implemented behavior, active design direction, future architecture, and rejected alternatives.
The sidebar nests long proposal documents under this index so the public site opens as a current-system manual instead of an archive dump. Use this table as the first status checkpoint before opening a long proposal.
Current design authority lives in Current Design Authority. Proposal files are design history or active design records; when a proposal is implemented, future technical changes should update the stable current-design page first.
Lifecycle classes used below:
- Implemented: shipped behavior; proposal is archival unless the status link or historical note is being corrected.
- Accepted design: selected direction; implemented subsets need a stable current-design home.
- Partially implemented: some behavior is in tree; future/planned text must remain explicit.
- Active design: unimplemented or near-term design record still available for planning. Older rows that say “Future design” are active design records with no current implementation unless the row says otherwise.
- Superseded or Rejected: retained historical rationale, not current direction.
Promoted Current Design
| Proposal or decision | Stable current-design authority | Disposition |
|---|---|---|
| Session-Bound Invocation Context | Session Context and IPC and Endpoints | Implemented proposal is archival. |
| Error Handling | Error Handling and Capability Ring | Implemented proposal is archival. |
| System Configuration | Configuration and Manifest and Service Startup | Implemented proposal is archival. |
| DMA Assurance Model | DMA Isolation | Accepted design is grounded in the stable DMA design page. |
Active or Near-Term
| Proposal | Status | Purpose |
|---|---|---|
| Service Architecture | Partially implemented | Defines authority-at-spawn, service composition, exported capabilities, and the init-owned service graph direction. |
| Schema Registry | Future design | Active design record for runtime schema reflection as the machine-readable twin of the System Manual; no implementation yet. |
| Session Archive & Gantt Effort | Future design | Active design record for session recap and planning-timeline effort records; retained as workflow design, not system behavior. |
| Task State and Agent Telemetry | Partially implemented | File-per-task ledger, selected-milestone state, lifecycle directories, and the tools/vibe-loop-capos-tasks adapter are implemented; generated checked-in views and tracker sync remain future. |
| Session-Bound Invocation Context | Implemented | Archival record for replacing caller-selected endpoint identity and the superseded service-object migration with one immutable session context per process. Current design authority is Session Context. |
| Storage and Naming | Accepted design | Defines capability-native storage, namespaces, boot-package structure, and future persistence instead of a global filesystem. |
| Error Handling | Implemented | Archival record for the implemented transport/capability-exception/schema-result split. Current design authority is Error Handling. |
| Security and Verification | Partially implemented | Defines the security review vocabulary, trust-boundary checklist, and practical verification tracks used by capOS. |
| DMA Assurance Model | Accepted design | Defines the DMA authority model, invariants, and TLA+/Alloy/Kani/Loom evidence mapping that cloud and production driver backend claims must use before attended sign-off. |
| Device Manager Refactor | Implemented | Separates the kernel device authority ledger from QEMU proof scaffolding while preserving one MMIO/DMA/IRQ ownership transaction for userspace-driver readiness; further registry, ledger, or proof-internal splits are optional risk-reduction follow-ups. |
| Cloud Driver Foundation Gap Analysis | Superseded | Retained as a DDF coverage map; the central blocked virtio-net driver gap it tracked is closed and successor work lives in Phase C userspace NIC relocation and NVMe BlockDevice graduation records. |
| NVMe Model B Doorbell DMA Validator | Accepted design | Records the conditional direct-remapping/vIOMMU validator model and explicitly excludes the current no-IOMMU bounce path. |
| Network-Reachable Datapath Scope Decision | Accepted design | Fixes the real-GCE-boot milestone’s reachable-network requirement to raw-frame TX/RX reachability, not a TCP/UDP socket round trip. |
| Phase C Userspace NIC Driver Relocation | Accepted design | Active Phase C design record for relocating the virtio-net driver into userspace over the landed device-authority surfaces. |
| Remote Session UI Security | Partially implemented | Defines the per-browser BrowserSession model, OWASP-style web hardening posture, cookie/CSRF/CSP/headers/Fetch-Metadata controls, and Tauri-wrapper capability-allowlist minimization for the trusted local remote-session-ui bridge; the loopback bridge now has per-browser cookies, CSRF checks, Host/Origin/content-type validation, first-wins ownership, and bounded HTTP parsing/threading. |
| mdBook Documentation Site | Partially implemented | Defines the documentation site structure, status vocabulary, and curation rules for architecture, proposal, security, and research pages. |
| capOS Repository Harness Engineering | Future design | Applies OpenAI-style harness engineering to the capOS repository through agent-facing maps, run-target inventories, proposal metadata, decision records, compiled knowledge, and workflow evals. |
| capOS Agentic Development Experiment | Future design | Defines the longitudinal study design for using capOS development sessions, subagents, reviews, raw archives, and recap tooling as an agentic software-engineering experiment; initial tooling only exists today. |
| SMP | Accepted design | Defines the selected per-CPU Phase A direction plus later AP startup, multi-core scheduler, and TLB shootdown work. |
| Ring v2 For Full SMP | Future design | Defines per-thread capability rings, completion routing, and SQPOLL ownership as the target transport model for full SMP. |
| Scheduler Evolution | Accepted design | Defines the layered scheduler architecture. Phase D WFQ and Phase E SchedulingContext gates are accepted; Phase F SQPOLL/nohz/tickless idle, realtime islands, and EEVDF evaluation remain follow-on work. |
| Tickless and Realtime Scheduling | Future design | Defines staged tickless idle, SQPOLL nohz CPU isolation, request deadline metadata, scheduling-context CPU-time authority, donation, and admitted realtime islands. |
| System Configuration and Operator Extensibility | Implemented | Defines operator-extensible CUE configuration. Slices 1-3 are closed, including defaults-package migration, system.local.cue overlay hooks, strict top-level manifest decoding, and the operator configuration how-to; Slice 4 adds mkmanifest cue-to-capnp for schema-aware CUE-authored data conversion. |
Future Architecture
| Proposal | Status | Purpose |
|---|---|---|
| Real-Filesystem Decision | Partially implemented | Records the accepted role split between capnp-native managed state and read-only FAT32 host/interop images; several FAT and host-tool increments have landed. |
| Installable System | Partially implemented | Defines installed persistent capOS boot/config/update/rollback composition; the bounded local/QEMU data-region, overlay, generation, install, provision, and update/rollback smokes have landed. Secure boot/signing, production release authority, public ingress, provider breadth, and full durable account policy remain future work. |
| Standard App Capabilities | Future design | Defines per-app AppData private storage, a user-mediated powerbox/file-picker grant, and attenuated capability sharing as native, structural alternatives to Google Drive’s appData/Picker/role mechanisms. |
| Google Drive Storage Backend | Future design | Defines using a Google-authenticated user’s Drive behind the standard storage caps, via a near-term browser-transport path and a gated native OAuth2/HTTP/TLS backend, with explicit remote-vs-local-cap trust semantics. |
| Networking | Partially implemented | Records implemented kernel-internal virtio-net ping/HTTP smokes, kernel TCP capability objects, and the host-local Telnet shell demo; userspace NIC and network-stack decomposition remains blocked on production DMAPool/DeviceMmio/Interrupt authority. |
| capos-service | Partially implemented | Defines a userspace service framework above capos-rt for lifecycle, endpoint serve loops, readiness, shutdown/drain, request/session context, metrics, and resource budgeting hooks. The first slice landed the standalone lifecycle crate and Telnet gateway wrapper; endpoint-loop helpers and richer supervision hooks remain future work. |
| Stateful Task and Job Graphs | Future design | Defines durable stateful task/job graphs for init orchestration, IX-style package builds, operator work queues, and notebook-style run stories without making the graph coordinator a god object. |
| Resource Accounting and Quotas | Partially implemented | Generalizes existing per-process ResourceLedger mechanisms to cross-service resource profiles, ledgers of record, quota donation, and fail-closed reservation semantics. |
| Memory Authority Model | Future design | Defines memory authority classes, residency, mapping consistency, TLB/frame-reuse rules, pinned/DMA/swap boundaries, and proof obligations before future shared-memory and device work build on the existing VirtualMemory and MemoryObject substrate. |
| OOM Handling and Swap | Future design | Defines memory-pressure policy, explicit OOM outcomes, budgeted anonymous memory, and optional encrypted swap without an ambient OOM killer. |
| Cryptography and Key Management | Partially implemented | Minimal SymmetricKey, PrivateKey/PublicKey ABI, RAM XChaCha20+HMAC/P-256 cores, RAM-only KeyVault custody, and development KeySource bootstrap landed; production custody and persistence remain future. |
| Volume Encryption | Future design | Defines encryption-at-rest for system and user volumes, including passphrase, recovery, cloud KMS, and measured-boot-backed key sources. |
| Userspace Binaries | Partially implemented | Describes native userspace binaries, capos-rt, Rust std, C/libcapos, C++, Go, Python, Lua, JavaScript/TypeScript, POSIX adapters, WASI host adapters, and runtime authority handling. |
| Go Runtime | Future design | Plans a custom GOOS=capos path, runtime services, memory growth, TLS, scheduling, and network integration for Go. |
| Lua Scripting | Partially implemented | Defines Lua as an ordinary capability-scoped userspace runner with curated libraries, exact grants, and no ambient shell or POSIX authority; Phase 0 and Phase 1 host bindings are in tree, while Phase 2+ remains future work. |
| WASI Host Adapter | Partially implemented | Defines a capos-wasm userspace host adapter whose WASI imports are backed by typed capOS capabilities, with wasmi for v0 (Phases W.1–W.6), wasmtime/WAMR as W.7+ migration targets, and the Component Model as the typed-cap-handle path. Phase W.1 host-runtime scaffold landed 2026-05-05 19:12 UTC (capos-wasm/ standalone crate over vendored vendor/wasmi-no_std/wasmi-1.0.9/, make capos-wasm-build); Phase W.2 closed 2026-05-07 10:53 UTC across four sub-slices: sub-slice 1 (wasm-host binary + empty-instantiation smoke + userspace-image budget bump, 2026-05-06 20:19 UTC), sub-slice 2 (Preview 1 stdout-only import resolver in capos-wasm/src/wasi/preview1.rs plus probe-driven nosys=52 proof, 2026-05-07 08:03 UTC), sub-slice 3 (Rust hello, wasi smoke + manifest-payload load path, 2026-05-07 09:36 UTC), and sub-slice 4 (C hello, wasi smoke through system clang-18 + Ubuntu wasi-libc, 2026-05-07 10:53 UTC). make run-wasm-host / make run-wasi-hello-rust / make run-wasi-hello-c are the boot smokes. Phase W.3 (per-instance CapSet plumbing + LaunchParameters) and successor phases remain future design. |
| POSIX Adapter | Partially implemented | Defines a two-layer C substrate (libcapos thin Rust staticlib, libcapos-posix POSIX surface on top) whose POSIX wrappers are backed by typed capOS capabilities. P1.1 closed at merge fe5f5208 (2026-05-05 13:28 UTC), P1.2 UDP + DNS smoke closed 2026-05-05 21:21 UTC, and P1.3 pipe + recording-shim fork-for-exec closed 2026-05-07 09:55 UTC; broad POSIX headers and a whole dns.c build remain future work. |
| POSIX fork/execve fd Inheritance | Implemented | Recording-shim execve inherits the parent’s live fd table by default with FD_CLOEXEC/O_CLOEXEC handling; only optional pre-spawn transferability refinement remains. |
| Shell | Partially implemented | Describes native, agent-oriented, and POSIX shell models over explicit capabilities instead of ambient paths. |
| Remote Session CapSet Clients | Partially implemented | Defines regular host apps, including CLI, native GUI, Tauri backends, webapp gateways, and agent runners, that authenticate to capOS, keep broker-issued remote CapSets in trusted client-side backends, call granted capabilities over Cap’n Proto RPC, and optionally grant bounded UI-composition caps back to capOS services. The first implementation slice proves this with a schema-framed DTO transport; standard capnp-rpc proxy transport remains future work. |
| SSH Shell Gateway | Partially implemented | Defines production remote CLI shell access through SSH while preserving the same TerminalSession and broker-issued shell-bundle boundary proven by the Telnet shell demo; focused QEMU proofs now cover the non-production SshHostKey, manifest-seeded AuthorizedKeyStore, public-key session bridge, unsupported-feature policy table, scoped listener, restricted shell launcher, and a bounded plain-TCP terminal-host wiring slice. Full OpenSSH transport remains future work. |
| Telnet over TLS Shell | Future optional design | Defines a peer optional remote-shell path to the SSH gateway: TLS 1.3 over the existing Telnet TerminalSession handoff, with mTLS client certificates as the recommended user-auth path and CredentialStore passwords as fallback. Reuses the project’s PKI/ACME/cert-rotation track instead of inventing a parallel SSH-only key-management story. Smaller protocol surface than SSH; different operational profile, not the default main access interface. |
| Language Models and Agent Runtime | Future design | Defines language-model and embedder capabilities, local and remote backends, capOS-side agent runners, and browser-agent UI orchestration through gateway-enforced tool execution. |
| capOS-Hosted Agent Swarms | Future design | Defines OpenClaw-like hosted personal agents, swarms, harness controls, task workspaces, agent memory/wiki services, MCP/A2A-style adapters, and the research agenda for capability-scoped background agents. |
| Enterprise Agent Game Showcase | Future design | Positions a playable business simulation as the capOS enterprise-agent showcase: agents manage procurement, finance, operations, logistics, markets, and audit under OS-enforced capability policy. |
| Chat As Multimedia Substrate | Future design | Defines Chat as a unified text/audio/video transport for human, agent, and service participants, with listener-cap delivery and a clean WebRTC mapping for browser surfaces, so new messaging surfaces do not require new top-level capabilities or gateway DTOs. |
| Realtime Voice Agent Shell | Future design | Extends the agent-shell path for native realtime audio models, direct browser provider media, and browser-agent UI sessions while preserving broker-mediated tool execution and web-shell session boundaries. |
| Interactive Command Surfaces | Future design | Defines structured command sessions for native interactive applications so familiar text commands compile to typed invocations instead of application-owned StdIO parsers. |
| Userspace Authority Broker | Future design | Proposes moving shell bundle policy out of the kernel and making shutdown an init-owned lifecycle control capability granted only after login. |
| Aurelian Frontier | Partially implemented | Capability-native persistent-world RPG on a Roman-inspired magical frontier. Current proof slice covers the deterministic mission, command discoverability, typed room view, CUE-sourced content with make generated-code-check freshness, resume cap, Phase 9 rank/skill/standing gates, Phase 10 market quote/buy/sell/trade/repair, Phase 11 session-keyed player state with fixed-smoke seed/variant metadata, Phase 11a calendar/festival/military event status plus the seasonal quartermaster ration purchase, Phase 11b regional delivery with bounded inventory capacity, player-local chit currency, seller-outpost stock, service-owned market fee accrual/withdrawal, seller-outpost proceeds, order expiry, Phase 11c construction material holds/restores plus the receipt snapshot proof, Phase 11d disabled-by-default fake-agent budget/dialogue, Phase 12 party labels/verbs and physical-item transfer, the settlement snapshot proof, and the eagle-standard/gate-seal/temple-seal/under_vault interactive transcript. See the runnable proof slice for current commands and coverage. Production seeds, two-client multiplayer transfer escrow, PvP consent authority, durable ledgers, full economy behavior, and a 2D tilemap browser client remain future work. |
| Contributor Quest Mechanics | Future design | Defines a post-adventure follow-up where maintainer-witnessed open-source contributions can mint cosmetic badges, states, decorations, and bounded game perks without granting repository or OS authority. |
| Public Release and Maintainer Boundaries | Future design | Defines the release posture, security-audit disclaimer, issue/PR intake limits, maintainer-load boundaries, and the adventure-repository-split and git-history-rewrite hygiene gates required before making the repository public. Defers the long-term sibling-repository rule to the Repository Composition proposal. |
| Repository Composition | Future design | Defines the scope rule for the capOS core repository, the list of tracks (adventure, whitepaper, public site, userspace netstack, remote-access services, protocol stacks, language runtimes, GPU, agent shell, cloud images, volume crypto) that should ship as siblings, the when-to-split criteria, the cross-repository mechanics, and the intended cap-os-dev GitHub organization placement. |
| Boot to Shell | Partially implemented | Defines text-only console and web-terminal login/setup, password verifier and passkey authentication, and the authenticated native shell launch path after manifest execution, terminal input, native shell, session, broker, audit, and credential-storage prerequisites are credible. |
| System Info Capability | Phase 1 + Phase 2 implemented | Unifies the system-wide informational capability (MOTD today; hostname, help topics, manpages later), moves banner printing into the shell, and has AuthorityBroker.shellBundle mint SystemInfo plus profile-scoped chat/adventure service endpoint caps for operator shells. Guest and anonymous shells receive no service endpoints by default. |
| System Manual Capability | Partially implemented | A built-in man-pages analog: shell man/apropos, self-served web-UI doc viewer, schema-derived section-2 description proofs, and programmatic API/agent-export consistency are settled, with remaining follow-ups described in the proposal. |
| System Monitoring | Future design | Defines capability-scoped logs, metrics, health, traces, crash records, and audit/status views. |
| Time and Clock Authority | Partially implemented | Defines WallClock and ClockDiscipline; Phase 1 WallClock read/provenance is landed, with trusted/network-synchronized time still future. |
| Debug and Trace Authority | Future design | Capability-scoped process-attach, read-only cap-table inspection, ring-trace capture, and sampler authority with explicit consent and audit; no ambient ptrace analog. |
| Hardware Audit Log Persistence | Partially implemented | Store-inventory segment retention, retained-window recovery, hash-chain evidence, manifest reader admission, a local persistent-store reboot proof, development-source RAM-local HMAC segment seals, and explicit runtime-reader refusal have landed; external key custody, production rotation/revocation, rollback policy, and authority-broker runtime admission remain future. |
| Crash Recovery and Supervision | Future design | Defines stale-cap DISCONNECTED propagation on unplanned process death, structured crash records appended to the supervisor’s AuditLog, bounded restart policy with crash-loop detection, watchdog liveness, and degraded-boot fallback. |
| System Performance Benchmarks | Future design | Defines correctness-gated primitive, workload, and user-story benchmarks for comparing capOS with other operating systems without distorting capability semantics. |
| HPC Parallel Processing Patterns | Future design | Extends benchmark planning from static SMP/thread scaling proofs to generic single-node and multi-node parallel pattern coverage: map/reduce, task pools, barriers, scans, stencils, dense/sparse kernels, graph frontiers, pipelines, and collectives. |
| Scientific Standard Package and Agent Lab Capabilities | Future design | Defines a curated scientific service graph for CAS, numerical computing, solvers, proof assistants, notebooks, package closures, provenance, and LLM agent research-lab workflows. |
| User Identity and Policy | Partially implemented | Defines users, sessions, guest profiles, and policy layers for RBAC, ABAC, and MAC over capability grants. Current implementation has anonymous/operator/guest UserSession metadata, bootstrap credential/session flows, broker-issued shell bundles, and seed-account configuration; durable accounts, external bindings, session revocation, quotas, and broader ABAC/MAC remain future work. |
| Delegated Subject Context | Future design | Defines bounded act-on-behalf-of subject context as separate from capability transfer and from the completed session-bound invocation context milestone. |
| Default User Avatar | Partially implemented | Deterministic default user avatar derived from a stable account identifier, with the shell-side default mapping implemented and schema-carried avatar caps plus durable overrides still future work. |
| Cloud Metadata | Future design | Describes cloud instance bootstrap through metadata/config-drive capabilities and manifest deltas. |
| Cloud Deployment | Partially implemented | Records QEMU boot, serial output, ACPI/PCI/MSI-X discovery work, the landed cloudboot image/harness, the first GCP imported-image serial-console boot proof, and the GCP-first usable-instance provider rollup; public L4/SSH/WebShell ingress, broader storage variants, cloud clocking, production cloud-image release, AWS/Azure proofs, and aarch64 deployment remain future work. |
| Live Upgrade | Future design | Defines service replacement without dropping capabilities or in-flight calls through retargeting and quiesce/resume protocols. |
| GPU Capability | Future design | Sketches capability-oriented GPU, CUDA, memory, and driver isolation models. |
| capOS As A Robot Brain | Future design | Defines capability-oriented robotics service graphs, actuator gateways, safety monitors, realtime control islands, and ROS 2/micro-ROS/MAVLink/OPC UA bridges. |
| Formal MAC/MIC | Future design | Defines a formal mandatory-access and mandatory-integrity model plus future proof obligations. |
| Browser/WASM | Future design | Explores running capOS concepts in a browser using WebAssembly and worker-per-process isolation. |
| Browser Capability and Agent Web Sessions | Future design | Defines browser profiles, a cap-native document-engine middle track, visual browsing after GUI, and earlier agent/shell browser sessions as capability-scoped services over external or native browser backends. |
| Certificates and TLS | Partially implemented | Phase 1 dependencies, host verifier, minimal signing keys, RAM-only vault custody, and development KeySource bootstrap have landed; TLS and ACME remain future. |
| OIDC and OAuth2 | Future design | Defines federated login, OAuth2 clients, typed token capabilities, JWKS, DPoP, token-exchange workload identity federation, and the broker integration for scopes/claims as ABAC input. |
Rejected or Superseded
| Proposal | Status | Purpose |
|---|---|---|
| Endpoint Badges as Service Identity | Rejected | Post-mortem for the seL4-style endpoint badge identity model that was superseded by Service Object Capabilities, then by Session-Bound Invocation Context. |
| Service Object Capabilities | Superseded | Historical service-minted object capability model; the landed synthetic routing/lifecycle proof remains low-level coverage, but the implemented replacement is Session-Bound Invocation Context. |
| Cap’n Proto SQE Envelope | Rejected | Records why ring SQEs stay fixed-layout transport records instead of becoming Cap’n Proto messages themselves. |
| Sleep(INF) Process Termination | Rejected | Records why infinite sleep should not replace explicit process termination, while preserving typed status and future sys_exit removal as separate lifecycle work. |
Maintenance
When a proposal becomes implemented, rejected, or stale, update this index in the same change that changes the proposal or corresponding implementation. If the proposal is implemented, also update or create the stable current-design page named by Current Design Authority. Long proposal files may describe target behavior; this index is the first status checkpoint before a reader opens those documents.
Proposal: Capability-Based Service Architecture
How capOS processes receive authority, compose into services, and expose layered capabilities — without a service manager daemon.
Problem
Traditional OSes grant processes ambient authority (file system, network, IPC namespaces) and then restrict it via sandboxing (seccomp, namespaces, AppArmor). Service managers like systemd handle dependencies, lifecycle, and resource limits through a central daemon with a massive configuration surface.
capOS inverts this: processes start with zero authority and receive only the capabilities they need. The capability graph implicitly encodes service dependencies, resource limits, and access control. No central daemon required.
Process Startup Model
A process receives its entire authority as a set of named capabilities at spawn time. There is no ambient authority to fall back on — if a capability wasn’t granted, the operation is impossible.
The child process sees its granted capabilities by name. It cannot discover or request capabilities it wasn’t given.
Capability Layering
Each process consumes lower-level capabilities and exports higher-level ones. Authority narrows at every layer:
Kernel
│
├─ Nic cap (raw frame send/receive for one device)
├─ Timer cap (monotonic clock)
├─ DeviceMmio cap (one device's BAR regions)
└─ Interrupt cap (one IRQ line)
│
v
NIC Driver Process
│
└─ Nic cap ──> Network Stack Process
│
├─ TcpSocket cap (one connection)
├─ UdpSocket cap (one socket)
└─ NetworkManager cap (create sockets)
│
v
HTTP Service Process
│
├─ Fetch cap (any URL)
│ │
│ v
│ Trusted Process (holds Fetch, mints scoped caps)
│
└─ HttpEndpoint cap (one origin)
│
v
Application Process
The application at the bottom holds an HttpEndpoint cap scoped to a single
origin. It cannot make raw TCP connections, send arbitrary packets, or touch
any device. The capability is the security policy.
HTTP Capabilities
Two levels of HTTP capability: Fetch (general) and HttpEndpoint (scoped).
HttpEndpoint is implemented by a process that holds a Fetch cap and
restricts it.
Fetch
Unrestricted HTTP access — equivalent to the browser Fetch API. The holder can make requests to any URL. This is the base capability that HTTP service processes use internally.
interface Fetch {
# General-purpose HTTP request to any URL.
request @0 (url :Text, method :Text, headers :List(Header), body :Data)
-> (status :UInt16, headers :List(Header), body :Data);
}
struct Header {
name @0 :Text;
value @1 :Text;
}
Fetch is powerful — granting it is roughly equivalent to granting arbitrary
outbound network access. It should only be held by service processes that need
to make requests on behalf of others, not by application code directly.
HttpEndpoint
A restricted view of Fetch, scoped to a single origin. The holder can only
make requests within the bounds encoded in the capability.
interface HttpEndpoint {
# Request scoped to this endpoint's origin.
# Path is relative (e.g., "/v1/users").
request @0 (method :Text, path :Text, headers :List(Header), body :Data)
-> (status :UInt16, headers :List(Header), body :Data);
}
Note: same request() signature as Fetch, but path instead of url.
The origin is implicit — bound into the capability at mint time.
Attenuation
A process holding Fetch mints HttpEndpoint caps by narrowing authority.
The core restriction is always origin — Fetch can reach any URL,
HttpEndpoint is locked to one host. Additional constraints (path prefixes,
method restrictions, rate limits) are possible but are userspace policy
details, not OS-level concerns.
This is the standard object-capability attenuation pattern: same interface,
less authority. The application code is identical whether it holds a broad or
narrow HttpEndpoint.
Boot and Initialization Sequence
The kernel doesn’t know about services. It boots, creates a handful of
kernel-provided caps, and spawns exactly one process: init. Everything else
is init’s responsibility.
Current State vs Target State
The implementation has crossed the single-init startup milestone and the 15.4
schema split. SystemManifest now carries schemaVersion, binaries,
initConfig, and kernelParams. The Cap’n Proto schema no longer exposes
ServiceEntry, ServiceCapSource, CapRef, exports, or restart policy as
kernel-consumed fields. Those service-graph concepts remain as Rust parsing
types inside capos-config because the focused init executor still interprets
initConfig.services.
Each process now also carries an immutable session context produced at spawn
time by kernel/src/session_context.rs; default inheritance comes from the
parent’s session context, and a broker can select a child session through the
AuthorityBroker/UserSession path. This invocation context is the basis for
session-scoped audit attribution and identity-policy enforcement; see
User Identity and Policy
and make run-session-context for the one-session-per-process proof.
Current manifests put the first process description at initConfig.init.
The default system.cue manifest now boots the separate init binary with
BootPackage and ProcessSpawner; that init process reads initConfig.services
and starts the shell, remote-session CapSet gateway, chat server, and resident
demo services.
Focused shell-led manifests such as system-smoke.cue and system-shell.cue
still boot capos-shell as the lone init process for narrow login/shell
proofs. Focused init-executor manifests such as system-spawn.cue,
system-chat.cue, and system-adventure.cue boot the separate init binary
with BootPackage and ProcessSpawner; that init process reads
initConfig.services and resolves the remaining service graph through
ProcessSpawner. Other focused single-service or harness manifests still boot
a demo/service binary as the init process for narrow proofs. The kernel
validates only the kernel-owned boot boundary: schema version, binaries,
kernelParams, initConfig.init.binary, and kernel-sourced
initConfig.init.caps.
Current Bootstrap Ownership Inventory
As of 2026-05-13, the repo is in the schema-split init-owned startup state:
schema/capos.capnpdefinesSystemManifestasschemaVersion,binaries,initConfig, andkernelParams. Service graph fields are not Cap’n Proto schema fields.capos-config/src/manifest.rsstill definesServiceEntry,CapRef,CapSource::Kernel,CapSource::Service, andRestartPolicyas internal Rust types for parsinginitConfig.services.tools/mkmanifeststill embeds every declared binary into the manifest and validates the full init-owned graph before writingmanifest.bin.capos-config/src/validation.rsseparates kernel bootstrap validation from init graph validation. Kernel bootstrap validation covers binary names,initConfig.init.binary, init kernel cap sources, andkernelParams. Full graph validation coversinitConfig.servicesfor mkmanifest and init’s metadata-onlyManifestBootstrapPlanpath.kernel/src/main.rs::run_initreads the Limine manifest module, validates the kernel-owned bootstrap contract, configures serial policy fromkernelParams, and loads onlyinitConfig.init.binary.kernel/src/cap/mod.rs::create_boot_service_capsbuilds onlyinitConfig.init.caps. Those caps are kernel-sourced by type, so the kernel has noCapSource::Servicebranch.- The init cap bundle is currently described by
initConfig.init.caps. In the defaultsystem.cuemanifest this grants the separateinitbinary the bootstrap caps it needs to read BootPackage and spawn the service graph. In focused shell-led manifests such assystem-smoke.cue, this still grantscapos-shellterminal, credential, session, audit, and broker capabilities directly. In focused single-service or harness manifests,initConfig.init.capsgrants only the capabilities the harness itself needs. BootPackageexposes the full serialized manifest bytes to init. That path is live for default and focused init-executor manifests. Focused shell-led manifests do not grantBootPackagetocapos-shell.ProcessSpawnerowns the embedded binary set. It receives the boot manifest bytes so delegatedProcessSpawnergrants can preserve that same boot package context; childBootPackagecaps are not minted fromSpawnGrantSource::Kernel.ProcessSpawner.createPipe(bufferBytes)mints a bounded SPSC kernelPipecapability used by the POSIX adapter Phase P1.3 recording-shim fork-for-exec path; see POSIX Adapter §Phase P1.3 and Userspace Binaries Part 4.ProcessSpawner.spawnresolvesSpawnGrantSource::Kernelfor the bounded manager-issued DDF authority surfaces (DeviceMmio,DMAPool,Interrupt,HardwareAuditLog) through the matching grant-source records inkernel/src/cap/devicemmio_grant_source.rs,kernel/src/cap/dmapool_grant_source.rs, and their interrupt/audit peers. Each grant attaches a fresh manager-owned record, validates owner/quiesce/ scrub state for DMA-side caps, and returns a child-local handle without sharing the parent’s owner object. See device-driver-foundation.md Task 5 for the bounded-authority scope and the focusedmake run-devicemmio-grant,make run-dmapool-grant,make run-interrupt-grant, andmake run-hardware-auditsmokes.init/src/main.rsis the focused BootPackage executor. When that binary is the init process, it reads the BootPackage manifest, builds aManifestBootstrapPlan, validates it again, discovers its own kernel grants frominitConfig.init.capsplus the CapSet, preflights theinitConfig.servicesgraph, resolves kernel and service cap sources, records exports, spawns children throughProcessSpawner, and waits on theirProcessHandles.system.cue,system-smoke.cue,system-spawn.cue,system-chat.cue,system-adventure.cue, and the other focused manifests now express their first-process bundle underinitConfig.initand any child topology underinitConfig.services.
The practical cleanup boundary is therefore not “move service startup to init”; that already happened. The current cleanup target is narrower: the kernel no longer understands the service graph as a bootstrap authority structure. The remaining future cleanup is to stop letting focused harnesses choose arbitrary init binaries and direct kernel cap bundles, then move to one fixed generic-init ABI.
Narrowed Transitional Contract
The current schema is schemaVersion, binaries, initConfig, and
kernelParams. The narrowed kernel contract is:
- The kernel validates
schemaVersion, parseskernelParamsfor kernel-consumed boot policy, and configures serial policy. - The kernel resolves only
initConfig.init.binaryagainstbinariesand loads only that ELF. - The kernel may interpret
initConfig.init.capsonly as the bootstrap cap bundle for the single first process. Those caps must be kernel-sourced; a service-sourced cap ininitConfig.init.capsis invalid because no non-init service exists at kernel handoff time. initConfig.services[*], theircaps,exports,restart, and anyCapSource::Servicereferences are init-owned configuration while the transitional Rust parser exists.mkmanifestand init continue validating them for smoke coverage, but kernel bootstrap does not run the multi-service graph validator or a service export resolver.- Focused harness manifests that intentionally boot a demo/service binary as
init stay valid during this slice. Their harness-specific caps are still
described by
initConfig.init.capsuntil those smokes are migrated behind a generic init-owned executor config.
Kernel bootstrap implements this contract with a first-service cap-table
builder. That builder covers only implemented kernel sources used by current
initConfig.init.caps lists.
That current first-service surface is wider than the eventual generic-init
minimum: the default init-owned path needs Console, TerminalSession,
CredentialStore, SessionManager, AuditLog, AuthorityBroker, BootPackage,
ProcessSpawner, listener, launcher, and chat endpoint authority so it can
launch the current service graph; focused shell-led paths still need
TerminalSession, CredentialStore, SessionManager, AuditLog, and
AuthorityBroker directly; focused harnesses need their own direct kernel caps.
Cross-service export lookup, service-source attenuation, and non-init
cap-resolution policy stay in init/src/main.rs for the focused
BootPackage-executor manifests.
Target Boot Package Contract
After the harness migration, SystemManifest should keep the same outer shape
but initConfig.init should stop being a per-manifest kernel bootstrap bundle.
At that point:
ServiceEntry,CapRef,CapSource::Service, service exports, and restart policy remain ordinary data insideinitConfig, interpreted and validated by init or a supervisor service.- Kernel validation is limited to the schema version, kernel parameters, boot-package integrity/measurement policy, and enough binary metadata to load the one init image.
- The first process is the generic init/supervisor, not a demo harness or shell. Shell-led and focused single-service proofs should become init-owned configurations rather than alternate kernel bootstrap contracts.
- The fixed direct kernel bundle for that generic init starts with
Console,BootPackage, andProcessSpawnerin the currently implemented system. This is the target generic-init minimum, not the full transitionalinitConfig.init.capssurface. The architecture-level target also includesTimer,DeviceManager,FrameAllocator, and per-processVirtualMemoryonce those authorities are ready to be part of init’s stable bootstrap ABI. Until then, FrameAllocator, VirtualMemory, and Endpoint grants for child processes remain minted throughProcessSpawnerspawn grants.
The target model removes the kernel-side service graph entirely. The manifest stops being a kernel authority graph and becomes a boot package delivered to init:
- List of embedded binaries (init needs them before any storage service exists; they can’t be fetched from a filesystem that hasn’t started).
- Init’s config blob (CUE-encoded tree; what to spawn, with what attenuations, with what restart policy).
- Kernel boot parameters (memory limits, feature flags) consumed by the kernel itself, not forwarded to init.
The kernel spawns exactly one userspace process (init) with a fixed cap bundle:
Console— kernel serial wrapper (may be replaced later by a userspace log service, with init retaining a direct console cap for emergency use).ProcessSpawner— only init and its delegated supervisors hold this.FrameAllocator— physical frame authority for init’s own allocations.VirtualMemory— per-process address-space authority for init.DeviceManager— enumerate/claim devices; init delegates device-specific slices to drivers.Timer— monotonic clock.BootPackage— read-only cap exposing the embedded binaries and the config blob.
Everything else — drivers, net-stack, filesystems, supervisors, apps —
init spawns at runtime via ProcessSpawner with appropriate attenuation.
No manifest ServiceEntry, no cross-service CapRef, no manifest exports.
Pre-Init Boundary After Stage 6
Rule of thumb: no userspace service runs before init. The kernel’s job is primitive cap synthesis and a single-process handoff; init’s job is the whole service graph. Concretely, after Stage 6:
- Stays in kernel pre-init: memory map ingest, frame allocator, heap,
paging, GDT/IDT/TSS, serial for kernel diagnostics, scheduler, ring
dispatch, kernel-cap
CapObjectimpls, ELF loading for init, boot package measurement (if attested boot is added). - Stays in manifest: binaries list + init config blob + kernel boot
params. Schema-wise,
ServiceEntryandCapSource::Servicedisappear;SystemManifestshrinks tobinaries + initConfig + kernelParams. - Moves to init: service topology, cross-service cap wiring, attenuation, restart policies, dynamic spawn, cap export/import, supervision trees. Anything a service manager would do.
- Moves to init or later services: logging policy, config store, secrets, filesystem mounts, network configuration, device binding.
Edge cases that might look like they want a pre-init service but don’t:
- Early crash / panic handling. Kernel-side panic handler, no service needed.
- Recovery shell. Kernel fallback: if init fails to reach a healthy state within a timeout (e.g. exits immediately, or never issues a liveness SQE), kernel optionally spawns a “recovery” binary from the boot package with the same cap bundle. Still just one userspace process at a time pre-supervisor-loop.
- Attested/measured boot. Kernel hashes binaries in the boot package
before handing
BootPackageto init. The measurement agent, if any, runs as a normal service spawned by init with a cap to the sealed measurements. - Early-boot console. Kernel owns serial and exposes
Consoleto init. A userspace log service can layer on top later; it is not pre-init.
Legacy Manifest Fields After Stage 6
ServiceEntry.caps, CapSource::Service, and ServiceEntry.exports are
transitional init configuration, not kernel schema. The 15.4 schema split
deleted them from schema/capos.capnp, collapsed the service graph into
initConfig: CueValue, and kept kernel bootstrap on the first-service cap-table
builder. The remaining cleanup is to make that first-service bundle fixed
rather than manifest-selected:
- Move shell-led and focused harness proofs behind an init-owned executor config instead of booting their binaries directly as init.
- Embed or otherwise pin the generic init image as the only kernel-loaded
userspace image. Partially landed (2026-05-25 23:26 UTC): the
initimage is embedded and loaded fromkernel::boot::INIT_ELFwheneverinit.binary == "init"(see “Init Binary Embedding”). It is not yet the only kernel-loaded image — until step 1 moves the focused/shell proofs behind an init-owned executor, non-"init"PID-1 selectors are still kernel-loaded frombinaries. - Replace per-manifest
initConfig.init.capswith the fixed bootstrap cap bundle described above plusBootPackage. - Keep
initConfig.servicesas ordinary init/supervisor configuration until a later libcapos or supervisor API gives it a more concrete format.
The re-export restriction added in capos-config::validate_manifest_graph
(service A exports cap sourced from B.ep) becomes moot at that point
because there are no kernel-owned manifest exports at all. It stays as
defensive validation for initConfig.services while the transitional
init-owned executor exists.
Init Binary Embedding
Status: landed 2026-05-25 23:26 UTC as a hybrid keyed on the reserved init
selector (see below). Init is part of the kernel’s bootstrap contract, not a
configuration choice: the cap bundle handed to init is a kernel ABI, the
_start(ring, pid, …) entry shape is a kernel ABI, and a version-mismatched
init is a footgun with no payoff in a single-init research OS. So the init ELF
ships inside the kernel binary via include_bytes!, not as a separate
manifest entry or Limine module.
Shape (as landed):
init/stays a standalone crate with its own linker script and code model (user-space base0x200000,staticrelocation model, 4 KiB alignment). Not a workspace member; different build flags than the kernel.kernel/build.rsreads the prebuiltinit/artifact (the Makefile passesCAPOS_INIT_ELFand ordersinitbefore the kernel; a conventional-path fallback covers a barecargo buildafter init is built) and emits aninclude_bytes!("…")into akernel::boot::INIT_ELF: &[u8]static. Driving init’s build frombuild.rswas rejected to avoid duplicating its custom target/code-model flags; failing closed on a missing artifact is the chosen behavior.initConfig.init.binaryis a generic “which binary is PID 1” selector, so embedding is keyed on the reserved namecapos_config::RESERVED_INIT_BINARY_NAME("init"). Wheninit.binary == "init", kernel bootstrap parsesINIT_ELFthrough the samecapos_lib::elfpath used for service binaries, creates the init address space viaAddressSpace::new_user(), loads segments, populates the cap bundle (includingBootPackage), and jumps — no Limine module lookup and nobinariesresolution for that identity. Wheninit.binarynames any other binary (the shell onrun-smoke, the ~70 focused test-as-PID-1 manifests), PID 1 still resolves fromSystemManifest.binariesexactly as before.- The reserved name
"init"must not appear inSystemManifest.binaries: manifest validation (capos-configandmkmanifest) rejects it, since the kernel owns theinitimage. Real-init manifests drop theirinitentry; theirbinarieslist is services-only. - The embedded image is the canonical
initbinary, so init’s own child spawns that referenceinitby name (e.g.system-spawn.cue’s spawn-hardening fixtures) still resolve: when init is embedded,run_initinjects the embedded bytes into theProcessSpawnerbinary set under the reserved name (theBootPackagecap serves only the serialized manifest bytes, which never carry the reserved entry). This keeps the spawnable set identical to the pre-embedding state withoutinitre-entering the serialized manifest. Service binaries remain distinctBootPackageblobs. - Measured-boot attestation (if added) covers the kernel ELF, which
transitively covers init’s bytes. Service binaries are hashed
separately by the kernel before handing
BootPackageto init.
What this does not change:
- Init still runs in Ring 3 with its own page tables; embedding is byte packaging, not privilege merging.
- Init is still ELF-parsed at boot — the same loader and W^X enforcement apply. The only thing different is where the bytes came from.
- Service binaries (everything spawned after init) stay in the boot
package as distinct blobs, exposed to init via
BootPackage. They are not linked into the kernel; their lifecycle is independent of the kernel’s.
What option was rejected: fully linking init into the kernel crate (shared
compilation unit, shared text). That collapses the kernel/user build
boundary, couples linker scripts and code models, and puts init’s
panics/UB inside the kernel’s compilation context. The process-isolation
boundary survives that arrangement — but the build-time separation that
makes the boundary trustworthy does not. include_bytes! preserves the
separation; static linking destroys it.
Kernel boot
│
├─ Create kernel caps: Console, Timer, DeviceManager, ProcessSpawner
│
└─ Spawn init with all kernel caps
│
init process (PID 1)
│
├─ Phase 1: Core services (sequential — each depends on previous)
│ ├─ DeviceManager.enumerate() → list of devices
│ ├─ Spawn NIC driver with device-specific caps
│ ├─ Wait for NIC driver to export Nic cap
│ ├─ Spawn net-stack with Nic + Timer caps
│ └─ Wait for net-stack to export NetworkManager cap
│
├─ Phase 2: Higher-level services (can be parallel)
│ ├─ Spawn http-service with TcpSocket cap from net-stack
│ ├─ Spawn dns-resolver with UdpSocket cap
│ └─ ...
│
└─ Phase 3: Applications
├─ Spawn app-a with HttpEndpoint("api.example.com")
├─ Spawn app-b with Fetch cap (trusted)
└─ ...
The Init Process in Detail
Init is a regular userspace process with privileged caps. It is the only
process that holds ProcessSpawner (the right to create new processes) and
DeviceManager (the right to enumerate and claim devices). It can delegate
subsets of these to child supervisors.
// init/src/main.rs — this IS the system configuration
fn main(caps: CapSet) {
let spawner = caps.get::<ProcessSpawner>("spawner");
let devices = caps.get::<DeviceManager>("devices");
let timer = caps.get::<Timer>("timer");
let console = caps.get::<Console>("console");
// === Phase 1: Hardware drivers ===
// Find the NIC
let nic_device = devices.find("virtio-net")
.expect("no network device found");
// Spawn NIC driver — gets ONLY its device's MMIO + IRQ
let nic_driver = spawner.spawn(SpawnRequest {
binary: "/sbin/virtio-net",
caps: caps![
"device_mmio" => nic_device.mmio(),
"interrupt" => nic_device.interrupt(),
"log" => console.clone(),
],
restart: RestartPolicy::Always,
});
// The driver exports a Nic cap once initialized
let nic: Cap<Nic> = nic_driver.exported("nic").wait();
// === Phase 2: Network stack ===
let net_stack = spawner.spawn(SpawnRequest {
binary: "/sbin/net-stack",
caps: caps![
"nic" => nic,
"timer" => timer.clone(),
"log" => console.clone(),
],
restart: RestartPolicy::Always,
});
let net_mgr: Cap<NetworkManager> = net_stack.exported("net").wait();
// === Phase 3: HTTP service ===
let tcp = net_mgr.create_tcp_pool();
let http_service = spawner.spawn(SpawnRequest {
binary: "/sbin/http-service",
caps: caps![
"tcp" => tcp,
"log" => console.clone(),
],
restart: RestartPolicy::Always,
});
let fetch: Cap<Fetch> = http_service.exported("fetch").wait();
// === Phase 4: Applications ===
// Trusted telemetry agent — gets full Fetch
spawner.spawn(SpawnRequest {
binary: "/sbin/telemetry",
caps: caps![
"fetch" => fetch.clone(),
"log" => console.clone(),
],
restart: RestartPolicy::OnFailure,
});
// Sandboxed app — gets scoped HttpEndpoint
let api_cap = fetch.attenuate(EndpointPolicy {
origin: "https://api.example.com",
paths: Some("/v1/users/*"),
methods: Some(&["GET", "POST"]),
});
spawner.spawn(SpawnRequest {
binary: "/app/my-service",
caps: caps![
"api" => api_cap,
"log" => console.clone(),
],
restart: RestartPolicy::OnFailure,
});
// Init stays alive as the root supervisor
supervisor_loop(&spawner);
}
Key Mechanisms
Cap export. A spawned process can export capabilities back to its parent
via the ProcessHandle (see Spawn Mechanism section). This is how the NIC
driver makes its Nic cap available to the network stack — init spawns the
driver, waits for it to export "nic", then passes that cap to the next
process.
Restart policy. Encoded in SpawnRequest, enforced by the supervisor
loop in the spawning process. When a child exits unexpectedly:
- Old caps held by the child are automatically revoked (kernel invalidates the process’s cap table on exit)
- Supervisor re-spawns with the same
SpawnRequest - New instance gets fresh caps — same authority, new identity
Dependency ordering. Sequential in code: wait() on exported caps
blocks until the dependency is ready. No declarative dependency graph
needed — Rust’s control flow is the dependency graph.
Service Taxonomy
Concrete categories of userspace services capOS expects to run. All spawned by init (or a supervisor init delegates to) after Stage 6. None are pre-init.
Hardware Drivers
One process per managed device. Each holds exactly the caps for its own
hardware: an DeviceMmio slice, the corresponding Interrupt cap, and
optionally a DmaRegion cap carved out of the frame allocator. Exports a
typed device cap (Nic, BlockDevice, Framebuffer, Gpu, …). Examples:
virtio-net, virtio-blk, NVMe, AHCI, framebuffer/GPU.
Platform Services
- Logger / journal — accepts
Logcap writes, forwards to console and/or durable storage. Init and kernel bootstrap use a directConsolecap until the logger is up; afterwards new services getLogcaps only. - Filesystem — one per mounted volume. Consumes a
BlockDevicecap, exportsDirectory/Filecaps. FAT, ext4, overlay, tmpfs. - Store — capability-native content-addressed storage backing
persistent capability state (
storage-and-naming-proposal.md). - Network stack — userspace TCP/IP (
networking-proposal.md). ConsumesNic+Timer, exportsNetworkManager,TcpSocket,UdpSocket,TcpListener. - DNS resolver — consumes a
UdpSocket, exportsResolver. - Config / secrets store — reads the initial config from
BootPackage, exposes runtimeConfigandSecretcaps with per-key attenuation. - Cloud metadata agent — detects IMDS / ConfigDrive / SMBIOS on cloud
boot and delivers a
ManifestDelta(cloud-metadata-proposal.md). - Upgrade manager — orchestrates
CapRetargetfor live service replacement (live-upgrade-proposal.md). - Capability proxy — makes selected local caps reachable over the network.
The near-term shape is typed Cap’n Proto RPC or a schema-framed proxy,
following Cloudflare’s production pattern of schema-bundled Workers bindings
to internal services; later remote-capability sessions can borrow
Spritely/OCapN CapTP’s session, handoff, and reference-lifetime model without
treating current OCapN drafts as capOS ABI commitments. The proxy must never
serialize local
CapIdvalues, endpoint generations, receiver selectors, or kernel/session ids as portable authority, and it must own explicit resource ledgers for remote refs, queued calls, streams, and retries. See Cloudflare, Cap’n Proto, Workers RPC, and Cap’n Web and Spritely, OCapN, and CapTP. - Measurement / attestation agent — consumes sealed kernel hashes
from
BootPackage, exposesQuotecaps for remote attestation.
Supervisors
Per-subsystem restart managers that hold a narrowed ProcessSpawner plus
the caps of the subtree they own. If any child crashes, the supervisor
tears down and re-spawns the set. Example: net-supervisor owns NIC
driver + net-stack + DHCP client.
Application Services
User-facing or user-spawned processes: HTTP servers, API gateways, worker
pools, shells, interactive tools. Hold only the narrow caps the supervisor
grants (HttpEndpoint for one origin, Directory for one mount, etc.).
Human users, service accounts, guests, and anonymous callers are represented
by session/profile services that grant scoped cap bundles; they are not kernel
subjects or ambient process credentials. See
User Identity and Policy.
What Does Not Become a Service
- Console / serial — stays in the kernel as a
CapObjectwrapper. Small enough, needed for kernel diagnostics, no benefit from userspace isolation. A userspace log service can layer on top. - Frame allocator, virtual memory, scheduler, ring dispatch — kernel primitives, exposed as caps but not as services.
- Interrupt delivery, DMA mapping — kernel mechanisms, exposed to drivers as caps.
- Boot measurement — if added, happens in the kernel before
BootPackageexists; the measurement agent (userspace) only reports them.
Supervision
Supervision Tree
Init doesn’t have to supervise everything directly. It can delegate:
init (root supervisor)
├─ net-supervisor (holds: spawner subset, device caps)
│ ├─ virtio-net driver
│ ├─ net-stack
│ └─ http-service
└─ app-supervisor (holds: spawner subset, service caps)
├─ my-service
└─ another-app
Each supervisor is a process that holds a ProcessSpawner cap (possibly
restricted to specific binaries) and the caps it needs to grant to children.
If net-supervisor crashes, init restarts it, and it re-spawns the entire
networking subtree.
Supervisor Loop
#![allow(unused)]
fn main() {
fn supervisor_loop(children: &[SpawnRequest], spawner: &ProcessSpawner) {
let mut handles: Vec<ProcessHandle> = children.iter()
.map(|req| spawner.spawn(req.clone()))
.collect();
loop {
// Wait for any child to exit
let (index, exit_code) = wait_any(&handles);
let req = &children[index];
match req.restart {
RestartPolicy::Always => {
handles[index] = spawner.spawn(req.clone());
}
RestartPolicy::OnFailure if exit_code != 0 => {
handles[index] = spawner.spawn(req.clone());
}
_ => {
// Process exited normally, don't restart
}
}
}
}
}
Socket Activation
systemd pre-creates a socket and passes the fd to the service on first connection. In capOS, the supervisor does the same with caps:
Eager (default): supervisor spawns the child immediately with a
TcpListener cap. Child calls accept() and blocks.
Lazy: supervisor holds the TcpListener cap itself. On first incoming
connection (or on first accept() from a proxy cap), it spawns the child
and transfers the cap. The child code is identical in both cases.
#![allow(unused)]
fn main() {
// Lazy activation — supervisor holds the listener until needed
let listener = net_mgr.create_tcp_listener();
listener.bind([0,0,0,0], 8080);
// This blocks until a connection arrives
let _conn = listener.accept();
// Now spawn the actual service, giving it the listener
spawner.spawn(SpawnRequest {
binary: "/app/web-server",
caps: caps!["listener" => listener, "log" => console.clone()],
restart: RestartPolicy::Always,
});
}
Configuration
See Storage and Naming for the full storage, naming, and configuration model.
Summary: the system topology is currently defined in a capnp-encoded
system manifest baked into the boot image. tools/mkmanifest compiles the
human-authored system.cue, system-smoke.cue, or focused manifest sources
such as system-spawn.cue, system-devicemmio-grant.cue, and
system-wasi-random.cue into the binary manifest. Default boot uses
standalone init and init-owned service-graph execution; focused shell-led
manifests still grant login/session/broker caps directly to capos-shell for
narrow smokes. Focused init-executor manifests let the separate init binary
validate and execute the manifest through ProcessSpawner; the old generic
kernel resolver has been replaced by first-service cap construction.
Manifest-declared SpawnGrantSource::Kernel entries cover the bounded DDF
authority surface (DeviceMmio, DMAPool, Interrupt, HardwareAuditLog)
and the wasm-host’s optional EntropySource grant; the WASI host adapter
(see WASI Host Adapter) and the
POSIX adapter (see POSIX Adapter)
both run as ordinary userspace processes spawned through this same path.
Remaining cleanup is to move runtime configuration into a capability-based
store service once that service exists. See also the layered CUE configuration
model in
System Configuration and Operator Extensibility.
Comparison with Traditional Approaches
| Concern | systemd/Linux | capOS |
|---|---|---|
| Service dependencies | Wants=, After=, Requires= | Implicit in cap graph |
| Sandboxing | seccomp, namespaces, AppArmor | Default: zero ambient authority |
| Socket activation | ListenStream=, fd passing protocol | Pass TcpListener cap |
| Restart policy | Restart=on-failure | Supervisor process loop |
| Logging | journald, StandardOutput=journal | Log cap in granted set |
| Resource limits | cgroups, MemoryMax=, CPUQuota= | Bounded allocator caps |
| Network access control | firewall rules (iptables/nftables) | Scoped HttpEndpoint / TcpSocket caps |
| Config format | INI-like unit files (~1500 directives) | Rust code or minimal manifest |
| Trusted computing base | systemd PID 1 (~1.4M lines) | Init process (hundreds of lines) |
Spawn Mechanism
Spawning is a capability-gated operation. The kernel provides a
ProcessSpawner capability — only the holder can create new processes.
Implemented Kernel Slice
The kernel now provides:
-
ProcessSpawnercapability — aCapObjectimpl inkernel/src/cap/process_spawner.rs. Methods:spawn(name, binaryName, grants) -> handleIndex— resolve a boot-package binary, load ELF, create address space (builds on existingelf.rsloader andAddressSpace::new_user()inmem/paging.rs), populate the initial cap table, schedule the process, and return theProcessHandlethrough the ring result-cap list- the returned
ProcessHandlecap lets the parent wait for child exit in the first slice; exported caps and kill semantics are later lifecycle work
-
Initial cap passing — at spawn time, the kernel copies permitted parent cap references into the child’s cap table or mints authorized child-local kernel caps. Raw grants preserve the source legacy badge. Endpoint-client grants may mint a requested legacy badge only from an endpoint owner or trusted parent endpoint result source; delegated client facets must preserve their existing service identity. Child-local Endpoint, FrameAllocator, and VirtualMemory grants are created for the child’s process. Child-local endpoint grants return parent-side client facets as result caps instead of sharing the endpoint owner object. The parent’s references are unaffected. Legacy endpoint badges are transitional; new multi-client service identity should use session-bound invocation context plus broker-granted service roots/facets.
-
Cap export — future lifecycle work will let a child register a cap by name in its
ProcessHandle, making it available to the parent (or anyone holding the handle). This is the mechanism behindnic_driver.exported("nic").wait()once exported-cap lookup is added.
Schema
interface ProcessSpawner {
spawn @0 (name :Text, binaryName :Text, grants :List(CapGrant)) -> (
handleIndex :UInt16,
capabilityManagerIndex :UInt16,
);
createPipe @1 (bufferBytes :UInt32) -> (readIndex :UInt16, writeIndex :UInt16);
}
struct CapGrant {
name @0 :Text;
capId @1 :UInt32;
interfaceId @2 :UInt64;
mode @3 :CapGrantMode;
badge @4 :UInt64;
source @5 :CapGrantSource;
}
struct CapGrantSource {
union {
capability @0 :Void;
kernel @1 :KernelCapSource;
}
}
enum CapGrantMode {
raw @0;
clientEndpoint @1;
move @2;
serviceObject @3;
}
interface ProcessHandle {
wait @0 () -> (exitCode :Int64);
terminate @1 () -> ();
}
Note on capability passing: Capabilities are referenced by cap table
slot IDs (UInt32), not by Cap’n Proto’s native capability table mechanism.
spawn() returns the ProcessHandle and a CapabilityManager cap through
the ring result-cap list; handleIndex and capabilityManagerIndex identify
those transferred caps in the completion. The first slice passes a
boot-package binaryName instead of raw ELF bytes so the request stays
within the bounded ring parameter buffer. terminate (deferred kill) is
implemented on ProcessHandle; post-spawn grants and exported-cap lookup
remain future lifecycle work until their authority semantics are implemented.
capOS uses manual capnp dispatch (CapObject trait with raw message bytes,
not capnp-rpc), so cap references are plain integers and typed result caps use
the ring transfer-result metadata. See
Userspace Binaries Part 7 for
the surrounding userspace bootstrap schema context, Part 4 for the POSIX
adapter surface that consumes ProcessSpawner.createPipe plus the
recording-shim fork-for-exec successor posix_spawn over the same Move-grant
path, and Part 5 for the WASI host adapter that runs as a userspace process
spawned through this same ProcessSpawner with manifest-supplied capability
grants (WASI Host Adapter).
Relationship to Existing Code
The current kernel has these pieces in place:
- ELF loading (
kernel/src/elf.rs) — parses PT_LOAD segments, validates alignment, and feeds the reusable spawn primitive behindProcessSpawner. - Address space creation (
kernel/src/mem/paging.rs) —AddressSpace::new_user()creates isolated page tables with the kernel mapped in the upper half. - Cap table (
kernel/src/cap/table.rs) —CapTablewithinsert(),get(),remove(), transfer preflight, provisional insert, commit, and rollback helpers. EachProcessowns one local table. - Process struct and scheduler (
kernel/src/process.rs,kernel/src/sched.rs) — a process table plus round-robin run queue are in place for both legacy manifest-spawned services and init-spawned children.
Generic capability transfer/release and the reusable ProcessSpawner
lifecycle path are complete enough for the focused init-owned spawn executor.
Default startup now uses standalone init for service-graph execution, while
focused shell-led startup remains for narrow smokes.
ProcessSpawner.createPipe extends the lifecycle surface with a bounded SPSC
kernel Pipe capability consumed by the POSIX adapter’s recording-shim
fork-for-exec path (P1.3) and exposed as the posix_spawn successor on the
same Move-grant path. The DDF Task 5 grant-source families
(devicemmio_grant_source.rs, dmapool_grant_source.rs, and their
interrupt/audit peers) extend SpawnGrantSource::Kernel with the bounded
manager-issued DDF authority surface; production handle lifecycle, hardware-
backed driver wait/ack dispatch beyond bounded route proofs, and the S.11.2
hostile-smoke gates remain open. Each spawned process also receives one
immutable session context (default-inherited from the parent or
broker-selected), used as the invocation subject for audit attribution and the
identity-policy boundary. Remaining lifecycle gaps are post-spawn grants, runtime exported-cap lookup,
restart supervision, and shrinking the transitional manifest schema.
ProcessHandle.terminate (deferred kill) is implemented.
Prerequisites
| Prerequisite | Status | Why |
|---|---|---|
| ELF loading + address spaces | Done (Stage 2-3) | elf.rs, AddressSpace::new_user() |
| Capability ring + cap_enter | Done (Stage 4/6 foundation) | Ring-based cap invocation with blocking waits |
| Scheduling + preemption (core) | Done (Stage 5) | Round-robin, PIT 100 Hz, context switch |
| Cross-process Endpoint IPC | Done (Stage 6 foundation) | CALL/RECV/RETURN routing through Endpoint objects |
| Generic cap transfer/release | Done (Stage 6, 2026-04-22/24) | Copy/move transfer, result-cap insertion, CAP_OP_RELEASE, epoch revocation, and revoked endpoint Disconnected error surface |
| ProcessSpawner + ProcessHandle | Done (Stage 6, 2026-04-22) | Init-driven spawn with grants, wait completion, hostile-input coverage; kill/post-spawn grants still future |
| ProcessSpawner.createPipe + recording-shim fork-for-exec | Done (POSIX adapter P1.3, 2026-05-07 09:55 UTC) | Bounded SPSC Pipe capability and Move-grant fork-for-exec successor; see POSIX Adapter §Phase P1.3 and Userspace Binaries Part 4 |
DDF bootstrap-grant sources (DeviceMmio, DMAPool, Interrupt, HardwareAuditLog) | In progress (DDF Task 5) | Bounded manager-issued authority over SpawnGrantSource::Kernel; production handle lifecycle and S.11.2 hostile smokes remain open. See device-driver-foundation.md Task 5 |
| Immutable per-process session context | Done (kernel/src/session_context.rs) | One session context per process, default-inherited or broker-selected; make run-session-context proof |
| Authority graph + quota design (Security Verification Track S.9) | Done (2026-04-21) | Defines transfer/spawn invariants, per-process quotas, and rollback rules; see docs/authority-accounting-transfer-design.md |
This proposal describes the target architecture. Individual pieces (like
Fetch/HttpEndpoint) are additive — they’re userspace processes that
compose existing caps into higher-level ones. No kernel changes needed
beyond Stages 4-6.
First Step After Transfer and ProcessSpawner — done 2026-04-23
The minimal demonstration of this architecture landed together with capability
transfer and ProcessSpawner:
ProcessSpawnercap inkernel/src/cap/process_spawner.rswraps ELF loading and address-space creation behind a typed capability.- Init spawns children — focused
make run-spawnboots a single-init manifest; the kernel boots only the separateinitbinary frominitConfig.init, theninitspawns the focused demo graph frominitConfig.servicesthroughProcessSpawner, grants child-local endpoint owners and client facets, then releases parent endpoint facets before waiting on eachProcessHandle. - Cross-process cap invocation — spawned client invokes the server’s Endpoint cap, server replies, both print to console.
This exercises: spawn cap, initial cap passing, manifest-declared export
recording, cross-process cap invocation, hostile-input rejection, and
per-process resource exhaustion paths. Deleting the unused legacy kernel
resolver is post-milestone cleanup tracked in docs/tasks/.
Open Questions
-
Restart supervision. Epoch-based cap revocation and generation-tagged stale reference detection are implemented for current grant/revoke flows. Restart policy still needs a supervisor contract that epoch-bumps caps served by the failed process, restarts from the manifest, and reconnects clients through explicit authority rather than ambient service lookup.
-
Cap discovery. How does a process learn what caps it was given? Resolved: name→(cap_id, interface_id) mapping passed at spawn via a well-known page (
CapSet). See Userspace Binaries Part 2.cap_idis the authority-bearing table handle.interface_idis the transported capnpTYPE_IDused by typed clients to check that the handle speaks the expected interface. -
Lazy spawning. Should the init process start everything eagerly, or should caps be backed by lazy proxies that spawn the backing service on first invocation?
-
Cap persistence. If the system reboots, should the cap graph be reconstructable from saved state? Or is it always rebuilt from init code?
-
Delegation depth. Can an application further delegate its
HttpEndpointcap to a subprocess? If so, the HTTP gateway needs to support fan-out. If not, how is this restriction enforced?
Proposal: Schema Registry Capability
Cap’n Proto is self-describing. When the compiler processes schema/capos.capnp
it emits a CodeGeneratorRequest containing every interface id, every method
name and ordinal, every parameter and result struct layout, every enum, and
every doc comment. That machine-readable reflection data exists today; it just
is not served at runtime. This proposal defines a SchemaRegistry capability
that serves it.
Status: Proposal. No implementation. The prerequisite work – schema
doc-comment authoring across schema/capos.capnp and preservation of those
comments in the generated-bindings pipeline – is tracked separately and is
also a prerequisite for the System Manual Phase 3.
This proposal records the design and its authority model so it can be built
once those prerequisites land.
Problem
Every capability interface in capOS has a precise machine-readable definition: method names, ordinals, parameter struct field names and types, result struct layouts, enums. Today that information lives only in the host-side compiler output, the checked-in generated Rust bindings, and in the heads of developers who have read the schema. A running capOS instance cannot answer:
- “What methods does this interface expose?”
- “What ordinal does
listMethodsmap to?” - “What fields does the parameter struct for
resolveMethodcontain?”
This gap affects three categories of caller:
- Interactive shell. A user typing
call @cap.method(args)in the capOS shell wants the shell to resolve the human method name to an ordinal, check argument types against the parameter struct schema, encode the capnp message, dispatch the call, and decode the result – all without requiring the user to have memorized ordinals or wire layouts. - Dynamic and agent-driven callers. A process or agent that receives a capability without compile-time bindings cannot easily discover what methods are available. Today it must carry out-of-band schema knowledge or guess. A machine-readable registry eliminates that gap.
- Cross-language and network tooling. A host-side tool connecting to a running capOS instance via the remote-session gateway needs schema metadata to encode and decode capnp messages without shipping language-specific generated bindings for every possible capability type.
What Cap’n Proto’s Self-Description Provides
The capnp compiler’s CodeGeneratorRequest contains:
- For interfaces: the 64-bit interface id, the interface name, and for each method: its ordinal (call slot number), method name, the type id of the parameter struct, and the type id of the result struct.
- For structs: the struct’s 64-bit type id, name, and for each field: field name, ordinal slot, and the type of the value (a primitive, a struct type id, a list, a capability type id, etc.).
- For enums: the enum type id, name, and for each enumerant: name and numeric value.
- Doc comments: the raw doc-comment text attached to interfaces, methods,
structs, and fields. These are preserved in the
CodeGeneratorRequestwhen the compiler receives them; the current generated-bindings pipeline strips them. Preserving them is a tracked prerequisite.
The registry bakes this data into a boot-packaged blob at make time, exactly
as the System Manual bakes its corpus. Both are read-only deliveries of
build-time information; neither reflects the live state of a running object.
Relationship to the System Manual
The SchemaRegistry and the Manual capability share one substrate: the same
CodeGeneratorRequest blob baked at build time. They are two delivery modes of
that shared reflection data:
- Manual (see System Manual Capability):
human prose delivery. It renders the schema into
man(2)-style interface pages withSYNOPSISsections generated from method signatures andDESCRIPTIONsections from doc comments. It serves text, structured for human reading. - SchemaRegistry: machine-readable metadata delivery. It serves structured
SchemaNodevalues carrying interface ids, ordinals, type ids, and field layouts. It serves data, structured for programmatic consumption.
The two can share a service implementation that reads the same blob; the interface shape differs because the consumers differ.
Interface
struct MethodInfo {
ordinal @0 :UInt16; # call slot number
name @1 :Text; # as written in the .capnp source
paramTypeId @2 :UInt64; # type id of the parameter struct
resultTypeId @3 :UInt64; # type id of the result struct
docComment @4 :Text; # empty until doc-comment prerequisite lands
}
struct FieldInfo {
slot @0 :UInt16;
name @1 :Text;
typeKind @2 :TypeKind;
structTypeId @3 :UInt64; # set when typeKind == struct
docComment @4 :Text;
}
enum TypeKind {
void @0; bool @1; int8 @2; int16 @3; int32 @4; int64 @5;
uint8 @6; uint16 @7; uint32 @8; uint64 @9; float32 @10; float64 @11;
text @12; data @13; list @14; enum_ @15; struct_ @16; interface_ @17;
anyPointer @18;
}
struct SchemaNode {
typeId @0 :UInt64;
displayName @1 :Text;
union {
interface @2 :InterfaceSchema;
struct_ @3 :StructSchema;
enum_ @4 :EnumSchema;
other @5 :Void;
}
}
struct InterfaceSchema {
methods @0 :List(MethodInfo);
docComment @1 :Text;
}
struct StructSchema {
fields @0 :List(FieldInfo);
docComment @1 :Text;
}
struct EnumSchema {
enumerants @0 :List(EnumerantInfo);
docComment @1 :Text;
}
struct EnumerantInfo {
value @0 :UInt16;
name @1 :Text;
}
struct SearchResult {
typeId @0 :UInt64;
displayName @1 :Text;
kind @2 :Text; # "interface", "struct", or "enum"
snippet @3 :Text; # first line of doc comment, if present
}
interface SchemaRegistry {
# Resolve a method name on an interface to its ordinal and struct type ids.
resolveMethod @0 (interfaceId :UInt64, name :Text)
-> (ordinal :UInt16, paramTypeId :UInt64, resultTypeId :UInt64);
# Fetch the full schema node for a given type id.
lookupType @1 (typeId :UInt64) -> (node :SchemaNode);
# List all methods on an interface (for discovery without a known name).
listMethods @2 (interfaceId :UInt64) -> (methods :List(MethodInfo));
# Keyword search across names and doc comments in the baked blob.
search @3 (query :Text) -> (candidates :List(SearchResult));
# The build/commit this schema blob was produced from.
buildInfo @4 () -> (commit :Text, builtAt :Text);
}
The interface is additive-only; future methods append at higher ordinals,
matching the convention already established by SystemInfo and Manual.
Authority Model
This is the most important section of this proposal. The registry embodies capOS Design Principle 4 – “the interface IS the permission” – from a different angle:
Discovery does not grant call authority.
Holding a SchemaRegistry capability lets the caller learn the shape of an
interface: its method names, ordinals, parameter field names and types, and
doc comments. It does not grant the caller permission to invoke those methods.
To call Console.writeLine, the caller still needs a live Console capability
in its CapSet. The registry answers “what can this type do in general?” – the
live capability answers “what can I do right now, and am I permitted to do it?”
This split is not a weakening of the capability model; it is the correct
expression of it. Consider the analogy: knowing that a bank offers a “transfer”
operation does not give you a bank account. Learning that ProcessSpawner
exposes a spawn method does not give you a ProcessSpawner.
Practical consequences:
SchemaRegistryis read-only and holds no authority beyond serving schema metadata from the build-time blob. It is safe to grant to any process or agent that needs dynamic method discovery.- The kernel remains the validation trust boundary. When a call arrives via
the ring, the kernel dispatches it to the named
CapObject. The registry is a client-side convenience for encoding the request correctly; the kernel validates the incoming message on dispatch and rejects malformed calls regardless of whether the caller used the registry. - A caller that uses the registry to build a call message is still subject to the normal ring dispatch path. The registry cannot bypass or relax kernel validation.
- Fail-safe. If the registry is not granted, dynamic clients fall back to compile-time bindings or refuse to operate. The registry enhances ergonomics; it is not on the critical authority path.
Data Source: Build-Time, Not Live-Object Reflection
The registry does not introspect a running object. It serves the static schema
baked from the CodeGeneratorRequest at build time. This has two consequences:
- No live-object coupling. The registry knows nothing about which capabilities are currently allocated, which processes hold them, or what runtime state a live capability has. It knows only what the schema says all instances of a given interface can do.
- Blob freshness is tied to the build. Like the System Manual blob,
buildInfo @4carries the commit and build timestamp so a caller can tell which schema version is loaded. A running instance with a stale blob reflects that build’s schema, not any live update.
Where It Lives
SchemaRegistry is a userspace service backed by a boot-packaged schema blob,
consistent with the broader capOS policy of putting metadata and policy
enforcement in userspace while the kernel handles dispatch and isolation. The
implementation mirrors the System Manual service:
- At
maketime, a host tool reads the compiler’sCodeGeneratorRequestoutput and produces a compact, read-only binary blob. - The blob is packaged in the boot image alongside the manifest, delivered like
BootPackageCapentries. - A userspace
schema-registryservice reads the blob from theBootPackage, implements theSchemaRegistryinterface, and is granted to processes that need it via the manifest cap grants or theAuthorityBrokerbundle.
The shared blob between Manual and SchemaRegistry is a build artifact.
Whether they share a single service binary or run as two services consuming the
same blob is an implementation decision; the capability interface is the
boundary.
Primary Use Cases
Shell call @cap.method(args) dispatch
The shell receives a human-typed method invocation. To dispatch it:
- Inspect the live capability to get its interface id (the
interface_idsurface is present today viacapos-lib/src/cap_table.rs). - Call
SchemaRegistry.resolveMethod(interfaceId, methodName)to get the ordinal, parameter type id, and result type id. - Use
lookupType(paramTypeId)to get the parameter struct schema and validate or interactively prompt the user for each field. - Encode the capnp message with the resolved ordinal and parameter encoding.
- Submit the call via the ring. On completion, use
lookupType(resultTypeId)to decode the result message for display.
This eliminates the requirement for the shell to carry compile-time knowledge of every capability interface ordinal and struct layout.
Dynamic / Late-Bound Clients
A process or agent that receives a capability without compile-time bindings
can call listMethods(interfaceId) to enumerate what the interface supports,
then use resolveMethod for each call it intends to make. This enables
generic capability explorers, cross-version bridges, and agent-driven
automation that adapts to the interface rather than hardcoding ordinals.
Cross-Language and Network Tooling
A host-side tool connecting to a running capOS instance via the remote-session
gateway fetches the schema blob via lookupType / listMethods calls relayed
through the remote session, and uses the result to encode and decode capnp
messages in any language that has a capnp parser. This decouples the tooling
from language-specific generated bindings.
Schema-Driven Test Harnesses
A test harness can use the registry to enumerate all methods on an interface and generate exerciser calls with synthetic arguments, validating that the live capability handles all known methods without panicking – a form of schema-conformance fuzzing driven by the registry itself.
Sequencing and Prerequisites
Two prerequisites are shared with the System Manual Phase 3:
- Doc-comment authoring in
schema/capos.capnp. The schema currently carries minimal doc comments. ThedocCommentfields in the registry’s schema nodes will be empty until this authoring work lands. The registry is still useful without doc comments – method names, ordinals, and struct layouts are fully present – but the schema-as-documentation story depends on this work. - Doc-comment preservation in the generated-bindings pipeline. The
tools/capnp-buildscript currently strips doc comment text from the emitted Rust bindings. The registry’s blob builder must read the rawCodeGeneratorRequestbefore that stripping occurs, so this prerequisite is about pipeline ordering, not a new tool.
The registry interface and blob format can be designed and the boot-packaging
infrastructure written before those prerequisites land; the docComment fields
start empty and are populated once the prerequisite lands.
Relationship to Existing Proposals
- System Manual (System Manual Capability):
the human-readable twin. Both share the
CodeGeneratorRequestblob source; neither is a prerequisite of the other. They can be built in either order or together. - SystemInfo proposal (System Info Capability):
SystemInfo provides scalar system facts;
SchemaRegistryprovides interface metadata. No overlap. - Interactive command surfaces (Interactive Command Surfaces):
a future typed
CommandSessionmay use the registry to validate command arguments before dispatch. - Remote-session UI (Remote Session CapSet Clients): host-side tooling that relays capability calls through the remote session is a primary consumer of the registry’s cross-language tooling use case.
Open Questions
- Blob sharing or dual instantiation? The System Manual and Schema Registry share a blob source. Whether they are implemented as one service that exposes two capability interfaces or two separate services that each read the blob at startup is an implementation choice. Two interfaces, one service is the likely outcome; this should be decided when the first implementation starts.
- Schema node format evolution. As
schema/capos.capnpevolves, the blob format must evolve with it. Whether the blob is a verbatimCodeGeneratorRequestwire encoding, a normalized subset, or a purpose-built indexed structure is a build-tool design question. - Search index. The
searchmethod needs a keyword index built into the blob atmaketime rather than a linear scan. The index strategy (inverted index over name tokens and doc comment words) should be decided when the blob builder is implemented.
Design Grounding
- Cap’n Proto reflection model and
CodeGeneratorRequestwire format:capnpcrate documentation and the capnp language reference. - Interface id and
interface_id()surface:capos-lib/src/cap_table.rs. - Boot-packaged blob delivery pattern:
kernel/src/cap/boot_package.rs. - Shared substrate with the System Manual: System Manual Capability, particularly the “schemaReflection source” section.
- Authority model grounding:
docs/capability-model.mdand Design Principle 4 inCLAUDE.md.
Proposal: Session Archive and Gantt Effort Pipeline
Development tasks in capOS each carry a real start and finish time. The autonomous development loop records these directly for tasks it executes; for earlier work the timing is recoverable from agent session transcripts. Collected together and attributed to branches and tasks, that timing data enables two things: a whole-history development Gantt and a dataset for predicting how long a future task will take.
Status: Proposal. The foundation is partially landed. A per-day task ledger
exists in docs/tasks/done/, where each done entry carries the real branch
commit SHAs and, for tasks executed by the autonomous development loop, real
started and completed timestamps sourced from the run-telemetry log. A
prepare-commit-msg hook stamps Plan-Item, Run-Id, and Agent-Kind
trailers on commits so the commit-to-task-to-run mapping is native to git
history. The session-transcript ETL, the derived dataset builder, and the
duration-prediction model are future work this proposal scopes.
Goals
- Predict how long a future task will take from historical effort patterns, using features derivable from the task’s commits and metadata.
- Render a whole-history development Gantt over the landed branch and task ledger, attributing each interval to the task that produced it.
- Feed that data back into planning: size estimates, milestone forecasting, and identification of subsystems or slice classes that consistently take longer than anticipated.
Timing Sources
Two sources provide per-task effort data, at different points in the project timeline:
Run-telemetry log (loop-era tasks). The autonomous development loop writes
a record per task run to a local telemetry log. Each record carries: a run id,
the task id, the agent kind, a session id, a started timestamp (when the
agent began), and a completed timestamp (when the agent finished and the
branch was merged or abandoned). These timestamps are exact wall-clock values,
not estimates. They are written to the local run-telemetry log (ephemeral, not
committed) and promoted to the task’s done/ file as started: and
completed: front-matter fields when the task closes. That promotion is the
boundary between local operational state and the durable public record.
Agent session transcripts (pre-loop history). For tasks worked before the autonomous development loop existed, timing must be reconstructed from agent session transcripts. Two transcript formats exist in the project history:
- A Claude session JSONL format: one JSON object per turn, with a UTC timestamp,
a role (
userorassistant), message content, and tool-call records. - A Codex session-rollout format: a structured log of model turns with file edits, shell commands, and timestamps.
Both formats carry enough information to recover: when a session started, which files were touched, which repository and branch were active, and approximately when the session ended (last turn timestamp). Cross-tool interval merging (a task worked in two different tools during the same calendar day) is a rare edge case; in practice each task belongs primarily to one tool and one continuous session window.
Pipeline
The pipeline has four stages:
1. Collect
Gather transcript files from wherever they reside. The Claude JSONL transcripts are stored under a well-known local path per session. The Codex rollouts are scattered across machines and backup directories and must be enumerated by a manifest or directory scan. Neither format is committed to the repository; they are local/backup artifacts. The collect stage produces a manifest of transcript files keyed by session id and format type.
2. Normalize
Parse each transcript into a common event schema:
{
"session_id": "...",
"format": "claude-jsonl" | "codex-rollout",
"started_at": "<UTC ISO timestamp>",
"ended_at": "<UTC ISO timestamp>",
"repo": "<repo name>",
"branch": "<branch name or null>",
"files_touched": ["<relative path>", ...],
"tool_calls": <count>,
"role_turns": <count>
}
The started_at and ended_at values are the first and last turn timestamps
in the session. For the duration estimate, idle time between turns (long pauses
between user and assistant turns, or overnight gaps within a session file) is
clipped: only contiguous active intervals – where consecutive turn timestamps
are within a configurable idle threshold – count toward the active duration.
The result is an idle-clipped active duration attributed to the session.
Per-task effort is the sum of idle-clipped active durations across all sessions
whose branch matches the task’s task branch. For tasks with a single session
this is trivial; for tasks where a session covered multiple branches, the
attribution is prorated by file overlap or left to manual annotation.
3. Recap and Index
After normalization, a recap step produces a per-task effort index: task id,
branch, real started/completed timestamps (from the run-telemetry promotions for
loop-era tasks, from the session-normalized estimate for pre-loop tasks), idle-
clipped active duration, agent kind, and the commit SHAs that belong to the
task. This index is written to a structured file (JSON Lines, one record per
task) under target/ during the build and is the input to the dataset builder
and the Gantt renderer. It is a derived artifact; the sources of truth are git
history, the docs/tasks/done/ ledger, and the transcript files.
4. Store in Object Storage
The normalized transcript archive and the per-task effort index are stored in object storage (GCS or S3) under a versioned prefix. This serves two purposes: it makes the archive portable across machines, and it provides a stable input for the prediction dataset builder that does not depend on the local transcript directory layout. The object storage upload is a manual or CI-triggered step, not part of every build.
Commit Provenance
The prepare-commit-msg hook (landed at tools/githooks/prepare-commit-msg)
stamps three trailers on every commit:
Plan-Item: <task-id>– the task this commit belongs to.Run-Id: <run-id>– the run-telemetry log entry for this work session.Agent-Kind: <kind>– which implementation agent produced the commit.
These trailers make the commit-to-task and commit-to-run mappings native to git
history and queryable by git log --grep. A Gantt renderer can walk git log
and group commits by Plan-Item, attributing intervals to tasks without any
external database. The run-telemetry log fills in wall-clock start/end; git
provides the commit sequence and churn metrics.
Prediction Dataset
The prediction dataset is a derived artifact built by a script from git history and the per-task effort index. It is not stored in task front matter; the task front matter carries only the real timestamps and commit SHAs, not derived features.
Features (X): per-task git-derived metrics over the task’s commits: list:
- Commit count.
- Churn: insertions + deletions.
- Files changed (total and unique).
- Subsystems touched: a subsystem label per changed file, derived from the
directory prefix (e.g.
kernel/,capos-lib/,docs/,schema/). - Categorical fields: milestone/track, slice class (
behavior,read-side-proof,harness-hardening,docs-status), hazard families checked.
Label (y): real effort in minutes – the idle-clipped active duration from the session archive, or the run-telemetry-derived interval for loop-era tasks.
Granularity: one record per branch merge (feature/work-unit granularity). This matches the size at which future tasks are dispatched and avoids the noise of per-commit or per-day fragments. Tasks that span multiple branches (a prerequisite branch plus a follow-up) are modeled as separate records linked by a dependency field; the prediction target is per-branch, and milestone forecasting aggregates across the dependency graph.
Model: a regression over the feature set above, using a simple baseline (linear regression or gradient-boosted trees) before investing in anything more complex. The first useful output is a p50/p90 interval per slice class and subsystem combination, not a precise point estimate.
Gantt Rendering
The Gantt is rendered from the per-task effort index: each task becomes a bar
spanning its started_at to completed_at (or started_at plus active
duration for pre-loop tasks where only the duration is reliable). Tasks are
grouped by milestone and slice class, and bars are colored by subsystem. The
output is a static SVG or a simple HTML/SVG file – not an interactive
dashboard. The rendering script reads the per-task effort index from target/
and writes target/gantt.svg or target/gantt.html. It is not part of the
default build.
Sequencing and Prerequisites
The following are already landed:
docs/tasks/done/ledger with realstarted:andcompleted:fields for loop-era tasks.prepare-commit-msghook stampingPlan-Item,Run-Id, andAgent-Kindtrailers.- Run-telemetry log entries for loop-era tasks (local, ephemeral).
The following are future work:
- Transcript collector and normalizer. Write parsers for the Claude JSONL and Codex rollout formats, the idle-clipping logic, and the per-task effort index builder. This is a standalone Python or Rust host tool; no kernel changes.
- Backfill pass. Run the normalizer over the existing transcript archive
to populate pre-loop effort estimates for tasks in
docs/tasks/done/. Where transcripts are unavailable, leave the duration field asnullwith asource: unavailableannotation; do not invent estimates. - Object storage upload. Configure the archive upload to GCS or S3 and set up the versioned prefix scheme.
- Dataset builder. Write the script that joins the per-task effort index with git metrics to produce the prediction dataset.
- Baseline model. Train and evaluate the baseline duration-prediction model
on the dataset. Publish the p50/p90 per-slice-class table as a static
docs/page once it has enough data to be meaningful. - Gantt renderer. Write the script and add a
make gantttarget.
Steps 1-2 can proceed independently of 3-6 and are the highest-value items: the backfill populates the effort ground truth that all downstream uses depend on.
Authority and Privacy
- The transcript archive and the run-telemetry log are not committed to the repository. They are local/private artifacts.
- The per-task effort index written to
target/is a derived artifact and is gitignored; it may contain session ids and durations but no message content. - The
docs/tasks/done/entries carry onlystarted:,completed:, andcommits:fields sourced from the telemetry and git; they do not carry message content, file system paths, or host-identifying information. - The prediction dataset contains only git-derived metrics and duration labels; no transcript content.
Relationship to Existing Proposals
- Task State and Agent Telemetry (task-state-and-agent-telemetry-proposal.md): the task file schema and run-telemetry structure that this proposal reads from. The two proposals are complementary: that proposal defines the task lifecycle and local operational state; this proposal defines what to do with the timing data once it exists.
- agentic development experiment (capOS Agentic Development Experiment): the autonomous development loop whose run-telemetry log is the primary timing source for loop-era tasks.
Open Questions
- Idle threshold. What inter-turn gap counts as idle and is excluded from active duration? A 30-minute threshold is a reasonable starting point; the right value depends on the observed gap distribution in the transcript archive.
- Multi-branch tasks. Some tasks span a prerequisite branch plus a follow-up
fix branch. The current model treats each merge as a separate record; a cleaner
approach may be a
parent-task:field in the task front matter so the effort can be rolled up. - Backfill completeness. Transcript files from early project history may be incomplete or unavailable. The normalizer must handle missing sessions gracefully; the dataset must mark incomplete records rather than imputing durations.
- Model selection. Whether a simple linear baseline is sufficient or whether a richer model (gradient-boosted trees, conformal prediction intervals) is warranted depends on the dataset size and variance. Defer this decision until the backfill pass is complete and the distribution is known.
Design Grounding
- Task ledger schema and run-telemetry promotion:
docs/tasks/README.mdand task-state-and-agent-telemetry-proposal.md. - Commit-provenance trailers:
tools/githooks/prepare-commit-msg. - Slice-class vocabulary and hazard families:
CLAUDE.md(Autonomous Slice Hygiene section) andREVIEW.md.
Proposal: Session-Bound Invocation Context
Current design authority now lives in Session Context, with endpoint transport details in IPC and Endpoints. This proposal is retained as the archival decision record for why capOS replaced caller-selected endpoint identity and the service-object migration with session-bound invocation context.
Replace caller-selected endpoint identity and the Service Object Identity Migration with a simpler invariant: every process runs in exactly one live session context. The kernel attaches that context to invocations and enforces privacy/transfer invariants, but does not reveal subject details to endpoint servers unless the call explicitly requests disclosure and policy allows the requested fields through a broker/service disclosure scope.
Capabilities decide what a process may call. The calling process’s session context says who invokes, subject to privacy rules. Services receive only the minimum routing/privacy metadata required by the invoked capability; request fields remain ordinary data and must not select authority or caller identity.
Problem
The prior service-object direction fixed a real bug: clients must not be able to choose a service-visible numeric badge during spawn or IPC delegation. The design then added service-minted object capabilities and a subject/proof open protocol so services could bind identity without trusting request payloads.
That is too much machinery for the intended capOS process model. Normal workload processes should not be bags of unrelated user sessions. They should have one immutable session context, assigned at spawn, and all invocations from that process should be attributable to that context. Delegated-subject on-behalf-of behavior is a separate design and is intentionally out of this first implementation path.
The target should therefore remove the caller-selected badge without replacing
it with a second service-object identity system. For a service such as chat,
holding ChatRoot already means the process may attempt to join chat under its
own session. More granular authority can come from narrower capabilities
granted by AuthorityBroker, not from client-selected receiver selectors or
local proof tokens on every open call.
Decision
capOS adopts these invariants:
- Each process has exactly one immutable
SessionContext. - The session context is assigned at spawn and shared by all threads in that process.
- System services run under explicit service/system sessions.
- Network gateways create or select a session for each admitted connection and spawn per-session workers or shells; they do not run multiple user sessions as ambient subject context inside one ordinary workload process.
- Endpoint CALL delivery includes a privacy-preserving caller-session reference and optional freshness result, not full subject metadata by default.
- A held capability is the authority to invoke service root methods such as
ChatRoot.join; the caller session supplies the invocation subject context. Services learn principal, profile, or display metadata only through explicit disclosure. - Request fields such as
user,role,participant,principal, orsessionare data. Services may validate them against the caller session, but they do not identify the caller or authorize by themselves. - Subject disclosure is opt-in and policy-bounded. A call must explicitly ask for disclosure, and the requested fields must be allowed by a service-specific disclosure capability/scope. Without both signals, the server gets only an opaque session-local handle suitable for same-session state and audit correlation within that service.
- Cross-session capability transfer is supported when the transferred cap’s transfer scope permits it. The transferred cap carries invoke authority; the receiver’s session remains the invocation subject. Session-local caps require an explicit broker or service regrant operation.
The existing synthetic service-object routing proof remains useful as evidence that request bytes cannot spoof endpoint receiver metadata, but the service object identity model is no longer the active design direction.
Normative Invariants
- Every normal workload process has exactly one immutable
SessionContext. SessionContextis installed only by trusted spawn, session-manager, or broker paths; request payloads, shell strings, manifest data, endpoint receiver metadata, and copiedUserSessioncaps cannot mutate or replace it.- Capability possession remains the authority to invoke an interface. A live session without the target capability cannot call the target service.
- A normal endpoint call from a dead, revoked, or stale workload session fails closed, except for explicitly designated recovery, logout, or renewal caps.
- Session liveness is a revocable lease state, not only a timestamp embedded in immutable process metadata. A session may be live, logged out, revoked, expired, or recovery-only.
- Renewal must not relabel an existing process to a different session subject and must not blindly revive all previously issued grants. Renewal either extends the existing session liveness record under policy or returns fresh broker grants with distinguishable grant/session epochs.
- Endpoint default delivery never includes global principal, profile, account, role, tenant, external-claim, auth-factor, display-name, or source-network fields.
- Subject-detail disclosure requires both an explicit method/call disclosure request and a matching service-scoped disclosure scope.
- Disclosure is field-granular and service-scoped; an opaque session reference from one service is non-portable and non-authority-bearing in another.
- Cross-session raw cap transfer is rejected unless the cap’s transfer scope permits it.
- After an allowed cross-session transfer, the receiver process session is the invocation context; raw transfer never implies act-on-behalf-of source session semantics.
service_regrant_onlycaps cannot cross sessions through raw copy, move, IPC, or spawn grants. A service or broker regrant path must mint the target session authority explicitly.- Legacy receiver metadata remains internal transport state. It must not be user-facing syntax, manifest policy, subject disclosure, or service identity.
Authority And Context
Capability possession answers one question:
May this process invoke this capability/interface at all?
It does not answer:
Which live session is this invocation attributable to?
Is that session still fresh?
Which resource/profile bucket should pay for server-side state?
What subject facts may this service learn?
May this capability be transferred into another session?
Those are invocation-context and disclosure questions. The split is deliberate.
ChatRoot can mean “the holder may ask chat to join”; it does not by itself
tell chat whether the call is from an operator, a guest, an anonymous Telnet
session, or an expired session, nor whether chat may see a global principal id.
A service decision has three layers:
capability authority
+ invocation subject context
+ service-local policy/state
Only the first layer is authority to invoke. The session layer supplies information about who invokes, freshness, resource/accounting labels, and what may be disclosed to the service. Service-local policy may accept or reject the operation based on that information, but the session context is not a second capability.
Examples:
ChatRootmeans the holder may ask chat to join, subject to chat policy and whatever session facts the call explicitly requests and broker/service policy makes available to chat.ChatModeratormeans the holder may call moderator methods, again under the caller’s live session.TerminalSessionmeans the holder may read/write that terminal endpoint, but audit and policy still see the process session.
Session-bound invocation context exists so services can make those second-order decisions without trusting payload fields and without forcing the kernel to reveal private subject metadata to every endpoint server. The kernel can say “this call came from a live session and here is an opaque service-scoped reference”; the service or broker can decide whether that is enough, whether a guest-specific facet is required, or whether the user must explicitly disclose bounded subject facts.
The kernel enforces capability possession, process session assignment, and disclosure invariants. It may report freshness/liveness as invocation context. Session expiry should bound behavior through capability lifecycle, broker refusal, or service policy, not by treating the session context itself as a second authority. The kernel still does not interpret chat rooms, handles, moderator state, adventure players, account roles, OIDC claims, or tenant groups.
Privacy And Disclosure
Session-bound invocation context must not become ambient subject leakage. A service should not receive global principal identifiers, account names, display names, profile names, external issuer keys, group claims, auth factors, source network, or tenant metadata merely because a process called an endpoint.
The default endpoint metadata is privacy-preserving:
caller_session_ref = opaque, service-scoped, non-portable reference
session_live = true/false or epoch/freshness result
That is enough for a service to keep per-session state, reject stale sessions, and correlate its own audit events without learning a broader identity.
Current proof implementation:
scoped_ref: low 64-bit ABI field of the opaque reference.scoped_ref_hi: high 64-bit ABI field of the opaque reference.epoch:u64.derivation: HMAC-SHA256 with an entropy-backed boot key, a non-reused endpoint service-scope id, and the kernel session id.
The ABI layout is preserved, but the old unkeyed low-half value is not. Both
scoped_ref and scoped_ref_hi are halves of the keyed opaque reference.
epoch is a separate domain-separated keyed value so service-local
freshness/audit correlation rotates with the same boot key and endpoint scope
without being folded into the opaque reference itself.
Current caller_session_ref derivation rules:
width:
128 bits minimum for the opaque reference, separate from freshness epoch.
derivation:
keyed opaque value over boot secret, service scope, and kernel session id.
scope:
a non-reused endpoint service-scope id plus the boot-scoped key. Endpoint
object replacement or boot-key replacement intentionally rotates the
reference. Stable service-audit identity across upgrades remains future work.
reuse:
logout/login or session recreation gets a new kernel session id and therefore
a new service-scoped reference.
stale epoch:
stale references may remain recognizable to the same service for bounded
audit/denial correlation, but they must not become live again after expiry.
service move/upgrade:
endpoint replacement currently breaks correlation. Retaining correlation
across service replacement requires a future stable service-audit scope.
privacy:
global principal, account, profile, display name, auth source, and tenant
metadata are not derivable from the opaque reference without broker/audit
disclosure authority.
Richer disclosure requires both an explicit act and an allowed policy scope:
- the client calls a method whose contract requests disclosure, such as
ChatRoot.join(discloseProfile = true, handle = "alice"), or transfers aSessionDisclosurecapability as part of that call; AuthorityBrokeror service policy grants a root/facet with a matching disclosure scope, such as “chat may see display name and profile class”;- an administrator-configured system service may expose methods whose contract explicitly requests audit disclosure, but those methods still need bounded service policy for the fields they receive.
Disclosure should be minimized and service-scoped. A chat service may need a display name, guest/operator class, and per-service audit pseudonym. It does not need raw OIDC claims, credential identifiers, account-store records, or global principal ids unless a later policy explicitly grants that.
Session Context
A SessionContext is kernel-carried metadata minted through trusted session
creation paths and installed by ProcessSpawner:
SessionContext {
session_id,
principal_id,
principal_kind,
auth_strength,
policy_profile_id,
resource_profile_id,
created_at_ms,
expires_at_ms,
epoch,
}
The exact ABI can be smaller in the first implementation. The required properties are immutability for the process lifetime, a stable kernel-visible session id for enforcement, a service-scoped opaque reference for default endpoint delivery, and enough freshness metadata for brokers/services to fail closed or revoke/withhold capabilities when a session expires or is revoked. These conceptual fields may exist in trusted session storage. They are not endpoint-delivered default metadata. Endpoint delivery gets only a service-scoped opaque session reference and liveness/freshness result unless an explicit disclosure request and matching disclosure scope allow named fields.
The session context is not a replacement for capabilities. A process with a
valid operator session but no ChatRoot cannot join chat. A process with
ChatRoot but an expired session should lose or fail to refresh the
capability authority that was issued for that session.
Session Lifecycle, Logout, And Renewal
The completed milestone proves fail-closed stale-session behavior for current
shell and endpoint authority. Follow-up lifecycle slices now provide a
kernel-backed mutable liveness record for SessionManager-minted sessions,
remote gateway logout/close propagation, and endpoint RETURN cleanup for
already-admitted calls after caller logout/session death. Fixed wall-clock
expiry is still not a usable long-running interactive policy by itself:
production session lifecycle also needs revocation, renewal/recovery, live
proxy cleanup, audit reason separation, and a dedicated result-cap move-source
rollback proof. Clean local owner-shell exit now calls the held
UserSession.logout() before process exit; richer shell replacement and
renewal UX remains future work.
The intended liveness model is:
SessionContext {
session_id,
principal_id,
principal_kind,
auth_strength,
policy_profile_id,
resource_profile_id,
created_at_ms,
liveness_cell_id,
}
SessionLivenessCell {
session_id,
session_epoch,
state: live | logged_out | revoked | expired | recovery_only,
not_before_ms,
not_after_ms,
policy_epoch,
resource_profile_epoch,
audit_record_id,
}
SessionContext remains immutable for the process lifetime. The liveness cell
is trusted session-manager state that can be logged out today and later
revoked, expired, or renewed. This preserves the one-session-per-process
invariant while allowing usable session renewal and explicit logout. A process
cannot install a different session id into itself; if policy requires a new
subject, the broker launches a replacement process or shell with a new
SessionContext.
This splits lifetime checks into three composable layers:
session liveness:
Is this process's invocation subject still live?
grant lease:
Is this broker-issued bundle or individual grant still valid?
object/facet epoch:
Has the target live object/facet generation been revoked or replaced?
The kernel or trusted wrapper caps should check session liveness before normal endpoint enqueue, before local non-endpoint shell-bundle operations, and before installing fresh result caps into a caller. Broker-issued caps may additionally bind to grant leases. Service objects and endpoint-backed facets keep using object epochs or service-specific revocation for target invalidation. The current endpoint RETURN path rechecks caller liveness before copying result bytes, application-exception payloads, result-cap records, or returned caps into the caller; stale returns cancel the in-flight call and notify the caller with invoke-failed when a completion can be posted.
Renewal is a narrow recovery operation, not generic authority resurrection:
- pre-expiry renewal may extend the liveness cell when account state, policy epoch, resource profile, auth freshness, and maximum lifetime permit it;
- post-expiry calls are limited to explicit logout, renewal, recovery, and bounded self-diagnostic methods;
- renewal returns fresh grant leases or wrapper caps when existing grants need a new policy decision;
- old ordinary grants do not become fresh merely because the session renewed;
- explicit revocation beats renewal except for a separately named recovery policy;
- password-authenticated local shells should default to explicit logout, terminal/connection close, process-tree exit, or administrator revocation rather than an unavoidable short wall-clock TTL. Idle lock, step-up, or renewal prompts are policy options, not kernel authority rules.
Logout and clean owner-shell exit close the liveness cell for sessions owned by
that shell or gateway through UserSession.logout(). Closing the shell process
still releases local cap table edges through process-exit cleanup, but session
logout is the operation that makes the session no longer live for retained
session-bound grants, children, and future broker decisions.
Kernel Contract
The kernel should enforce generic mechanics only:
- A process has one session context pointer or compact session descriptor.
- Spawning a child requires selecting the child’s session context. The default is to inherit the parent’s session; creating a different session is broker or session-manager capability authority.
- Session expiry is represented as freshness metadata and capability lifecycle:
normal workload endpoint calls from dead, revoked, or stale sessions fail
closed except for explicit recovery, logout, or renewal caps. The current
implementation rejects stale normal endpoint invocations before transfer
preparation or enqueue, rejects fresh shell-bundle minting for stale sessions,
and expires retained broker-issued non-endpoint shell bundle caps at their
bound session boundary.
RestrictedLauncherrejects spawn/list calls after the session it was minted for expires, and broker-issuedSystemInforesults are session-bound wrappers. The current endpoint RETURN path also rejects already-admitted returns after caller logout/session death before installing result bytes, application-exception payloads, result-cap records, returned caps, or move-source commits into the stale caller. The session context itself is not the authority being invoked. Remaining lifecycle work should extend the mutable liveness cell from logout to administrator revocation, recovery-only state, and pre-expiry renewal without relabeling a running process. - Endpoint delivery includes privacy-preserving caller session metadata alongside the existing method, params, transfer descriptors, and result target. It must not include subject details unless the SQE/method contract explicitly requests them and a granted disclosure scope permits them. The current implementation uses a CALL SQE disclosure mask intersected with cap-held disclosure scope for field-granular delivery; unsupported fields are rejected or narrowed, and global principal ids and display names remain absent from default endpoint metadata.
- Capability transfer checks session scope. Same-session transfer preserves the held cap. Cross-session transfer is rejected unless the cap is explicitly cross-session-shareable or the transfer is the result of a broker/service delegation method.
- Legacy receiver metadata remains transport state only. It must not be exposed as user-facing identity syntax, manifest policy, service capability, or a workaround for subject disclosure.
The kernel should not validate external tokens, parse account stores, evaluate roles, or choose application objects.
Broker And Service Contract
AuthorityBroker and related session services decide which capabilities a
session receives:
SessionManager.login/guest/anonymous -> UserSession metadata/control cap
trusted broker/session-manager spawn path -> child SessionContext
AuthorityBroker.shellBundle(session) -> launcher fixed to that SessionContext,
ChatRoot, SystemInfo, ...
For basic local service access, no additional subject/proof token is required. The process session context supplies caller information and a default service-scoped session reference, and the held capability supplies access to the service. Human-readable or policy-rich subject details are separate disclosure, not automatic endpoint metadata.
UserSession remains useful as an informational/control capability and broker
input. It is not itself the ambient invocation subject, and copying it into a
process cannot install a second process session. A trusted broker or
session-manager path may use a verified UserSession to spawn a child with a
matching immutable SessionContext; ordinary cap transfer only transfers that
capability object.
External assertions still stop at the admission boundary. OIDC, passkey,
certificate, cloud workload, or SSH-authenticated claims are validated by
admission/session services, normalized into a local or pseudonymous session,
and then disappear from ordinary application calls. Chat should not parse OIDC
claims, and ChatRoot.join should not require a bearer proof object merely to
learn who the caller is.
Chat Flow
The target chat flow is:
login/setup/guest
-> UserSession metadata/control cap
trusted broker/session-manager spawn path
-> child process with SessionContext(operator or guest)
AuthorityBroker.shellBundle(session)
-> ChatRoot if the profile may use chat
spawn chat-client with inherited session and ChatRoot
chat-client:
ChatRoot.join(channel = "general", handle = "alice")
The kernel delivers the endpoint call with privacy-preserving caller session metadata:
target = ChatRoot
method = join
caller_session_ref = chat-scoped opaque session reference
session_live = true
payload = { channel = "general", handle = "alice" }
chat-service checks:
- the caller holds
ChatRoot; - the caller session is live;
- the requested channel and handle are syntactically valid request data.
Then it stores service-local state keyed by the caller session:
ParticipantRecord {
caller_session_ref,
service_assigned_member_label,
optional_disclosed_display_name,
joined_channels,
quota_bucket,
audit_context,
}
If chat needs to distinguish operator from guest, use explicit disclosure with
a matching disclosure scope. If chat only needs narrower behavior, the broker
may grant GuestChatRoot with behavior that encodes the policy without
revealing subject fields. The service should not receive the global principal
id by default.
Later calls can use the same root/facet capability:
Chat.send(channel = "general", text = "hi")
Chat.poll(max_events = 32)
Chat.who(channel = "general")
If the service permits multiple handles for one session, it may return a
server-issued participant_id as data. That id must be scoped to the caller
session and validated on every use:
Chat.send(participant_id = 7, channel = "general", text = "hi")
participant_id = 7 is not transferable authority. A different session cannot
use it unless chat or the broker performs an explicit share/delegation
operation.
Moderator behavior is a narrower capability, not a generic role bit in a payload:
AuthorityBroker.shellBundle(operator_session) -> ChatModerator
ChatModerator.kick(participant_id, channel)
The call still carries the operator session for audit and policy.
Transfer Rules
Same-session delegation is ordinary capability transfer:
operator shell -> child helper in the same session
transfers ChatRoot or ChatModerator
The child acts under the same session context, so no subject ambiguity exists.
Cross-session transfer is where the distinction matters most:
capability transfer carries authority to invoke;
the receiver process session supplies who invokes.
If session A transfers a cap to session B and the transfer is allowed, later calls are made by session B, not by session A. The service sees the transferred capability as the invoked authority and session B as the invocation subject context. It must not infer that session B is impersonating session A merely because the cap originally came from A.
This is acceptable for caps whose semantics are deliberately shareable, such as a read-only document, a public chat invite, or a scoped terminal endpoint intended for handoff. It is wrong for caps that encode session-local standing, such as “my chat participant”, “my account settings”, or “my active adventure player”, unless the service explicitly defines what sharing means.
Therefore caps need an explicit transfer scope:
same_session: may move/copy only to processes with the same session context;cross_session_shareable: may be transferred to another session and then invoked as the receiver’s session;service_regrant_only: cannot be raw-transferred across sessions; the holder must ask the service or broker to issue a new cap for the target session.
Session-local services that want to share state across sessions should use an explicit regrant/share path:
Chat.share(participant_id, target_session_or_invitation)
AuthorityBroker.delegate(source_session, target_session, requested_cap)
The service or broker records the policy decision and mints or grants the appropriate capability for the target session. Raw transfer of a session-scoped cap across sessions must fail closed unless the cap has an explicit cross-session-shareable scope.
This keeps privacy and accountability aligned. The transferred cap is not a portable identity token for the source session. If the receiver invokes it, the receiver’s session context is used for audit/disclosure by default. If the service needs to preserve source attribution, it should encode that as service-local state during an explicit share/regrant operation, not rely on the kernel to attach source-session subject data to future receiver calls.
The useful matrix is:
cap transfer only:
receiver gets authority to invoke;
receiver invokes as its own process session.
service regrant:
service or broker issues a new target-session capability;
future calls still invoke as the target process session.
What Happens To Service Object Routing
The synthetic service-object routing proof added in commit a4655f0 should not
drive the next design step. Its useful artifacts are narrower:
- delegated-client relabeling is contained;
- receiver-cookie spoofing through request bytes is tested;
- close/revoke/stale-cookie paths have coverage;
- internal receiver metadata can be generation-checked.
Those mechanics can remain as low-level transport tests. They are not the application authority model. The completed migration stopped before subject/proof root opening and shared-service conversion to service object capabilities.
Migration Plan
- Record this proposal as the selected Stage 6 direction and mark Service Object Identity Migration as superseded.
- Add the kernel/process invariant: every process has exactly one immutable session context, including explicit service/system sessions.
- Thread caller session metadata through endpoint CALL delivery.
- Define session freshness propagation and the cap lifecycle rule needed to close the open review finding: expired sessions must not continue to receive or refresh interactive capability authority.
- Define cap transfer scopes for
same_session,cross_session_shareable, andservice_regrant_only. - Replace chat’s legacy receiver-selected member identity with session-keyed
participant state and broker-granted
ChatRoot/ChatModeratorfacets. The first chat migration is implemented for ordinaryChatmembership: member records are keyed by the endpoint caller-session key, visible member labels are service-assigned, and join handles remain non-authority request data. - Apply the same pattern to adventure and terminal/stdio bridges. Aurelian ordinary player state is keyed by live endpoint caller-session metadata instead of receiver badges. Terminal output requires live caller-session dispatch, and shell-serviced stdio bridge waits bind to opaque live caller-session metadata while rejecting mismatched callers. Focused adventure NPC/chat authority is broker- or manifest-issued rather than caller-chosen.
- Retire user-facing badge/receiver selector syntax. Keep receiver metadata only as internal endpoint transport state or hostile-test fixture.
Non-Goals
- Reintroducing POSIX
uid/gidauthorization. - Allowing clients to choose identity through request bytes.
- Making external tokens ordinary application-service credentials.
- Delegated-subject or act-on-behalf-of semantics; those belong in a separate proposal and should not block this first implementation path.
- Preserving Service Object Identity Migration as the active design.
- Building network-transparent object references in this slice. Future remote-capability transport is grounded separately in Spritely, OCapN, and CapTP and must preserve this proposal’s local rule that sessions are broker/kernel-attached, not chosen by request bytes.
Open Questions
- Whether all caps are
same_sessionby default, or whether every cap entry should carry an explicitsame_session,service_regrant_only, orcross_session_shareablescope. - How much session metadata should be copied into endpoint delivery headers
versus looked up by
session_idin a kernel/session table. - Whether multi-connection gateways must always spawn per-session workers, or may multiplex unauthenticated transport while delegating all session-bearing work to child processes.
Proposal: Storage, Naming, and Persistence
What replaces the filesystem in a capability OS where Cap’n Proto is the universal wire format.
The Problem with Filesystems
In Unix, the filesystem is the universal namespace. Everything is a path:
/dev/sda, /etc/config, /proc/self/fd/3, /run/dbus/system_bus_socket.
Paths are ambient authority — any process can open /etc/passwd if the
permission bits allow. The filesystem conflates naming, access control,
persistence, and device abstraction into one mechanism.
capOS has capabilities instead of paths. Access control is structural (you can only use what you were granted), not advisory (permission bits checked at open time). This means:
- No global namespace needed — each process sees only its granted caps
- No path-based access control — the cap IS the access
- No distinction between “file”, “device”, “socket” — everything is a typed capability interface
A traditional VFS would reintroduce ambient authority through the back door. Instead, capOS needs a storage and naming model native to capabilities and Cap’n Proto.
Core Insight: Cap’n Proto Everywhere
Cap’n Proto is already used in capOS for:
- Interface definitions —
.capnpschemas define capability contracts - IPC messages — capability invocations are capnp messages
- Serialization — capnp wire format crosses process boundaries
If we extend this to storage, then:
- Stored objects are capnp messages
- Configuration is capnp structs
- Binary images are capnp-wrapped blobs
- The boot manifest is a capnp message describing the initial capability graph
No format conversion anywhere. The same tools (schema compiler, serializer, validator) work for IPC, storage, config, and network transfer.
Architecture
Three Layers
Target architecture after the manifest executor and process-spawner work:
Boot Image (read-only, baked into ISO)
│
│ capnp-encoded manifest + binaries
│
v
Kernel (creates initial caps from manifest)
│
│ grants caps to init
│
v
Init (builds live capability graph)
│
├──> Filesystem services (FAT, ext4 — wrap BlockDevice as Directory/File)
│
├──> Store service (capability-native content-addressed storage)
│ backed by: virtio-blk, RAM, or network
│
└──> All other services (receive Directory, Store, or Namespace caps)
Layer 1: Boot Image
The boot image (ISO/disk) contains a capnp-encoded system manifest loaded as a Limine module alongside the kernel. The manifest describes:
struct SystemManifest {
# Manifest schema version, validated before other fields
schemaVersion @0 :UInt32;
# Binaries available at boot, keyed by name
binaries @1 :List(NamedBlob);
# Init's config blob: first-process metadata plus service graph
initConfig @2 :CueValue;
# Kernel boot parameters
kernelParams @3 :SystemConfig;
}
struct NamedBlob {
name @0 :Text;
data @1 :Data;
}
struct CueValue {
union {
null @0 :Void;
boolean @1 :Bool;
intValue @2 :Int64;
uintValue @3 :UInt64;
text @4 :Text;
bytes @5 :Data;
list @6 :List(CueValue);
fields @7 :List(CueField);
}
}
struct CueField {
name @0 :Text;
value @1 :CueValue;
}
Capability source identity is already structured in the bootstrap manifest, so source selection does not depend on parsing authority strings:
{
name: "client"
expectedInterfaceId: 0xacf0c15a7b2e0041
source: service: {
service: "endpoint-server"
export: "client"
}
}
Kernel and service source objects inside initConfig select the authority to grant. The
expectedInterfaceId field carries the generated Cap’n Proto interface
TYPE_ID and only checks that the granted object speaks the expected schema.
It cannot replace source identity: many different objects may expose the same
interface while representing different authority.
The build system (Makefile) generates this manifest from a human-authored
description and packs it into the ISO as manifest.bin. Current code embeds
every SystemManifest.binaries entry into that manifest as NamedBlob data,
including the release-built init and smoke-demo ELFs. The kernel now boots only
initConfig.init; focused init-executor manifests expose the manifest to the
separate init binary as a read-only BootPackage capability, while default
shell-led manifests boot capos-shell directly without a BootPackage executor.
Remaining cleanup is to narrow the long-term boot package shape after the
single-init split.
Using a CueValue tree instead of AnyPointer keeps the manifest directly
decodable in no_std userspace without depending on Cap’n Proto reflection.
Transitional Schema Note
ServiceEntry, CapSource::Service, and ServiceEntry.exports are no longer
kernel schema fields. ProcessSpawner, copy/move cap transfer, focused
init-owned generic manifest execution, the default standalone-init service
graph, focused shell-led login smokes, and the 15.4 initConfig schema split
are implemented. The current boot manifest shape is:
struct SystemManifest {
# Manifest schema version, validated before other fields
schemaVersion @0 :UInt32;
# Binaries available at boot, keyed by name
binaries @1 :List(NamedBlob);
# Init's config blob (replaces the service graph)
initConfig @2 :CueValue;
# Kernel boot parameters (serial policy, shell MOTD, feature flags)
kernelParams @3 :SystemConfig;
}
ServiceEntry / CapRef disappeared from the schema and became plain CUE
fields inside initConfig.services. Init reads them at runtime and calls
ProcessSpawner directly. validate_manifest_graph,
validate_bootstrap_cap_sources, and the remaining transitional service-graph
schema are no longer kernel bootstrap checks. They remain in capos-config for
mkmanifest and the focused init executor while that executor still accepts the
transitional service graph. Kernel bootstrap already uses a first-service
cap-table builder rather than the old multi-service resolver. See
docs/proposals/service-architecture-proposal.md — “Legacy Manifest Fields
After Stage 6” for the deprecation plan.
During the current transition, initConfig.init is still per-manifest launch
metadata: it selects the single boot process binary and the kernel-sourced caps
for that process. initConfig.services, cross-service cap sources, exports,
and restart policy are init-owned configuration for focused executor manifests.
Focused harnesses that boot a demo as init keep using that first-process cap
bundle until those smokes are migrated behind a fixed generic init.
Layer 2: Kernel Bootstrap
Target design for the kernel’s boot role:
- Parse the system manifest (read-only capnp message from Limine module).
- Hash the embedded binaries for optional measured-boot attestation.
- Create kernel-provided capabilities:
Console,Timer,DeviceManager,ProcessSpawner,FrameAllocator,VirtualMemory(per-process), and a read-onlyBootPackagecap exposingSystemManifest.binariesandinitConfig. - Spawn init — exactly one userspace process — with that cap bundle.
Current boot has reached the single-init split and the initConfig schema
split. system.cue puts the standalone init binary in initConfig.init for
the default service-graph process; init reads BootPackage and starts the
shell, remote-session CapSet gateway, and resident services from
initConfig.services.
Focused shell-led manifests such as system-smoke.cue still put
capos-shell in initConfig.init for narrow login proofs. Focused
init-executor manifests such as system-spawn.cue also put the separate
init binary in initConfig.init; that binary reads BootPackage and spawns
the focused demo graph from initConfig.services through ProcessSpawner.
The unused kernel resolver has been retired. The remaining cleanup is replacing
per-manifest init bundles with a fixed generic-init bootstrap ABI.
Layer 3: Init and the Live Capability Graph
Target init reads initConfig from the BootPackage cap and executes it:
fn main(caps: CapSet) {
let spawner = caps.get::<ProcessSpawner>("spawner");
let boot = caps.get::<BootPackage>("boot");
let config = boot.init_config()?; // CueValue
// Walk service entries from the config and spawn in dependency order
for entry in config.field("services")?.iter()? {
let binary = boot.binary(entry.field("binary")?.as_str()?)?;
let granted = resolve_caps(entry.field("caps")?, &running_services, &caps);
let handle = spawner.spawn(binary, granted, entry.field("restart")?.into())?;
running_services.insert(entry.field("name")?.as_str()?.into(), handle);
}
supervisor_loop(&running_services);
}
In this target model, init is a generic manifest executor rather than a
hardcoded service graph. The system topology is defined in the boot
package’s initConfig, not in init’s source code. Changing what services
run means rebuilding the boot image with a different config blob, not
recompiling init. Manifest graph resolution stops being a kernel concern.
The current transition uses initConfig.services as the service graph; init
reads the BootPackage manifest, validates a metadata-only
ManifestBootstrapPlan, resolves kernel and service cap sources, records
exported caps, spawns children in manifest order, and waits for their
ProcessHandles.
Two Storage Models
capOS supports two complementary storage models, both exposed as typed capabilities:
Filesystem Capabilities (Directory, File)
For accessing traditional block-based filesystems (FAT, ext4, ISO9660) and
for POSIX compatibility. A filesystem service wraps a BlockDevice and
exports Directory/File capabilities.
BlockDevice (raw sectors)
│
└──> Filesystem service (FAT, ext4, ...)
│
├──> Directory caps (namespace over files)
└──> File caps (read/write byte streams)
This model maps naturally to USB flash drives, NVMe partitions, and
network-mounted filesystems. The open() and sub() operations return new
capabilities via IPC cap transfer (see “IPC and Capability Transfer” below).
Capability-Native Store (Store, Namespace)
For capOS-native data: configuration, service state, content-addressed object
storage. A store service wraps a BlockDevice and exports Store/Namespace
capabilities.
BlockDevice (raw sectors)
│
└──> Store service
│
├──> Store cap (content-addressed put/get/list inventory)
└──> Namespace caps (mutable name→hash mappings)
Content-addressing provides automatic deduplication, verifiable integrity,
and immutable references. Store.list returns the live inventory of content
hashes in that Store, so holders that need crash/reboot recovery can rediscover
stored content without a separate mutable root pointer. Namespaces add mutable
bindings on top when callers need stable names rather than inventory scans.
Bridging the Two Models
The models are composable. An adapter service can bridge between them:
- FsStore adapter: exposes a Directory tree as a content-addressed Store (hash each file’s contents, directory listings become capnp-encoded objects)
- StoreFS adapter: exposes Store/Namespace as a Directory tree (each name maps to a File whose contents are the stored object)
- Import/export: a utility service reads files from a Directory and stores them in a Store, or materializes Store objects as files in a Directory
In both cases the adapter is a userspace service holding caps to both subsystems. No kernel mechanism needed — just capability composition.
File I/O Interfaces
Directory, File, Store, and Namespace caps may be scoped to a user session, guest profile, anonymous request, or service identity, but the cap remains the authority. POSIX ownership metadata is compatibility data inside these services, not a system-wide authorization channel. See User Identity and Policy.
BlockDevice
Raw sector access, served by device drivers (virtio-blk, NVMe, USB mass
storage). The driver receives hardware capabilities (MMIO, IRQ,
FrameAllocator for DMA) and exports a BlockDevice cap.
interface BlockDevice {
readBlocks @0 (startLba :UInt64, count :UInt32) -> (data :Data);
writeBlocks @1 (startLba :UInt64, count :UInt32, data :Data) -> ();
info @2 () -> (blockSize :UInt32, blockCount :UInt64, readOnly :Bool);
flush @3 () -> ();
}
For bulk transfers, readBlocks/writeBlocks accept a SharedBuffer
capability instead of inline Data (see “Shared Memory for Bulk Data”
below). The inline-Data variants work for metadata reads and small
operations; the SharedBuffer variants avoid copies for large I/O.
File
Byte-stream access to a single file. Served by filesystem services. Created
dynamically when a client calls Directory.open() — the filesystem service
creates a File CapObject for the opened file and transfers it to the
caller via IPC cap transfer.
interface File {
read @0 (offset :UInt64, length :UInt32) -> (data :Data);
write @1 (offset :UInt64, data :Data) -> (written :UInt32);
stat @2 () -> (size :UInt64, created :UInt64, modified :UInt64);
truncate @3 (length :UInt64) -> ();
sync @4 () -> ();
close @5 () -> ();
}
close releases the server-side state for this file (open cluster chain
cache, dirty buffers). The kernel-side CapTable entry is removed by the system
transport via CAP_OP_RELEASE when the local holder releases it; capos-rt
owned handles queue local releases on final drop and expose explicit release
flushing for ordinary userspace. CapabilityManager is
management-only (list(), later grant()); it does not expose a drop()
method because ordinary handle lifetime belongs to the transport, not to an
application call on the same table that dispatches it.
Attenuation: a read-only File wraps the original and rejects write,
truncate, sync calls. An append-only File rejects write at offsets
other than the current size.
Directory
Namespace over files on a filesystem. Served by filesystem services.
open() and sub() return new capabilities via IPC cap transfer.
interface Directory {
open @0 (name :Text, flags :UInt32) -> (file :File);
list @1 () -> (entries :List(DirEntry));
mkdir @2 (name :Text) -> (dir :Directory);
remove @3 (name :Text) -> ();
sub @4 (name :Text) -> (dir :Directory);
create @5 (name :Text) -> ();
rename @6 (from :Text, to :Text) -> ();
}
struct DirEntry {
name @0 :Text;
size @1 :UInt64;
isDir @2 :Bool;
}
sub() returns a Directory scoped to a subdirectory — the analog of chroot.
The caller cannot traverse upward or see the parent directory. open() with
create flags creates a new file if it doesn’t exist.
The flags field in open() is a bitmask: CREATE = 1, TRUNCATE = 2,
APPEND = 4. No READ/WRITE flags — those are determined by the
Directory cap’s attenuation (a read-only Directory returns read-only Files).
Writable Directory Mutations and the Single-Writer Policy
create @5 makes a new empty file and rename @6 renames an entry within the
same parent. Both have additive ordinals so the read-only Directory
implementations stay wire-compatible — they simply reject the mutating methods
(mkdir/remove/sub/create/rename) fail-closed, the way a read-only
File rejects write. Unlike open with CREATE, create fails closed if the
name already exists; rename fails closed if the source is absent or the
destination already exists, and does not support cross-directory moves.
The first writable filesystem service adopts a fail-closed single-writer
policy: a writable filesystem tree admits one writer at a time. The first
granted cap to perform a mutation claims the writer slot; a mutation through any
other concurrently granted cap fails closed with a typed Failed exception
("writable filesystem rejects a second concurrent writer (single-writer policy)") rather than racing. There is no lease/release lifecycle — the first
writer keeps the slot — and list/sub reads are allowed for any holder. This
deliberately closes the milestone’s concurrent-writer-policy decision without
expanding scope to advisory locks, lock leases, or multi-writer coordination
(see Open Question 6). The implementation (kernel/src/cap/writable_fs.rs, proof
make run-storage-writable) is now disk-backed: it mounts a CAPOSWF1
sub-volume (a flat node-record array with parent pointers plus a bump-allocated
data region) over the kernel-owned virtio-blk driver, keeps the RAM tree as the
working copy, and write-through-commits every directory/file mutation in the
order data sector → node-record sector → superblock (the ordering commit point),
mirroring the disk-backed Store. The persistent Store CAPOSST1 sub-volume
co-locates on the same disk image (at LBA 0; the filesystem superblock sits at a
fixed higher LBA), so filesystem mutations and store object writes/deletes
survive a reboot together — make run-storage-writable boots QEMU twice against
one combined image and phase 2 verifies every surviving name, size, content,
directory entry, and store object plus the deleted object’s absence.
Unclean-shutdown recovery is proven by make run-storage-writable-recovery. A
slot becomes live on the next mount only once the superblock’s bumped
node_count is observed, so a forced poweroff in the window between a node
record’s durable write and that commit leaves an orphan slot the next mount
ignores: the interrupted allocation is atomically absent, never a torn or
half-live entry. The proof builds the kernel with the proof-only
storage_writable_recovery feature, which arms an induced forced poweroff in
exactly that window (recovery_crash_after_record); pass 1 commits durable
mutations and a Store survivor and then triggers the window (the harness
kill -9s QEMU after the kernel marker), and pass 2 re-mounts and verifies
recovery to a consistent tree with the committed state intact, the interrupted
allocation absent, no torn record, and a usable post-recovery write. The proof
is bounded to that single record-vs-commit window under host-page-cache
durability (the virtio driver negotiates no VIRTIO_BLK_F_FLUSH, and a
kill -9 preserves the host page cache); it proves the superblock-commit
ordering invariant, not a general media crash-consistency guarantee against
host power loss or a lost write-back cache. The co-located CAPOSST1 Store
now has bounded tombstone reclamation through make run-storage-persist; this
does not add a new media power-loss guarantee or reclaim writable-file extents.
Writable File content paths layer onto the same tree. open with the
CREATE/TRUNCATE/APPEND flags (or a write through the returned File)
claims the same filesystem-wide writer slot, so file writes obey the single
writer policy alongside directory mutations; a plain (flags == 0) open and the
read/stat methods are reads allowed for any holder. write @1 overwrites or
extends at the supplied offset, zero-filling any gap; a handle opened APPEND
lands every write at end-of-file regardless of the offset argument. truncate @3
shrinks (discards the tail) or extends (zero-fills) the file, and close @5
releases only that handle — the file survives in the directory until
Directory.remove, which marks the file node so any outstanding File cap fails
closed. File content is bounded by MAX_FILE_BYTES (64 KiB) and persists to a
bump-allocated disk extent on each mutation; a rewrite that outgrows the current
extent allocates a fresh one and leaks the old (file-extent compaction deferred).
Because
each write/truncate already wrote through the block device (the virtio
driver negotiates no VIRTIO_BLK_F_FLUSH, so there is no separate media barrier
to issue), sync @4 succeeds as an honest write-side no-op (a read-only File
still rejects it). Crash consistency rests on the superblock-commit ordering
rather than a media barrier: an interrupted allocation is atomically absent on
remount (proven by make run-storage-writable-recovery, above). A post-write
media-durability flush against a write-back cache (for host power loss, not the
guest-side forced poweroff that proof exercises) remains future hardening, not
claimed here.
Syscall Trace: Reading a File from a FAT USB Drive
Four userspace processes: App, FAT service, USB mass storage, xHCI driver.
With promise pipelining (one submission):
Cap’n Proto promise pipelining lets the App chain dependent calls without waiting for intermediate results. The App submits a single pipelined request: “open this file, then read from the result”:
# Single pipelined submission (SQEs with PIPELINE flag):
# call 0: dir.open("report.pdf") → answer_id=200, user_data=100
# call 1: answer 200 result_cap[0].read(offset=0, len=4096)
cap_submit([
{cap=2, method=OPEN, answer=200, user_data=100, params={"report.pdf", flags=0}},
{cap=PIPELINE(answer=200, result_cap=0), method=READ, user_data=101, params={offset:0, length:4096}},
])
→ kernel routes call 0 to FAT service via Endpoint
→ FAT service reads directory entry from BlockDevice
→ FAT service creates FileCapObject, replies with File cap as result cap 0
→ kernel sees pipelined call 1 targeting the File cap from call 0
→ kernel dispatches call 1 to the same FAT service (or direct-invokes
the new File CapObject if it's a local endpoint)
→ FAT service maps offset → cluster chain → LBA
→ FAT service submits CALL SQE: {cap=blk_cap, method=READ_BLOCKS, params={lba, count}}
→ USB mass storage → xHCI → hardware → back up
← completion: {data: [4096 bytes]}, File cap installed as cap_id=5
One app-to-kernel transition. The kernel resolves the pipeline dependency
internally through the sideband CapTransferResult record at index 0; it does
not inspect the Cap’n Proto result payload. The App never needs a userspace
round trip for the intermediate File cap, though the cap is installed and usable
afterward.
This is a core Cap’n Proto feature: by expressing “call method on the
not-yet-resolved result of another call,” the client avoids a round-trip
for each link in the chain. For deeper chains (e.g., dir.sub("a").sub("b") .open("file").read(0, 4096)), the savings compound — one submission instead
of four sequential syscalls.
The capability-ring version should follow the Cap’n Proto/CapTP prior-art shape captured in Cloudflare, Cap’n Proto, Workers RPC, and Cap’n Web and Spritely, OCapN, and CapTP: pipelined targets live in answer/result-cap namespaces, not in caller-selected global ids; result-cap metadata stays outside the Cap’n Proto payload; broken answers propagate failure to dependent calls; and answer slots, queued dependent calls, queued bytes, and remote references are charged to bounded resource ledgers. This is design grounding, not an OCapN or Cap’n Web wire-compatibility target.
Without pipelining (two sequential ring submissions):
Without promise pipelining, the App submits two separate CALL SQEs via the ring, blocking on each completion before submitting the next:
# 1. Open file (App holds Directory cap, cap_id=2)
# App writes CALL SQE: {cap=2, method=OPEN, params={"report.pdf", flags=0}}
cap_enter(min_complete=1, timeout=MAX)
→ kernel routes CALL to FAT service via Endpoint
→ FAT service reads directory entry from BlockDevice
→ FAT service creates FileCapObject for this file
→ FAT service posts RETURN SQE with [FileCapObject] in xfer_caps
→ kernel installs File cap in App's table → cap_id=5
← App reads CQE: result={file: cap_index=0}, new_caps=[5]
# 2. Read 4096 bytes from offset 0
# App writes CALL SQE: {cap=5, method=READ, params={offset:0, length:4096}}
cap_enter(min_complete=1, timeout=MAX)
→ kernel routes CALL to FAT service
→ FAT service maps offset → cluster chain → LBA
→ FAT service submits CALL SQE: {cap=blk_cap, method=READ_BLOCKS, params={lba, count}}
→ kernel routes to USB mass storage
→ mass storage submits CALL SQE: {cap=usb_cap, method=BULK_TRANSFER, params={scsi_cmd}}
→ kernel routes to xHCI driver
→ xHCI programs TRBs, waits for interrupt
← returns raw sector data
← returns sector data
← FAT service extracts file bytes, posts RETURN SQE with {data: [4096 bytes]}
This works but costs two round-trips where pipelining needs one. The synchronous path is useful for simple cases and bootstrapping; pipelining is the intended steady-state model.
In both cases, the intermediate IPC hops (FAT → USB mass storage → xHCI) are invisible to the App.
Capability-Native Store
The Store Capability
Once the system is running, persistent storage is provided by a userspace service — the store. It’s backed by a block device (virtio-blk), and exposes a content-addressed object store where objects are capnp messages.
interface Store {
# Store a capnp message, returns its content hash
put @0 (data :Data) -> (hash :Data);
# Retrieve by hash
get @1 (hash :Data) -> (data :Data);
# Check existence
has @2 (hash :Data) -> (exists :Bool);
# Delete (if caller has authority — see note below)
delete @3 (hash :Data) -> ();
}
Note on delete: In a content-addressed store, deleting a hash can break
references from other namespaces pointing to the same object. delete on the
base Store interface is dangerously broad — a StoreAdmin interface
(separate from Store) may be more appropriate, with delete restricted to a
GC service that can verify no live references exist. Open Question #3 (GC)
should be resolved before implementing delete. The attenuation table below
lists Store (full) as “Read, write, delete any object” — in practice, most
callers should receive a Store attenuated to put/get/has only.
Content-addressed means:
- Deduplication is automatic (same content = same hash)
- Integrity is verifiable (hash the data, compare)
- References between objects are just hashes embedded in capnp messages
- No mutable paths — “updating a file” means storing a new version and updating the reference
Mutable References: Namespaces
A Namespace capability provides mutable name-to-hash mappings on top of
the immutable store:
interface Namespace {
# Resolve a name to a store hash
resolve @0 (name :Text) -> (hash :Data);
# Bind a name to a hash (if caller has write authority)
bind @1 (name :Text, hash :Data) -> ();
# List names (if caller has list authority)
list @2 () -> (names :List(Text));
# Get a sub-namespace (attenuated — restricted to a prefix)
sub @3 (prefix :Text) -> (ns :Namespace);
}
A Namespace cap scoped to "config/" can only see and modify names under
that prefix. This is the analog of a chroot — but structural, not a kernel
hack. The sub() method returns a new Namespace cap via IPC cap transfer.
Future: union composition. The research survey recommends
extending Namespace with Plan 9-inspired union semantics — a union(other, mode) method that merges two namespaces with before/after/replace ordering.
This adds composability without a global mount table. See
research survey §6.
IPC and Capability Transfer
Several storage operations return new capabilities: Directory.open()
returns a File, Directory.sub() returns a Directory, Namespace.sub()
returns a Namespace. This requires dynamic capability management — the kernel
must install new capabilities in a process’s CapTable at runtime as part of
IPC.
The Capability Ring
All kernel-userspace interaction goes through a shared-memory ring pair (submission queue + completion queue), inspired by io_uring. SQE opcodes map to capnp-rpc Level 1 message types. The ring is allocated per-process at spawn time and mapped into the process’s address space.
Syscall surface: 2 syscalls. New capabilities, operations, and transfer mechanisms are expressed as new SQE opcodes instead of expanding the syscall ABI.
| # | Syscall | Purpose |
|---|---|---|
| 1 | exit(code) | Terminate current thread; process exits after its last live thread |
| 2 | cap_enter(min_complete, timeout_ns) | Process pending SQEs, then wait until enough CQEs exist or the timeout expires |
Writing SQEs is syscall-free, but ordinary capability CALLs make progress
through cap_enter. Timer polling handles non-CALL ring work and only CALL
targets that explicitly opt into interrupt-context dispatch. cap_enter
flushes pending SQEs and can block the process until min_complete
completions are available or a finite timeout expires. An indefinite wait uses
timeout_ns = u64::MAX; timeout_ns = 0 keeps the call non-blocking. A future
SQPOLL-style worker can reintroduce a zero-syscall CALL-completion hot path
without running arbitrary capability methods from timer interrupt context.
The ring structs and synchronous CALL dispatch are implemented and working.
See capos-config/src/ring.rs for the shared ring structs and
kernel/src/cap/ring.rs for kernel-side processing.
Ring Layout
One 4 KiB page per process, mapped into both kernel (HHDM) and user space:
┌─────────────────────────┐ offset 0
│ Ring Header │ SQ/CQ head, tail, mask, flags
├─────────────────────────┤ offset 128
│ SQE Array (16 × 64B) │ submission queue entries
├─────────────────────────┤ offset 1152
│ CQE Array (32 × 32B) │ completion queue entries
└─────────────────────────┘
SQ: userspace owns tail (producer), kernel owns head (consumer)
CQ: kernel owns tail (producer), userspace owns head (consumer)
SQE Opcodes
Five opcodes handle everything — client calls, server dispatch, capability transfer, pipelining, and lifecycle:
| Opcode | capnp-rpc analog | Purpose |
|---|---|---|
CALL | Call | Invoke method on a capability |
RETURN | Return | Respond to incoming call (server side) |
RECV | (implicit) | Wait for incoming calls on Endpoint |
RELEASE | Release | Drop a capability reference |
FINISH | Finish | Release pipeline answer state |
TIMEOUT | — | Post a CQE after N nanoseconds (io_uring-inspired) |
TIMEOUT is an alternative to the timeout_ns argument on cap_enter:
it works with zero-syscall polling (kernel fires the CQE on a timer tick)
and composes with LINK/DRAIN for deadline-based chains.
SQE flags: PIPELINE (cap_id is a promise reference), LINK (chain to
next SQE), MULTISHOT (keep generating CQEs), DRAIN (barrier).
Promise Pipelining
A CALL SQE can target either a concrete CapId or a PromisedAnswer
reference (via the PIPELINE flag + pipeline_dep/pipeline_field fields).
pipeline_dep names the earlier answer and pipeline_field is a zero-based
CapTransferResult record index in that answer’s sideband result-cap list, not
a Cap’n Proto schema field. The kernel resolves the dependency chain internally:
SQE[0]: CALL dir.open("report.pdf") → answer_id=200, user_data=100
SQE[1]: CALL [PIPELINE: dep=200, result_cap=0].read(0, 4096) → user_data=101
One cap_enter call. The kernel dispatches SQE[0], resolves result cap record
0 from the completion sideband, and dispatches SQE[1] against it without
returning to userspace between steps or parsing the result payload.
The Endpoint Kernel Object
For cross-process IPC, an Endpoint connects client-side proxy caps to a server’s receive loop:
Client's CapTable Server's CapTable
┌─────────────────┐ ┌──────────────────┐
│ cap 2: Proxy │ │ cap 0: Endpoint │
│ → endpoint ────────── Endpoint ◄──── RECV SQE ──│ │
│ badge: 42 │ (kernel obj) │ │
└─────────────────┘ └──────────────────┘
The server posts a RECV SQE (with MULTISHOT flag). Incoming calls appear
as CQEs with badge, interface_id, method_id, and a kernel-assigned call_id.
The server responds by posting a RETURN SQE referencing the call_id.
interface_id is the transported schema ID for the interface being invoked.
It should equal the generated TYPE_ID for that capnp interface. cap_id is
the authority-bearing table handle; interface_id is only the protocol tag.
The target capability entry owns one public interface; method_id selects a
method inside that interface, while cap_id identifies the object being
invoked. If the same backing state needs another interface, the transport
should mint a separate capability entry for that interface rather than letting
one handle accept multiple unrelated interface_id values.
Direct-Switch IPC
When a client’s CALL targets a cap served by a blocked server (waiting on RECV), the kernel marks that server as the direct IPC handoff target so the next context-switch path runs the callee before unrelated round-robin work. The current implementation still uses the ordinary saved-context restore path; small-message register transfer remains a future fastpath after measurement. See research survey §2.
Capability Transfer via Ring
Capabilities travel as sideband arrays (CapTransferDescriptor) alongside capnp
message bytes:
- CALL params: params buffer contains the capnp message bytes followed by
xfer_cap_counttransfer descriptors packed ataddr + len, which must be aligned toCAP_TRANSFER_DESCRIPTOR_ALIGNMENT. - RETURN results: server result buffers carry the capnp reply bytes and may
carry return transfer descriptors on
addr + len; the kernel inserts destination capability records in the caller’s result buffer after the normal result bytes. Count is reported in CQEcap_countand those records are written asCapTransferResult { cap_id, interface_id }values atresult_addr + result. The requested result buffer (result_len) must be large enough for both normal reply bytes and all appendedcap_countrecords.
xfer_cap_count > 0 with malformed descriptor metadata (bad mode bits, reserved
bits, _reserved0, or misalignment) fails closed as
CAP_ERR_INVALID_TRANSFER_DESCRIPTOR. Kernels that have not yet enabled transfer
handling should return CAP_ERR_TRANSFER_NOT_SUPPORTED for transfer-bearing SQEs.
The capnp wire format’s WirePointerKind::Other encodes capability indices
in messages. The sideband arrays map these indices to actual CapIds. The
kernel does not parse capnp messages — it transfers a list of caps alongside
the opaque message bytes.
Dynamic Capability Management
Every open(), sub(), or resolve() creates and transfers a new
capability at runtime. The kernel’s CapTable insert() and remove() are
the primitives. Capabilities flow through RETURN SQE sideband arrays (and
through the manifest at boot). No separate cap_grant mechanism needed —
authority flow follows the ring’s IPC graph.
The CapTable generation counter handles stale references: when a File cap is
closed (slot freed, generation bumps), any cached CapId returns
StaleGeneration instead of accidentally hitting a new occupant.
Shared Memory for Bulk Data
Copying file data through capnp Data fields works for metadata and small
reads, but is impractical for anything above a few KB. A 1 MB read through
a capability CALL copies data four times: device → driver heap → capnp
message → kernel buffer → client buffer.
SharedBuffer Capability
SharedBuffer is the service-facing name this proposal uses for bulk-transfer
buffers. The implemented kernel/user substrate is MemoryObject: a capability
backed by physical pages that can be mapped into multiple address spaces
simultaneously. Zero copies between processes.
interface MemoryObject {
# Size and page count of the backing object.
info @0 () -> (pageCount :UInt32, sizeBytes :UInt64);
# Map a page-aligned object range into the caller's address space.
map @1 (hint :UInt64, offset :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
# Unmap a caller-local borrowed mapping backed by this object.
unmap @2 (addr :UInt64, size :UInt64) -> ();
# Update caller-local page permissions for a borrowed mapping.
protect @3 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
}
The kernel creates MemoryObjects through the existing FrameAllocator
capability. Held MemoryObject caps charge the holder’s frame-grant quota; mapped
address-space pages are tracked as borrowed pages and keep the same backing
alive until unmapped or process teardown. A later SharedBuffer alias or
allocator may wrap this ABI for storage/network interfaces, but current code
should use MemoryObject directly.
File I/O with SharedBuffer
File and BlockDevice interfaces support both inline-Data and SharedBuffer modes:
# Small read (< ~4 KB): inline in capnp message
file.read(offset=0, length=256) → {data: [256 bytes]}
# Large read: caller provides SharedBuffer, server fills it
let buf = frame_alloc.allocContiguous(256); # 1 MB MemoryObject / SharedBuffer
file.readBuf(offset=0, buf, length=1048576) → {bytesRead: 1048576}
# Data is now in buf's mapped pages — no copy through kernel
Extended File interface with SharedBuffer support:
interface File {
read @0 (offset :UInt64, length :UInt32) -> (data :Data);
write @1 (offset :UInt64, data :Data) -> (written :UInt32);
readBuf @2 (offset :UInt64, buffer :SharedBuffer, length :UInt32) -> (bytesRead :UInt32);
writeBuf @3 (offset :UInt64, buffer :SharedBuffer, length :UInt32) -> (written :UInt32);
stat @4 () -> (size :UInt64, created :UInt64, modified :UInt64);
truncate @5 (length :UInt64) -> ();
sync @6 () -> ();
close @7 () -> ();
}
The readBuf/writeBuf methods accept a SharedBuffer cap, currently a
MemoryObject cap transferred via IPC. The server maps the buffer, performs DMA
or memory copies into it, then returns. The caller reads directly from the
mapped pages.
For BlockDevice, the same pattern applies — the driver maps the SharedBuffer, programs DMA descriptors pointing to its physical pages, and the device writes directly into the shared memory.
When to Use Each Mode
| Scenario | Mechanism | Why |
|---|---|---|
| Reading a 64-byte config value | File.read() inline Data | Copy overhead negligible |
| Reading a 10 MB binary | File.readBuf() SharedBuffer | Avoids 4× copy overhead |
| FAT directory entry (32 bytes) | BlockDevice.readBlocks() inline | Small metadata read |
| Streaming video frames | File.readBuf() + ring of SharedBuffers | Continuous zero-copy |
| Network packet buffers | SharedBuffer ring between NIC driver and net stack | DMA-capable pages |
Attenuation
Storage services mint restricted capabilities using wrapper CapObjects:
| Capability | Authority |
|---|---|
Directory (full) | Open, list, mkdir, remove, sub |
Directory (read-only) | Open (returns read-only Files), list, sub only |
File (full) | Read, write, truncate, sync |
File (read-only) | Read and stat only |
File (append-only) | Read, stat, write at end only |
Store (full) | Read, write, delete any object |
Store (read-only) | Get and has only |
Namespace (full) | Resolve, bind, list under prefix |
Namespace (read-only) | Resolve and list only |
Blob (single object) | Read one specific hash |
SharedBuffer (read-only) | Map as read-only (page table: R, no W) |
An application that only needs to read its config gets a read-only
Directory scoped to its config path. It can’t write, can’t see other
apps’ directories, can’t access the raw BlockDevice.
Naming Without Paths
Traditional OS: process opens /var/lib/myapp/data.db — a global path.
capOS: process receives a Directory or Namespace cap at spawn time,
opens "data.db" within it. The process has no idea where on disk this
lives. It can’t traverse upward. There is no global root.
# Traditional: global path namespace
/
├── etc/
│ └── myapp/
│ └── config.toml
├── var/
│ └── lib/
│ └── myapp/
│ └── data.db
└── sbin/
└── myapp
# capOS: per-process capability set (no global namespace)
Process "myapp" sees:
"config" → Directory(read-only, scoped to myapp's config files)
"data" → Directory(read-write, scoped to myapp's data files)
"state" → Namespace(read-write, scoped to myapp's store objects)
"log" → Console cap
"api" → HttpEndpoint cap
The process doesn’t know or care about the backing storage layout. It just uses the capabilities it was granted.
Configuration
Build-Time Config (Boot Manifest)
The system manifest is authored at build time. The human-writable source
could be any format — TOML, CUE, or even a Makefile target that generates
the capnp binary. What matters is that it compiles to a SystemManifest
capnp message baked into the ISO.
Example source (TOML, compiled to capnp by a build tool):
[services.virtio-net]
binary = "virtio-net"
restart = "always"
caps = [
{ name = "device_mmio", source = { kernel = "device_mmio" } },
{ name = "interrupt", source = { kernel = "interrupt" } },
{ name = "log", source = { kernel = "console" } },
]
exports = ["nic"]
[services.net-stack]
binary = "net-stack"
restart = "always"
caps = [
{ name = "nic", source = { service = { service = "virtio-net", export = "nic" } } },
{ name = "timer", source = { kernel = "timer" } },
{ name = "log", source = { kernel = "console" } },
]
exports = ["net"]
[services.fat-fs]
binary = "fat-fs"
restart = "always"
caps = [
{ name = "blk", source = { service = { service = "usb-storage", export = "block-device" } } },
{ name = "log", source = { kernel = "console" } },
]
exports = ["root-dir"]
[services.my-app]
binary = "my-app"
restart = "on-failure"
caps = [
{ name = "api", source = { service = { service = "http-service", export = "api" } } },
{ name = "docs", source = { service = { service = "fat-fs", export = "root-dir" } } },
{ name = "data", source = { service = { service = "store", export = "namespace" } } },
{ name = "log", source = { kernel = "console" } },
]
A build tool validates this against the capnp schemas (does virtio-net
actually export "nic"? does http-service support endpoint() minting?)
and produces the binary manifest.
Runtime Config (via Store)
Once the store service is running, configuration can be stored there and updated without rebuilding the ISO. The store is just another capability — a config-management service could watch for changes and signal services to reload.
Connection to Network Transparency
If capabilities are the only abstraction, and capnp is the only wire format, then the transport is irrelevant:
- Local IPC: capnp message copied between address spaces by kernel
- Local store: capnp message written to block device
- Remote IPC: capnp message sent over TCP to another machine
- Remote store: capnp message fetched from a remote store service
A capability reference doesn’t encode where the backing service lives. The kernel (or a proxy) handles routing. This means:
- A
Directorycap could be backed by local FAT or a remote 9P server - A
Namespacecap could be backed by local storage or a remote store - A
Fetchcap could route through a local HTTP service or a remote proxy - A
ProcessSpawnercap could spawn locally or on a remote machine
The system manifest could describe services that run on different machines, and the capability graph spans the network. This is the “network transparency” item in the roadmap — it falls out naturally from the model.
Persistence of the Capability Graph
The live capability graph (which process holds which caps) is ephemeral — it exists in kernel memory and is lost on reboot. The system manifest describes the intended graph, and init rebuilds it on each boot.
For true persistence (resume after reboot without re-initializing):
- Each service serializes its state to the store before shutdown
- On next boot, the manifest includes “restore from store hash X” hints
- Services read their saved state from the store and resume
This is application-level persistence, not kernel-level. The kernel doesn’t snapshot the capability graph — services are responsible for their own state. This avoids the complexity of EROS-style transparent persistence while still allowing stateful services.
Managed Cloud Backing
The local Store/Namespace interfaces define capOS persistence semantics. A
cloud backend must be an adapter behind those interfaces, not a new ambient
authority path. Services such as the adventure profile, expedition, and ledger
services should serialize bounded Cap’n Proto records to a store capability; the
caller should not know whether that store is backed by RAM, local disk, or a
managed cloud service.
For cloud-first application data, use a narrow bridge service:
capOS service -> Store/Namespace or app-specific SaveStore cap -> Cloud bridge
-> provider APIs
The bridge owns provider credentials and exposes only typed save/load/append operations. Ordinary clients never receive provider credentials, bucket names, database document paths, or broad write authority.
Recommended GCP mapping for game/profile style state:
- Firestore Native mode for small mutable indexes and profile summaries that need transactional compare-and-set behavior.
- Cloud Storage for larger immutable snapshots, evidence blobs, exports, and content-addressed objects. Object versioning and lifecycle policy should bound accidental overwrite recovery and storage growth.
- Cloud Run for a small HTTPS or capnp-over-HTTP bridge endpoint when capOS cannot yet link provider SDKs directly.
- Secret Manager for bridge-side service credentials and rotation; secrets do not enter ordinary capOS game clients.
Provider-specific records must still carry capOS-level schema version, content hash or release id, profile/tenant id, monotonic version, size limit, and migration policy. Writes that race on the same mutable profile or checkpoint must use an explicit version precondition and fail closed when stale. Append-only ledgers should append new records with previous-record hashes rather than rewriting history. Local QEMU tests should use a fake cloud bridge that enforces the same stale-write, append-only, wrong-profile, and size-bound rules before any real provider integration is accepted.
User-Owned Browser Transport
Some user data should be portable without giving the capOS service operator a database role over it. For private player backup/sync, a browser can act as the transport to user-owned storage:
capOS save service -> encrypted save capsule -> browser
browser OAuth/Firebase session -> Google Drive appDataFolder or Firebase user doc
This is not the same as the managed cloud bridge above. In the browser-transport
model, the user grants Drive/Firebase access to the web app, the browser writes
opaque encrypted capsules, and capOS never receives the provider tokens. The
encryption key follows the storage domain: local capOS storage uses local
capOS-host key material, while GCP-backed game-world state uses Cloud KMS
envelope encryption: a per-world or per-shard KMS KEK wraps service-owned DEKs.
Google Drive’s appDataFolder is a good fit for app-private backup files
because it is hidden from ordinary Drive views and can use the narrow
drive.appdata scope. Firebase/Firestore can also carry per-user encrypted
capsule documents and provide offline cache/sync behavior, but the backend
cannot validate encrypted game semantics beyond metadata and access rules.
Treat user-owned blobs as backup material, not authority:
- The service validates signatures, profile id, content hash, schema version, monotonic version, previous hash, and size bounds before import.
- Append-only ledgers, reward witness records, market receipts, and multiplayer outcomes remain service-owned or cloud-bridge-owned authoritative records.
- A user may delete, duplicate, or roll back private blobs; restore code must handle that as an expected input, not as trusted history.
- Game-world key capabilities, DEKs, and KMS decrypt/unwrap grants should not be exposed to the browser. For GCP-backed worlds, DEK unwrap and plaintext use are KMS/IAM-backed authority granted to the relevant game-world service. For local capOS storage, local key backup/recovery is a separate local-host policy.
For GCP-backed game-world state, provision one Cloud KMS key ring and symmetric
CryptoKey KEK per world instance or shard. This follows the CloudKmsKeySource
envelope model from the cryptography/key-management and volume-encryption
proposals: Cloud KMS wraps or unwraps DEKs, and the game-world service uses the
unwrapped DEK internally as service authority, modeled as a SymmetricKey
capability. Grant Cloud KMS roles at the CryptoKey level where possible:
roles/cloudkms.cryptoKeyEncrypter for encrypt-only writers that wrap new DEKs,
roles/cloudkms.cryptoKeyDecrypter for restore or migration paths that unwrap
existing DEKs, and roles/cloudkms.cryptoKeyEncrypterDecrypter only for the
narrow game-world service that genuinely needs both operations. Do not model
browser OAuth identities, Drive/Firebase handles, or capOS clients as holders of
DEKs or KMS decrypt/unwrap grants, and do not rely on per-key-version IAM for
this design.
Key rotation and world retirement are service operations, not browser-vault features. Rotation creates new Cloud KMS KEK versions for future DEK wrapping but does not re-encrypt existing capsules, rewrite wrapped DEK blobs, or disable/destroy old versions. Managed re-encryption or rewrapping must unwrap the old DEK while its KEK version remains usable, decrypt and validate the capsule inside the game-world service, then write a new capsule with a new DEK or a DEK rewrapped by the current primary KEK version. Old KEK versions should only be disabled or destroyed after inventory proves no accepted wrapped DEK depends on them. Retiring a world removes IAM decrypt authority first; disabling key versions can make protected capsules inaccessible, while destruction is delayed by the scheduled destruction period and irreversible once complete, so audit retention and recovery must be settled before destruction.
Phases
Phase 1: Boot Manifest (parallel with Stage 4)
- Define
SystemManifestschema inschema/ - Build tool (
tools/mkmanifest) that compilessystem.cueinto a capnp-encoded manifest and packs it into the ISO as a Limine module - Kernel parses the manifest and now creates only the
initConfig.initprocess - Focused init-executor manifests pass the manifest to the separate
initbinary as bytes through the read-only BootPackage capability - The separate
initbinary is a generic manifest executor for the defaultsystem.cuepath and focused init-executor smokes; focused shell-led smokes still usecapos-shellasinitConfig.init - No persistent storage yet — boot image is the only data source
Phase 2: File I/O Interfaces in Schema (parallel with Stage 6)
Depends on: IPC (Stage 6) for cross-process cap transfer. Endpoint, RECV,
RETURN, capability transfer in CALL params, and capability transfer in RETURN
results are already implemented. The BlockDevice / File / Directory /
DirEntry / Store / Namespace schema has now landed in full. The
File / Directory / Store / Namespace interfaces also have RAM-backed
kernel CapObject implementations (Phase 3 slices 1-3); BlockDevice remains
schema-only. Userspace services that export Directory / File / Store /
Namespace caps over a real backing store have since landed (Phase 3 below),
and the kernel RAM-backed caps are now qemu-only proof/fixture surface rather
than a production persistence service – see
Kernel Storage Cap Backers Are Fixtures.
That history shaped two named downstream adapters:
- POSIX adapter Phase P1.4 (vendored
dashport) does not require the userspace service for its v0 smoke: the bootstrap-granted RAM-backedDirectory+Namespacekernel caps from Phase 3 slices 1-3 are an adequate read-only in-rodata pseudo-fs backing, so P1.4 is now ready to start on the userspacelibcapos-posixfile/dir/stdio/env/printf surface and on dash vendoring; see POSIX Adapter Phase P1.4 anddocs/backlog/posix-adapter-dash-port.md. P1.3 (pipe + recording ProcessSpawner-driven fork-for-exec) landed without storage caps, so P1.4 is the next surface that consumes this proposal. - WASI host adapter Phase W.5 (Preview 1 filesystem) similarly consumes the same kernel cap shape and is unblocked from the same cap-surface perspective; remaining W.5 work is on the wasi-host adapter side. See WASI Host Adapter Phase W.5.
Concrete work:
- Add
BlockDevice,File,Directory, andDirEntrytoschema/capos.capnp, regenerate the checked-in capnp bindings, add theBLOCKDEVICE_INTERFACE_ID/FILE_INTERFACE_ID/DIRECTORY_INTERFACE_IDconstants, and add acapos-confighost roundtrip test. This was schema-only when it landed; kernelCapObjectimplementations followed in Phase 3 slices 1-3 (theStore/Namespaceinterfaces were added in slice 3).SharedBufferis not a separate interface – bulk transfers reuse the existingMemoryObjectcapability, and the inline-Dataread/write/readBlocks/writeBlocksvariants are the v0 surface. - Demo: two-process file server (in-memory File/Directory service + client) that the POSIX and WASI adapters can resolve preopens against
Phase 3: RAM-backed Store (after Phase 2)
Depends on: IPC (Stage 6) for cross-process store access. Same downstream
blockers as Phase 2 – the POSIX adapter v0 plan resolves /etc / /lib
under a read-only Namespace once this lands.
Concrete work:
- Slice 1: minimal RAM-backed
FileCapObject(kernel/src/cap/file.rs).FileCapis backed by a single in-kernelVec<u8>byte buffer and implements the inline-Datasurface of the landedFileinterface –read/write/stat/truncate/sync/close– with per-call payloads bounded at 64 KiB.close()invalidates the cap: the cap-tableget_slotpath consultsvalidate_live()(which returnsRevokedonce closed), and an in-call()guard is the defense-in-depth backup, so a post-close call fails closed with an application exception. A newKernelCapSource::filegrant source lets a manifest grant the cap; themake run-file-server-smokeQEMU smoke (demos/file-server-smoke/,system-file-server-smoke.cue) drives write/read/stat/close round-trips and asserts the closed-cap rejection. Bulk-buffer /MemoryObject-mapped variants are later slices. - Slice 2: minimal RAM-backed
DirectoryCapObject(kernel/src/cap/directory.rs).DirectoryCapis an in-memory namespace (BTreeMap<String, DirectoryEntry>, where each entry is aFileCapor a sub-DirectoryCap) implementing the landedDirectoryinterface –open/list/mkdir/remove/sub.open/mkdir/submint aFile/Directoryresult capability through the existing IPC result-cap transfer machinery (no new transfer authority); file read/write goes through the transferredFilecaps, never through theDirectory.removedeletes an entry andrevoke()s the backing object so every cap already handed out for it fails closed on its next dispatch, and refuses a non-empty sub-directory;close()invalidates the cap and recursively revokes the subtree.sub()has no attenuation beyond the structural scoping every sub-Directoryalready has – per-method read-only attenuation is deferred. A newKernelCapSource::directorygrant source lets a manifest grant the cap; themake run-directory-server-smokeQEMU smoke (demos/directory-server-smoke/,system-directory-server-smoke.cue) drives open/list/mkdir/remove/sub with cap transfer and asserts the post-remove fail-closed rejection. - Slice 3:
StoreandNamespaceinterfaces inschema/capos.capnpplus minimal RAM-backedStore/NamespacekernelCapObjects (kernel/src/cap/store.rs,kernel/src/cap/namespace.rs). The schema additions are purely additive (Store/Namespaceinterfaces and thestore @34/namespace @35KernelCapSourceordinals); theSTORE_INTERFACE_ID/NAMESPACE_INTERFACE_IDconstants and acapos-confighost roundtrip test landed alongside.StoreCapis a content-addressed blob store (BTreeMap<[u8; 32], Vec<u8>>keyed by the SHA-256 content hash fromcapos_lib::content_hash) implementingput/get/has/delete;putis idempotent for identical content, blob and count bounds keep oneStorefrom ballooning the kernel heap, anddeleteis kept on the base interface for this focused proof (theStoreAdminsplit and a GC-verified delete remain deferred – see thedeletenote above).NamespaceCapis a name->hash binding map (BTreeMap<String, Vec<u8>>for bindings plus aBTreeMap<String, Arc<NamespaceCap>>ofsubchildren) implementingresolve/bind/list/sub;bindoverwrites an existing name (mutable references are the point),sub(prefix)mints a structurally scoped child node and transfers it through the existing IPC result-cap machinery (no new transfer authority, idempotent for a repeated prefix), and the parent->child recursiverevoke()reuses the same finite-tree lock-ordering invariantDirectoryCapdocuments. The bindings are opaque hash bytes – aNamespaceCapdoes not hold aStoreCapreference or verify the hash names a live blob in this slice. NewKernelCapSource::store/KernelCapSource::namespacegrant sources let a manifest grant the caps; themake run-store-namespace-smokeQEMU smoke (demos/store-namespace-smoke/,system-store-namespace-smoke.cue) drivesStoreput/has/get/delete andNamespacebind/resolve/list/sub with cap transfer and asserts two fail-closed rejections (aStore.getof an unknown hash and aNamespace.resolveof an unbound name). - Implement
Storeas a userspace service over an exportedEndpoint, moving it out of the kernel data path: a two-process provider->consumer demo (demos/store-service/,system-userspace-store-smoke.cue,make run-userspace-store-smoke) servesput/get/has/deletefrom an in-RAMBTreeMap<[u8;32], Vec<u8>>– no kernelStorecap in the data path. It mirrors the kernelStoreCapblob-count bound and publishes a narrower 4 KiB service-specific inline blob limit because the endpoint-framed request must fit in the service receive buffer; the smoke proves the largest accepted inline blob and the first rejected over-limit blob. The client uses the stockcapos-rtStoreClientover the service endpoint relabelled toSTORE_INTERFACE_IDvia the manifestexpectedInterfaceId. Still RAM, not yet a real store. - Implement a persistent
Store+Namespaceuserspace service backed by a grantedBlockDevice, moving the durable serve boundary out of the kernel: a three-process demo (demos/storage-persist-service/,system-storage-persist-service.cue,make run-storage-persist-service) servesStore(put/get/has/delete/list) andNamespace(resolve/bind/list/sub) from a single service that owns the on-diskCAPOSUS1whole-state snapshot over a virtio-blkBlockDevice– no kernelStore/Namespacecap in the data path. The snapshot stores content-addressed blob bytes (keys recomputed and re-verified on load) and name->hash bindings; a superblock names the live snapshot length, its content hash, and a monotonic generation, and every mutation writes the new payload fully into the standby of two alternating A/B payload regions (selected by generation parity) and FLUSHes it before the single-sector superblock write flips the generation, so the previously committed snapshot survives a crash at any write boundary.Namespace.subreturns a scopedNamespacecap by pre-minting a bounded pool ofNamespace-typed service-object facets of the service’s own namespace endpoint (each a distinct receiver cookie, minted through a spawned sub-helper) and transferring one through the IPC result-cap path; scoped calls route back to the same endpoint by cookie. The client reaches both interfaces through manifest-granted service caps relabelled toSTORE_INTERFACE_ID/NAMESPACE_INTERFACE_ID, and the two-bootmake run-storage-persist-serviceproves the marker and note objects and their bindings survive a reboot (the service reloads them before the second boot writes anything) even after the harness garbages the standby payload region between the boots, simulating a commit interrupted mid payload write (torn-commit recovery proof). - Serve the result-cap-returning userspace
Directory+Filefilesystem interfaces from userspace: a three-process demo (demos/storage-fs-service/,system-storage-fs-service.cue,make run-userspace-directory-file-smoke) runs a service (the init process) that owns an in-memory filesystem tree and servesDirectory(open/list/mkdir/remove/sub/create/rename) andFile(read/write/stat/truncate/sync/close) over a single endpoint, dispatched by the call’s stamped interface id and receiver-cookie badge – no kernelreadonly_fs/writable_fs/installable_imagecap in the data path.Directory.open(-> File),mkdir/sub(-> Directory) transfer result caps from bounded pools of pre-minted typed service-object facets of the same endpoint (minted through the spawned subhelper, each a distinct cookie). The client reaches the tree through a writable root (aDirectoryclient-endpoint facet) and a read-only root (aDirectoryservice-object facet over the same tree); read-only attenuation is structural – the read-only root and the read-onlyFilehandles it returns fail mutation methods closed by routing on the cookie, not a rights flag. The proof drives the positive surface plus fail-closed cases (closed/staleFilehandle, path traversal via..//, absent paths, read-only mutation, oversize writes). The existing kernel-backed WASI filesystem smoke (make run-wasi-fs) stays green as the explicitly fixture-labeled kernelDirectory/Filepath. The follow-up cleanup retiring the kernel storage cap backers as production routes has landed – see Kernel Storage Cap Backers Are Fixtures below. - Backed by RAM (no disk driver yet, data lost on reboot)
- Backed by a real store (persistent userspace service over
BlockDevice, survives reboot) - Services can store and retrieve capnp objects at runtime
- Demonstrate the naming model with a userspace
Namespaceservice -
Namespace.sub()returns new caps via IPC cap transfer
Kernel Storage Cap Backers Are Fixtures
The kernel Store, Namespace, File, Directory, readOnlyFsRoot,
persistentStore, and writableFsRoot grant sources were the proof paths that
landed the typed storage interfaces. Now that the userspace services above own
the production serve boundary – the RAM Store service
(demos/store-service, make run-userspace-store-smoke), the disk-backed
Store + Namespace service (demos/storage-persist-service,
make run-storage-persist-service), and the Directory + File filesystem
service (demos/storage-fs-service, make run-userspace-directory-file-smoke)
– the kernel backers are explicitly proof/fixture surface, not production
storage routes. Production storage is userspace-served; no production manifest
grants kernel-owned storage state ownership (the default system.cue boot
grants none).
The kernel grant sources are gated accordingly:
- The RAM-backed
file/directory/store/namespacesources are gated behind theqemufeature in both the bootstrap cap-table builder (kernel/src/cap/mod.rs) and theProcessSpawnerspawn-grant path (kernel/src/cap/process_spawner.rs). The default non-qemuproduction kernel fails closed on these sources. They remain available only as the in-RAM pseudo-fs backing for the qemu interface proofs (make run-store-namespace-smoke,make run-file-server-smoke,make run-directory-server-smoke,make run-storage-naming) and for the POSIX/WASI/dash adapter smokes (make run-posix-*,make run-wasi-fs). - The disk-backed virtio
read_only_fs_root/persistent_store/writable_fs_rootsources (kernel/src/cap/readonly_fs.rs,persistent_store.rs,writable_fs.rs) were already gated behindqemu(withstorage_fat_read/cloud_*_over_nvme_proofvariants for the FAT and NVMe proof arms) and fail closed in the default production kernel. They back the storage regression proofsmake run-storage-fs,make run-storage-persist, andmake run-storage-writable(plus the FAT and NVMe proof targets), which stay green as explicitly fixture-labeled kernel paths.
In short: the kernel keeps these backers only as named qemu/cloud-proof fixtures; a default production build has no kernel storage grant route, so the typed storage interfaces are served from userspace.
Phase 4: BlockDevice Drivers and Filesystem (after virtio infrastructure)
- virtio-blk driver (userspace, reuses virtqueue infrastructure from networking smoke test)
BlockDevicetrait implementation- FAT filesystem service: wraps BlockDevice, exports Directory/File caps
- SharedBuffer integration for bulk reads (depends on Stage 6 MemoryObject)
- Store service uses BlockDevice for persistence (the persistent userspace
Store+Namespaceservice above,make run-storage-persist-service) - System state survives reboot via the persistent userspace store
(
make run-storage-persist-service); manifest restore hints remain future work
Phase 5: Network Store (after networking)
- Store service can replicate to or fetch from a remote store
- Capability references transparently span machines
- Directory cap backed by a remote filesystem (9P-style)
- Managed cloud bridges can back selected Store/Namespace or app-specific SaveStore capabilities without changing caller authority. First target: GCP-backed profile/ledger/snapshot storage for the adventure demo, with local fake-cloud tests and no provider credentials in ordinary clients.
- User-owned browser transport can store encrypted save capsules in Google Drive
appDataFolderor Firebase user documents. This is for private backup/sync, not authoritative shared state.
Relationship to Other Proposals
- Networking proposal — the NIC driver and net stack are services described in the manifest, not hardcoded. The store could be backed by network storage once networking works. A remote Directory cap (9P over capnp) reuses the same File/Directory interfaces.
- Service architecture proposal — the manifest replaces code-as-config for init. ProcessSpawner, supervision, and cap export work as described there, but driven by manifest data instead of compiled Rust code. IPC Endpoints are the mechanism for service export.
- Capability model — IPC cap transfer (Endpoint + RETURN SQE) is the
mechanism that makes
open()andresolve()work. SharedBuffer is the bulk data path that makes file I/O practical. Both are tracked indocs/roadmap.mdStage 6. - POSIX Adapter — Phase P1.4 (vendored
dashport) consumes theNamespace+File+Directorycap surface defined here; that surface landed as RAM-backed kernelCapObjects in Phase 3 slices 1-3 and is the v0 backing for the dash smoke’s read-only in-rodata pseudo-fs. P1.3 (recording-shim pipe + fork-for-exec) has already landed without storage caps, so P1.4 is the next adapter consumer. The POSIX path resolver,open/read/write/stat/unlink,/etcand/libpreopen scoping, and the dash port itself all sit on this proposal’s Phase 2/3 schema. - WASI Host Adapter — Phase W.5 (Preview
1 filesystem:
fd_read/fd_write/fd_seek/fd_pread/fd_pwrite/fd_filestat_get/path_open/path_filestat_get/path_unlink_file) consumes the same cap shape and is unblocked from the cap-surface side (Phase 3 slices 1-3 land the RAM-backedDirectory/Namespace/Filecaps). Preopened-dir fds map toNamespacecaps from the manifest;path_openresolves through that namespace’sStore/Filecapability. Phases W.2/W.3/W.4 (stdout, argv-grant,random_get) shipped without storage caps, so W.5 is the next adapter consumer alongside POSIX P1.4. - Userspace Binaries Parts 4 and 5 —
the POSIX adapter (Part 4) and the WASI host adapter (Part 5) both describe
their filesystem stories as translations onto this proposal’s
Namespace/Directory/File/Storesurface. Part 4 sketches theNamespace-rooted POSIX fd table and theNamespace + Store -> file I/Otranslation; Part 5 maps each preopened-dir fd to aNamespacecap. - Adventure game proposal — profile, expedition, ledger, and content persistence use application-level save records through Store/Namespace or an app-specific cloud bridge. The game should not persist by snapshotting a live process or exposing provider credentials to clients.
- Cryptography/key-management and volume-encryption proposals — the
Cloud KMS path uses envelope encryption. KMS wraps DEKs under KEKs; capOS
services use local
SymmetricKeyauthority for plaintext operations.
Open Questions
-
Manifest validation. How much can the build tool verify statically? Cap export names depend on runtime behavior of services. Should services declare their exports in their own metadata (like a package manifest)?
-
Schema evolution. When a service’s capnp interface changes, stored objects referencing the old schema need migration. Cap’n Proto has backwards-compatible schema evolution, but breaking changes need a story.
-
Garbage collection. Content-addressed store accumulates unreferenced objects. Who GCs? A separate service with
Storeread + delete authority? Reference counting in the namespace layer? -
Large objects. Storing multi-megabyte binaries as single capnp
Datafields is wasteful (capnp allocates contiguously). SharedBuffer partially addresses this for I/O, but the Store’sput/getinterface still takesData. Options: chunked storage (Merkle tree of hashes), a streamingBlobinterface, or SharedBuffer-aware Store methods. -
Trust model for the manifest. The boot manifest has full authority to define the system. Who signs it? How do you prevent a tampered ISO from granting excessive caps? Secure boot integration?
-
File locking and concurrent access. Multiple processes opening the same file through the same filesystem service need coordination. Options: mandatory locking in the filesystem service (rejects conflicting opens), advisory locking via a separate Lock capability, or single-writer enforcement at the Directory level (open with exclusive flag).
-
RETURN+RECV atomicity. When a server posts a RETURN SQE followed by a RECV SQE, there must be no window where a client call can arrive but the server isn’t listening. SQE LINK chaining (RETURN → RECV) should provide this atomicity — the kernel processes both SQEs as a unit.
Proposal: Standard App Capabilities (AppData, Powerbox, Attenuated Sharing)
Status: future design. No implementation. This proposal defines three app-facing capability patterns; the
AppDatacap is the nearest-term, self-contained piece, the powerbox and sharing-mint depend on a trusted display path and the attenuation wrappers respectively.
Summary
Google Drive, examined closely, spends a lot of effort reluctantly
re-inventing capabilities on top of an ambient REST API: the drive.file
scope plus the Picker (an app may touch only files the user explicitly hands
it), the appDataFolder space (per-app private storage invisible to the user
and other apps), and the role lattice (reader/writer/…) for sharing.
Each is a workaround for the fact that the base API is ambient-by-default and
gated by OAuth scopes – a category rights bitmask re-checked server-side.
capOS does not have that base problem: there is no ambient authority, no path VFS, and access is narrowed by handing a more-restricted typed capability. So capOS can express Drive’s three good ideas as the native mechanism rather than the exception, and more cleanly:
- AppData – a per-process private storage root, granted at spawn and never duplicated. Isolation is structural (only one holder), not a server scope check keyed to an OAuth client id.
- Powerbox (a
FilePicker/resource-picker broker) – a user-mediated grant where a trusted selector the app cannot script returns a real, fresh, method-narrowed capability for exactly what the user chose. This is whatdrive.file+ Picker is trying to be. - Attenuated sharing – “share read-only” means handing a
Filewrapper that lackswrite; escalation is impossible by construction, not by per-request ACL evaluation.
The goal is to make application development both simpler (apps ask for a
private scratch space or a user-picked file instead of negotiating a global
namespace and scope ladder) and more secure (least authority by default,
enforced structurally). These caps are backend-independent: they sit unchanged
in front of RAM, local disk, and a future Google Drive backend
(docs/proposals/drive-storage-backend-proposal.md).
What capOS already has (build on, do not reinvent)
- Storage caps
Store/Namespace/Directory/Fileexist inschema/capos.capnpand as RAM-backed kernelCapObjects.Directory.sub()/Namespace.sub(prefix)already return structurally-scoped child caps that cannot traverse upward (the chroot analog). Seedocs/proposals/storage-and-naming-proposal.mdandkernel/src/cap/. - An attenuation table is already designed in
storage-and-naming-proposal.md(read-only / append-onlyFile, read-onlyDirectory/Store/Namespacewrappers) but is not yet implemented – currentsub()has structural scoping with no per-method attenuation. This proposal’s sharing pattern depends on landing those wrappers. authority_broker(kernel/src/cap/authority_broker.rs) is already a decision point that mints a bundle of capabilities for a session based on itsSessionContextprincipal/profile (the login ->shellBundle/remoteClientBundleflow). It is the proto-powerbox; the powerbox below generalizes it from session-establishment-time to a per-request, user-confirmed grant.session_context(kernel/src/session_context.rs) binds one immutable identity per process. AnAppDataroot and a powerbox grant can both key onSessionContext.principal_id, exactly asauthority_brokeralready does.- Manifests grant exactly the caps an app receives, with a grant
mode(Raw/ClientEndpoint/Move/ServiceObject). Per-app scoping today is “the manifest grants asub()-scopedDirectory.”
The genuinely new surface is: a per-app AppData interface, a per-request
powerbox/file-picker mechanism (the term “powerbox” is currently unused in the
repo), and a service that mints attenuated caps for sharing to another
principal.
Design lessons from Google Drive
| Drive concept | What it really is | capOS pattern |
|---|---|---|
appDataFolder + drive.appdata scope | Per-app hidden storage, server-scope-gated | AppData cap: one holder, structural isolation |
drive.file + Picker | User-mediated per-file grant (ACL expansion) | Powerbox broker mints/returns a per-object cap |
OAuth scope ladder (drive vs drive.readonly) | Category rights bitmask on a principal | (rejected) method-narrowed wrapper caps |
Roles (reader..owner) | ACL lattice entries, re-checked per request | Attenuated wrapper caps (subset of methods) |
expirationTime permission | Server-enforced time-boxed ACL entry | Revocation/expiry membrane held by the grantor |
anyone / link sharing | Bearer grant (authority = possession) | Bearer cap – deliberately flagged, audited |
| Shortcut (pointer file) | Reference to a target id | Namespace name -> cap binding |
Revisions / keepForever | Per-file version list | Content-addressed Store blobs + mutable pointer + GC pin |
The recurring lesson: Drive’s least-privilege features are the ones where it
was forced to approximate object capabilities (drive.file, Picker,
appDataFolder); its scope ladder and server-side ACL are the ambient base it
is working around. capOS should adopt the former natively and not import the
latter.
1. AppData – per-app private storage
Every process can be granted, at spawn, a private storage root that no other principal holds a copy of. Isolation requires no policy check: the cap is simply never handed to anyone else.
interface AppData {
open @0 (name :Text) -> (file :File); # create-or-open within this app's root
list @1 () -> (entries :List(Text));
remove @2 (name :Text) -> ();
}
- Backing: an
AppDatacap is a thin role over aDirectory(orNamespace) scoped to the app – in the simplest form a manifest-grantedDirectory.sub("<app>"). It can be backed by RAM today, local disk later, or the DriveappDataFolderspace (see the backend proposal). - Isolation vs Drive: Drive enforces appData isolation with a server-side scope check keyed to the OAuth client id (ambient identity gating a shared namespace). capOS hands each process a private cap and never duplicates it – cross-app leakage is not possible, not merely disallowed.
- Quota: attach a storage budget to the cap (per the
resource-accounting-proposal.mdledger model) instead of charging a global per-user pool. This is a deliberate divergence from Drive’s unified per-human quota (see Non-Goals). - Lifecycle: the root and its storage are reclaimed when the principal is
destroyed – the cap analog of Drive deleting
appDataFolderon app uninstall.
AppData is the nearest-term piece: over RAM it is a small userspace service
plus a manifest grant, with no dependency on the powerbox or the attenuation
wrappers.
2. Powerbox – user-mediated capability grants
A powerbox is a trusted broker that, on an app’s request, presents the user
a selector the requesting app cannot script or read through, and on the user’s
confirmation mints and returns a fresh capability for exactly the chosen
object – optionally method-narrowed. It generalizes authority_broker from
“mint a bundle at login” to “mint one cap per user gesture.”
interface FilePicker {
pickFile @0 (mode :AccessMode) -> (file :File);
pickFiles @1 (mode :AccessMode) -> (files :List(File));
pickDir @2 (mode :AccessMode) -> (dir :Directory);
}
enum AccessMode { readOnly @0; readWrite @1; }
- Why better than
drive.file+ scope: the returnedFileis a real handle scoped to one object, narrowed at mint time (nodrive.readonlystring), revocable locally by dropping it, with no “+ files the app created” fuzzy second clause and no server ACL round-trip. The user gesture is the grant. - Prior art: this is the Genode “parent routes the session request
according to policy” pattern (
docs/research/genode.md§Session Routing) and the Sculpt/nitpicker user-mediated resource model. capOS’sauthority_brokeris the analog of Genode parent-routing; the powerbox is its per-request generalization. - Hard prerequisite – trusted display: a powerbox is only as trustworthy as the path that shows the selector. The user must be able to trust that the selector UI is the system’s, not a spoof drawn by the requesting app. capOS does not yet have a multiplexed trusted-display primitive (the nitpicker analog); today the trusted surface is the shell/session/terminal. The file-picker powerbox therefore depends on either (a) a text-mode trusted selector hosted by the session/shell, or (b) a future trusted display service. This is the powerbox’s gating dependency and is called out as an open question.
- The powerbox is not file-specific in principle – the same broker shape
can mediate user-confirmed grants of other resource caps. This proposal
scopes the first instance to file/dir/storage selection; a general
Powerboxis future work.
3. Attenuated sharing – wrapper caps + revocation membrane
Sharing is delegating a capability, optionally narrowed to a smaller interface, optionally through a revoker.
interface File {
# ... existing read/write/stat/...
shareAs @N (role :ShareRole, expiresAt :UInt64)
-> (handle :File, revoke :Revoker);
}
enum ShareRole { reader @0; commenter @1; writer @2; }
interface Revoker { revoke @0 () -> (); }
- Roles as method subsets: a
readeris aFilecap exposing only the read-side methods (todayread/stat);writeradditionally exposeswrite/truncate. Escalation is impossible because the method literally is not on the object the grantee holds – not because an ACL is re-evaluated. This is a monotone lattice expressed structurally. (Acommenterrole, as in theShareRoleenum below, implies a comment surface the currentFileinterface –read/write/stat/truncate/sync/close– does not yet have; it is illustrative of the lattice, not of an existing method.) - Depends on the attenuation wrappers already designed in
storage-and-naming-proposal.mdbut not yet implemented. Landing those read-only/narrowedFile/Directorywrappers is the prerequisite forshareAs. - Clawback is the one place capabilities are weaker than Drive’s mutable
ACL: a handed-out cap cannot be unilaterally downgraded later.
shareAstherefore mints the shared handle through a revocation membrane and returns theRevokerto the grantor, so “un-share later” andexpiresAtare supported – at the cost of an interposed membrane and a trusted clock on the sharing path. - Shared directories / group ownership (Drive shared drives) map to a
group-owned
Directorywith per-member role wrappers; deferred to future work.
Uniformity across storage backends
All three caps are defined over the existing typed storage interfaces, so they
are identical whether the backing is RAM, local disk
(docs/proposals/storage-and-naming-proposal.md), or Google Drive
(docs/proposals/drive-storage-backend-proposal.md). An app that uses an
AppData cap and a FilePicker does not know or care which backend serves it.
This is the same backend-agnosticism the storage proposal already states for
Store (“backed by virtio-blk, RAM, or network”).
Honest mismatches and non-goals
- Bearer / link sharing (
anyone): capabilities are bearer tokens, so link-sharing maps “cleanly” – which is exactly the risk. It drops user mediation entirely (anyone with the bytes has access). Treat it as a deliberately-flagged, audited exception, never a default; prefer a powerbox grant orshareAsto a named principal. - Clawback / instant global revoke: Drive’s owner can demote any grantee at any time via the central ACL. capOS gets this only where caps were minted through a revoker; there is no zero-cost equivalent of Drive’s org-wide instant revoke for already-forwarded caps.
- Unified human quota: Drive charges one per-user quota across spaces. capOS uses per-cap budgets; reconciling “this app’s AppData counts against the human’s storage” is a policy question with no clean cap answer. Per-cap budgets are the default; a unified human-facing view is out of scope.
- Scope tiering is administrative, not technical: Drive’s restricted-scope verification is a business/review gate, not a security mechanism. It has no capability analog and is explicitly not imitated; structural narrowing replaces it.
- Trusted display is a real gap: without it, the powerbox selector can be spoofed by the requesting app. This proposal does not deliver a trusted display; it depends on one (open question below).
Relationship to existing proposals
storage-and-naming-proposal.md– owns the storage caps, the attenuation table this proposal’s sharing depends on, and the existing “Managed Cloud Backing” / “User-Owned Browser Transport” sections. A small reconciling update there should cross-referenceAppDataand the powerbox; this proposal is the standalone home for the three patterns.userspace-authority-broker-proposal.md– proposes moving broker policy into init-owned userspace; the powerbox should live wherever the broker lands.oidc-and-oauth2-proposal.md– the OAuth consent screen is itself a powerbox grant; the patterns are consistent.docs/research/{genode,plan9-inferno,eros-capros-coyotos}.md– Genode parent-routing/powerbox, Plan 9 per-process namespaces (anAppDatamounted alongside other storage is the union-namespace pattern), and the EROS persistence contrast (capOS keeps application-level persistence, not transparent single-level store).
Phasing
- AppData over RAM (near-term, self-contained): a userspace
AppDataservice plus a manifest grant; QEMU proof that two apps cannot see each other’s data. No powerbox/wrapper dependency. - Attenuation wrappers (implement the already-designed read-only/narrowed
File/Directorywrappers): prerequisite for sharing. shareAs+ Revoker (sharing-mint): once wrappers exist; adds the revocation membrane and a trusted clock on the sharing path.- FilePicker powerbox (gated on a trusted display path): start with a
session/shell-hosted text-mode selector; generalize to a
Powerboxand a trusted display service later.
Open questions
- What is the trusted-display primitive the powerbox selector renders through – a shell/session-hosted selector, or a new multiplexed display service (the nitpicker analog)?
- Should
AppDataquota integrate withresource-accounting-proposal.mdledgers, and how does it relate to a future unified human-facing storage view? - Does
shareAsbelong on each storage interface (File/Directory/Store) or on a separateSharingminting service that takes a cap and returns a narrowed one? - Is the first powerbox instance file/dir/storage-only, or should the general
Powerboxshape (mediating any resource cap) be defined up front?
Proposal: Google Drive Storage Backend
Status: future design. No implementation. The native backend is gated behind the userspace-driver authority gate, a userspace network stack, an outbound TLS client, an HTTP client, and the OAuth2 service – none of which exist yet. The browser-transport model is the near-term path and is already partially specified in
storage-and-naming-proposal.md.
Summary
Let a Google-authenticated user use their own Google Drive as a capOS storage
backend, exposed behind the same storage capabilities apps already use
(Store / Namespace / Directory / File, and the AppData cap from
docs/proposals/standard-app-capabilities-proposal.md). The user’s Drive –
specifically the per-app appDataFolder space – becomes the backing for an
app’s AppData cap, and selected user files become File caps minted through
the powerbox.
There are two delivery models, and this proposal keeps them explicit because they have very different trust and readiness profiles:
- Browser-transport (near-term): the user’s browser holds the Google OAuth
session and does the TLS/HTTP to Drive; capOS never sees Google tokens and
stores only encrypted capsules in
appDataFolder. This is already sketched instorage-and-naming-proposal.md(“User-Owned Browser Transport”) and is feasible without a capOS network/TLS/OAuth stack. - Native backend (deep-future): a capOS userspace service holds the OAuth refresh token and performs outbound HTTPS to the Drive API itself. This is the more capable model and the more demanding one – it sits behind the full network/TLS/HTTP/OAuth dependency chain.
In both models, Drive sits behind a backend adapter, not pretended to be a set of first-class local object caps (see Trust Model).
Why Drive
- The user already owns the storage and the quota; capOS does not provision a server.
- The
appDataFolderspace is a near-exact fit for the per-appAppDatacap: Google already provides per-app private storage invisible to the user and other apps under the narrow, non-sensitivedrive.appdatascope. - Drive’s
drive.file+ Picker consent model maps onto the capOS powerbox, so user-selected files become capabilities without granting the all-files scope. - It is a concrete, widely-available validation of the storage caps’
backend-agnosticism that
storage-and-naming-proposal.mdalready asserts (“Store service backed by virtio-blk, RAM, or network”).
Architecture
A userspace Drive storage service implements the standard storage cap interfaces and translates their methods into Drive REST calls:
app --(File/Directory/Store/AppData cap)--> Drive storage service
| uses
v
DriveAccount cap (OAuth tokens) --> OAuthClient / AccessToken
| (oidc-and-oauth2-proposal)
v
OutboundHttpRequest --> TLS client --> userspace net stack
(networking) (certificates-and-tls)
- The service consumes, does not redefine the OAuth capabilities from
oidc-and-oauth2-proposal.md(OAuthClient,AccessToken.authorize/attenuate,RefreshToken), passing each Drive request as theOutboundHttpRequeststruct thatAccessToken.authorizedecorates with the bearer credential. The refresh token lives in the OAuth service; the Drive service holds aDriveAccountcap that exposes only the typed operations the user consented to. - It consumes the outbound TLS client from
certificates-and-tls-proposal.mdand the HTTP client / userspace network stack fromnetworking-proposal.mdPhase C. - It is the network analog of the virtio-blk-backed FS service in
docs/proposals/storage-and-naming-proposal.md: sameDirectory/File/Storecaps in front, a different backend behind.
Concept mapping (Drive -> capOS standard caps)
| Drive | capOS standard cap (Proposal A) |
|---|---|
appDataFolder space (drive.appdata) | AppData cap backing |
drive.file + Picker selection | powerbox FilePicker returns a File cap |
| File id | the File cap handle |
| Folder | Directory cap |
Roles (reader/writer) | shareAs wrapper caps |
| Shortcut | Namespace binding |
| Revisions | content-addressed Store blobs + pointer |
| OAuth scopes | (not modeled internally) method-narrowed DriveAccount |
A key consequence: capOS does not model OAuth scopes internally. A
DriveAccount cap exposes only the methods the user consented to; “read-only
Drive access” is a DriveAccount whose wrapper omits write methods, not a
drive.readonly scope string re-checked server-side.
Dependency stack and gating
A native Drive client backend needs, bottom-up:
| Layer | Need | capOS state |
|---|---|---|
| NIC / virtio-net | packet I/O | partly present (virtio-net MSI-X + delivery); driver must move to userspace |
| TCP | reliable stream | present (smoltcp, in-kernel/transitional); must move to a userspace net process |
| TLS 1.2/1.3 | confidentiality + server auth (X.509 chain, trust roots, AEAD/ECDHE) | not implemented – certificates-and-tls-proposal.md is future design (rustls + webpki-roots planned); the hardest single piece |
| HTTP/1.1 or HTTP/2 | Drive REST transport (Google prefers HTTP/2) | not implemented |
| JSON | request/response + metadata; resumable upload state machine | tractable (serde_json no-std) |
| OAuth2 token flow | PKCE/device-flow handshake, refresh->access exchange, sealed refresh-token storage | designed but unimplemented (oidc-and-oauth2-proposal.md) |
| Trusted wall-clock | token expiry, cert validity, permission expirationTime | weak today; needed for TLS cert validity |
The native backend is therefore gated on, in order:
docs/backlog/hardware-boot-storage.md Task 5 (userspace-driver authority
gate) -> networking-proposal.md Phase C (userspace net stack + NIC driver) ->
certificates-and-tls-proposal.md (outbound TLS) -> an HTTP client ->
oidc-and-oauth2-proposal.md (OAuth service). This is the same authority gate
that blocks userspace networking generally; the Drive backend is one of its
downstream consumers, not a way around it.
Delivery models
Browser-transport (near-term)
The user’s browser, already authenticated to Google, holds the OAuth session
and performs the TLS/HTTP to Drive. capOS hands the browser an opaque,
client-side-encrypted capsule to store in the app’s appDataFolder; capOS
never sees Google tokens. This reuses the remote-session / browser-capability
surface and the KMS envelope-encryption pattern in
storage-and-naming-proposal.md (“User-Owned Browser Transport”). It is
feasible before any capOS network/TLS/OAuth stack exists, and is the
recommended first delivery. This proposal’s role here is to reconcile that
existing section with the AppData/powerbox vocabulary, not to redefine it.
Native backend (deep-future)
A capOS Drive service holds the OAuth refresh token and does outbound HTTPS
itself. For a headless/embedded OS the realistic OAuth flows are
authorization-code + PKCE with a loopback redirect (http://127.0.0.1:port)
when a same-host browser is reachable, otherwise the device flow (show URL +
code on one device, poll for tokens). PKCE is non-negotiable – capOS has no
trustworthy on-device confidential client secret. Token lifecycle: persist only
the refresh token in a sealed cap (an AppData-style or credential_store
secret), exchange for short-lived access tokens on demand, and treat the access
token as an ephemeral bearer credential passed to the HTTP path, never
persisted.
Trust model and honest mismatches
A Drive-backed File is not a true local object capability – it is a
bearer credential to a remote, server-authoritative ACL. The authority lives
on Google’s server, which re-checks it on every request against a mutable table.
Consequences the design must respect:
- No local revocation/attenuation guarantee. Dropping a local handle does
not revoke access Google still grants; narrowing a
DriveAccountwrapper does not change Google’s server-side scope. capOS can wrap Drive behind the adapter but cannot give a remote file the local revocation/attenuation semantics of a true cap. - Offline = non-functional. Unlike a local cap, a Drive-backed cap is dead without network.
- Global mutable namespace / instant org-wide revoke are Drive server-authoritative features with no clean local-cap equivalent; they stay behind the adapter.
- Quota is Drive’s per-user pool, not a per-cap budget; an app’s
appDataFolderusage counts against the human’s Drive quota.
Therefore Drive is exposed strictly as a backend adapter that serves the storage caps with documented remote semantics, never as a drop-in for local object caps. Apps that need local revocation/attenuation/offline guarantees should use a local backend; apps that want the user’s Drive accept the remote semantics.
Phasing
- Reconcile + capsule model (near-term, browser-transport): align the
existing “User-Owned Browser Transport” section of
storage-and-naming-proposal.mdwith theAppData/powerbox vocabulary; define the encrypted-capsule format and theappDataFoldercapsule lifecycle. No capOS network/TLS/OAuth dependency. - OAuth service + outbound HTTPS prerequisites (deep-future): land the gated chain (userspace net stack, TLS, HTTP, OAuth service) per their own proposals. This proposal only consumes them.
- Native
DriveAccount+ Drive storage service (deep-future): implement the service that mapsAppData/File/Directory/Storeonto Drive REST using the OAuth/TLS/HTTP caps; prove anappDataFolderround-trip and a powerbox-picked file read in a QEMU smoke against a Drive API stand-in. - Sharing bridge (future): map
shareAsto Drive permissions where the remote semantics allow, with the bearer/clawback caveats flagged.
Relationship to existing proposals
docs/proposals/standard-app-capabilities-proposal.md– defines theAppData/powerbox/shareAscaps this backend serves.docs/proposals/storage-and-naming-proposal.md– owns the storage caps, the “Managed Cloud Backing” and “User-Owned Browser Transport” sections (the near-term Drive path), and the backend-agnosticism this validates.docs/proposals/oidc-and-oauth2-proposal.md– the OAuth token capabilities this backend consumes; the refresh token lives there.docs/proposals/certificates-and-tls-proposal.md– the outbound TLS client.docs/proposals/networking-proposal.md– Phase C userspace net stack + the HTTP client; the shared authority gate.docs/research/{eros-capros-coyotos,plan9-inferno}.md– application-level persistence (vs transparent single-level store) and per-process namespaces (a Drive backend unioned into an app namespace alongside local storage).
Open questions
- Is the encrypted-capsule (browser-transport) model sufficient for the first user-facing Drive feature, deferring the native backend until the network stack is real?
- Where does the refresh token live – the OAuth service’s own sealed store, a
credential_storeextension, or a dedicatedDriveAccountobject? - Does the native backend target the Drive REST API directly, or go through a capOS-hosted proxy that holds the Google credentials (narrowing the on-device trust surface, at the cost of running a proxy)?
- How are Drive’s server-side semantics (revocation, quota, mutable ACL)
surfaced to apps so they are not surprised by a
Filecap that behaves unlike a local one?
Proposal: Error Handling for Capability Invocations
How capOS communicates errors from capability calls back to userspace processes.
Current design authority now lives in Error Handling. This proposal is retained as the archival decision record and original rationale for the implemented two-level model.
This proposal defines a two-level error model: transport errors (the invocation mechanism itself failed) and application errors (the capability processed the request and returned a structured error). The design aligns with Cap’n Proto’s own exception model and the patterns used by seL4, Zircon, and other capability systems.
Status note: The shared-memory capability ring +
cap_enterhas replacedcap_callas the invocation surface, and the two-level error model described below is implemented for the current ring, runtime, and endpoint IPC surface. Transport errors arrive as negativeCapCqe.resultcodes (see “Current CQE Error Namespace”); application errors arrive as a serializedCapExceptionwithCAP_ERR_APPLICATION_EXCEPTION. TheCapExceptionschema andExceptionTypetaxonomy live inschema/capos.capnp(enum ExceptionTypeandstruct CapExceptionnear the bottom of the schema), the kernel side serializes them throughkernel/src/cap/ring.rs(including theINVALID_ARGUMENT_SENTINELchannel for the capOS-onlyinvalidArgumentvariant), andcapos-rt/src/client.rsdecodes them intoClientError::Application(ApplicationException).Related documents:
docs/architecture/error-handling.mdis the current design authority for the implemented error layers.docs/architecture/capability-ring.mdowns the current ring transport contract that carries the CQE status values.docs/proposals/service-architecture-proposal.mdcaptures the cross-process spawn and revoked-endpoint surface that exercisesDisconnectedand the endpoint RETURN exception flag end-to-end.docs/design-risks-register.mdrecords the open contracts that flow into this proposal: R6 (deferredCAP_OP_RELEASE) and R15 (application-exception serialization depends on result-buffer capacity).docs/capability-model.mddescribes the broader capability model the error layers sit inside; this proposal owns only the error model.The “Problem Statement”, “Syscall Return Convention”, “Kernel Implementation”, “Userspace API”, and “Migration Path” sections below describe the original
cap_call-era design that motivated the model. They are kept as historical context; the “Current CQE Error Namespace”, “CapException Schema”, and “Application-Level Errors in Interface Schemas” sections describe current behavior.
Current CQE Error Namespace
The capability ring uses signed 32-bit CapCqe.result values. Non-negative
values are opcode-specific success results; negative values are kernel transport
errors defined in capos-config/src/ring.rs:
| Code | Name | Meaning |
|---|---|---|
-1 | CAP_ERR_INVALID_REQUEST | Malformed request metadata or an opcode value not reserved in the ABI. |
-2 | CAP_ERR_INVALID_PARAMS_BUFFER | SQE parameter buffer is unmapped, out of range, or not readable. |
-3 | CAP_ERR_INVALID_RESULT_BUFFER | SQE result buffer is unmapped, out of range, or not writable. |
-4 | CAP_ERR_INVOKE_FAILED | Capability lookup or invocation failed before a successful result was produced. |
-5 | CAP_ERR_UNSUPPORTED_OPCODE | Opcode is reserved in the ABI but not yet dispatched. Currently returned for CAP_OP_FINISH; CAP_OP_RELEASE has kernel dispatch and reports stale/non-owned caps as request/invoke failures. |
-6 | CAP_ERR_TRANSFER_NOT_SUPPORTED | Transfer mode or sideband descriptor layout is recognized as unsupported by this kernel. |
-7 | CAP_ERR_INVALID_TRANSFER_DESCRIPTOR | xfer_cap_count descriptor layout malformed or contains reserved bits. |
-8 | CAP_ERR_TRANSFER_ABORTED | Transaction-in-progress transfer failed and must not produce partial capability state. |
-9 | CAP_ERR_APPLICATION_EXCEPTION | A structured CapException was serialized into the caller-provided result buffer. |
-10 | CAP_ERR_APPLICATION_EXCEPTION_TRUNCATED | An application exception occurred, but no detail fit in the available result buffer. |
This is deliberately a small transport namespace. Interface-specific failures should be encoded in the result payload once the target capability successfully handles the request.
Revoked capabilities use the same application-exception path when the caller
provided a result buffer. Ordinary capability CALLs and endpoint CALL/RECV on a
revoked cap serialize a Disconnected CapException and complete with
CAP_ERR_APPLICATION_EXCEPTION. Runtime clients decode that CQE into
ClientError::Application(ApplicationException { type: Disconnected, ... }).
Endpoint RETURN is asymmetric because the result belongs to the original caller,
not the returning receiver. A receiver can set
CAP_SQE_RETURN_APPLICATION_EXCEPTION on CAP_OP_RETURN to return a
serialized CapException to the original caller; the receiver’s own RETURN CQE
still reports only whether the RETURN transport succeeded. If a receiver tries
to RETURN through a revoked endpoint while an in-flight caller still has a
result buffer, the kernel first preflights completion-queue space for both
caller and receiver, then removes the in-flight call, serializes a
Disconnected exception into the caller’s buffer, and posts the caller
completion with CAP_ERR_APPLICATION_EXCEPTION. The receiver always gets
CAP_ERR_APPLICATION_EXCEPTION_TRUNCATED because revoked RETURN has no
receiver-owned result payload. If the caller did not provide a result buffer,
the caller also receives the truncated code. Lookup or CQ-space failures that
cannot be tied to a result buffer remain transport failures.
Revoking an endpoint cap through a child CapabilityManager also cancels
endpoint wait state on that object: owner endpoint revoke cancels all queued
calls, pending receives, and in-flight calls, while non-owner endpoint facet
revoke cancels entries tied to the managed child pid. Those cancellation
completions use the existing endpoint-cancel transport result because they
describe already-pending SQEs, not a fresh invocation with a result buffer.
Current Implementation Inventory
Implemented typed exception paths:
- Ordinary
CAP_OP_CALLcapability implementations that returncapnp::Errorare serialized asCapExceptionpayloads when the SQE has a writable result buffer.capnp::ErrorKind::{Failed, Overloaded, Disconnected, Unimplemented}map to the matchingExceptionType; all other Cap’n Proto decode/validation kinds map toFailed. - Ordinary revoked-cap calls serialize
Disconnectedwhen a result buffer is present. - Endpoint CALL and RECV on a revoked endpoint serialize
Disconnectedwhen a result buffer is present. - Live endpoint CALL target errors that arise after a valid endpoint cap is
identified serialize as
CapExceptionwhen the caller supplies a result buffer. Endpoint queue-capacity, parameter-slot, call-id, and in-flight capacity failures are reported asOverloaded. - Endpoint RETURN through a revoked endpoint reports
Disconnectedto the original caller when that caller has a result buffer, and reports the receiver-side no-payload/truncated application-exception code. - Endpoint RETURN with
CAP_SQE_RETURN_APPLICATION_EXCEPTIONcopies the receiver-provided serializedCapExceptionto the original caller and postsCAP_ERR_APPLICATION_EXCEPTION; if no payload fits, the original caller getsCAP_ERR_APPLICATION_EXCEPTION_TRUNCATED. capos-rtdecodesCAP_ERR_APPLICATION_EXCEPTIONintoClientError::Application(ApplicationException)and treatsDisconnectedas breaking the local capability handle. Truncated application exceptions decode asFailedwith an empty diagnostic message. Endpoint servers can usecapos-rt’ssubmit_endpoint_return_exception()helper to produce that RETURN shape.
Intentional generic transport paths:
- Capability lookup failures before a target object is identified still return
CAP_ERR_INVOKE_FAILED; these remain transport errors. - Malformed SQE metadata, bad params/result buffers, unsupported opcodes, and malformed transfer descriptors remain transport errors.
- Endpoint delivery/receive/return rollback failures that arise while restoring
queues, committing sideband transfers, posting to completion queues, or
writing endpoint payloads still use
CAP_ERR_INVOKE_FAILED,CAP_ERR_TRANSFER_ABORTED, orCAP_ERR_INVALID_RESULT_BUFFER. Result-buffer validation and endpoint payload copy failures are transport errors because no safe payload destination exists. - Existing QEMU coverage proves
Disconnectedfor revocation and one ordinary localUnimplementedruntime path. Theendpoint-roundtripQEMU demo proves local live-endpointOverloadedserialization for endpoint queue saturation. Cross-processDisconnectedis covered for revoked endpoint use, andmake run-spawnnow proves cross-process endpoint RETURN propagation forFailed,Overloaded, andUnimplementedapplication exceptions. The same focused spawn proof runsring-reserved-opcodes, which checks that the RETURN exception flag is rejected outside its valid shape and that an endpoint caller with no result buffer receivesCAP_ERR_APPLICATION_EXCEPTION_TRUNCATED.
Target Contract
For this milestone, a kernel path should produce a typed CapException when
all of the following are true:
- A capability invocation target was identified, or an endpoint operation is acting on an already accepted call/receive relationship.
- The failure is attributable to invocation semantics rather than malformed ring transport metadata.
- The affected caller supplied a result buffer that can hold a serialized exception.
If the same invocation-level failure occurs with no result buffer or an
insufficient result buffer, the CQE result is
CAP_ERR_APPLICATION_EXCEPTION_TRUNCATED. If no target capability or accepted
IPC relationship exists, the failure stays in the transport namespace. Result
buffer validation failure also stays transport-level because no safe payload
destination exists.
The exception serialization path respects two per-process resource-profile
limits wired from the manifest ResourceProfile fields (both defaulting to
65 536 bytes, the kernel ceiling):
ringScratchLimitBytes– bounds the ring input and output scratch buffers. Any CALL withparams_lenexceeding the effective input limit is rejected withCAP_ERR_INVALID_REQUESTat the transport layer before capability dispatch.replyScratchLimitBytes– bounds the reply scratch used byserialize_application_exception_to_userandserialize_disconnected_exception_to_process. The effective reply limit ismin(replyScratchLimitBytes, ringScratchLimitBytes); if the serialized exception exceeds this limit, the caller receivesCAP_ERR_APPLICATION_EXCEPTION_TRUNCATEDinstead. Prior to this wiring, reply scratch was unconstrained at the global 64 KiB ceiling regardless of the process’sringScratchLimitBytes, which caused spurious TRUNCATED results for tightly constrained processes. Both limits are enforced as of commit 4fc0466d (replyScratchLimitBytes) and commit 1bcfbad4 (ringScratchLimitBytes).
The exception types keep their Cap’n Proto client-response meaning;
InvalidArgument is the capOS-only addition introduced with Scheduler
Phase D Task 1 (commit cb8c58b1, 2026-05-07). The canonical worked example
is SchedulingPolicyCap.setWeight in schema/capos.capnp, whose schema
comment states the cap rejects out-of-range or zero values with a
CapException of type invalidArgument and does NOT silently clamp:
Failed: deterministic invocation failure, deserialization error, or a target-side invariant failure. New caps that validate parameters at the cap boundary should returnInvalidArgumentinstead ofFailedfor caller bugs;Failedis for “the cap tried and could not”.Overloaded: temporary resource exhaustion after a valid target invocation has begun.Disconnected: target object, endpoint facet, or peer relationship is gone.Unimplemented: target object is live but does not implement the requested method.InvalidArgument: the cap accepted the call (target lives, message parsed) but a parameter value violates the documented contract. Distinct fromFailedbecause the caller is expected to correct its input and retry, not back off or treat the cap as broken. Carried on the wire today throughINVALID_ARGUMENT_SENTINELinkernel/src/cap/ring.rs; userspace decode incapos-rt::client::ApplicationExceptionreturnsExceptionType::InvalidArgument.
Exception messages are diagnostic only. They must not include kernel pointers, secret payload bytes, or other process-private data.
Schema Style Guide
Use the three error layers consistently:
| Layer | Use for | Do not use for |
|---|---|---|
| CQE status | Ring, transport, kernel dispatch, malformed SQE, missing target, invalid buffer, unsupported ABI/version, and other failures where no safe capability-level payload exists. | Normal service/domain outcomes. |
CapException | Capability-level infrastructure failure after a target or accepted endpoint relationship exists: decode failure, unknown method, target gone, temporary overload after dispatch, or target invariant failure. | Expected application/domain rejection. |
| Schema result union | Ordinary application or domain outcome: not found, permission denied by service policy, invalid business object, quota denied as a declared operation result, or accepted conditional failure. | Ring/transport failure or generic catch-all exceptions. |
Generated clients and future capos-service helpers should preserve this
split: CQE status is transport failure, decoded CapException is capability
infrastructure failure, and method result unions are the normal application
error surface.
Use CQE status for ring transport errors, invalid SQE layout, invalid cap slot, kernel dispatch failure, buffer access failure, unsupported ring ABI/SQE version, malformed transfer descriptors, and other transport-level failures where no safe typed payload boundary exists.
Use CapException for capability infrastructure failure: unknown method,
revoked capability, stale endpoint/session, permission or authority failure,
resource exhaustion at a capability boundary, service unavailable, and
unimplemented method.
Use schema result unions for normal domain/application outcomes:
notFound, permissionDenied as a domain decision, invalidInput with domain
meaning, alreadyExists, conflict, validation failure, and accepted/rejected
business results.
Anti-rules:
- Do not encode ordinary application outcomes as
CapException. - Do not expose internal traces, filesystem paths, kernel pointers, or service-local details in cross-service exceptions by default.
- Do not use generic
Texterrors where a stable union variant is possible. - Do not overload
CapException::failedfor every domain-level failure.
Preferred schema shape for ordinary domain outcomes:
struct OpenResult {
union {
file @0 :File;
notFound @1 :Void;
permissionDenied @2 :Void;
invalidPath @3 :Void;
unsupported @4 :Void;
}
}
Transfer-related transport mapping (3.6.0 ABI slice)
CAP_ERR_TRANSFER_NOT_SUPPORTEDis used for transfer-bearing SQEs that the kernel currently dispatches but does not yet process (xfer_cap_count != 0on kernels where sideband transfer is off).CAP_ERR_INVALID_TRANSFER_DESCRIPTORis used for structurally validly dispatched transfer SQEs where transfer metadata is malformed:- descriptor
transfer_modeis not exactlyCAP_TRANSFER_MODE_COPYorCAP_TRANSFER_MODE_MOVE; - any descriptor reserved bits are set;
- any descriptor
_reserved0field is non-zero; - descriptor region placement (
addr + len) is misaligned; - descriptor range overflows or cannot be safely bounded.
- descriptor
CAP_ERR_TRANSFER_ABORTEDis reserved for transaction failure after partial transfer side effects are prepared and must not be observed (all-or-nothing rollback boundary).CAP_ERR_INVALID_REQUESTremains for non-transfer transport malformation (unsupported opcodes for today, unsupported SQE fields not part of the transfer path, and malformed result/payload buffer pairs).
Historical: Pre-Ring cap_call Design
The sections from “Problem Statement” through “Migration Path” describe the
original cap_call synchronous syscall that preceded the capability ring.
They are preserved for design context; see the “Current CQE Error Namespace”
and “CapException Schema” sections above for current behavior.
Problem Statement
Currently, cap_call returns u64::MAX on any error and prints the details
to the kernel serial console. The userspace process receives no information
about what went wrong – it cannot distinguish “invalid capability ID” from
“method not implemented” from “out of memory inside the service.”
Every other capability system separates transport-level errors (bad handle, message validation failure) from application-level errors (the service processed the request and returned a meaningful error). capOS needs both.
Background: How Other Systems Do This
Cap’n Proto RPC Protocol
The Cap’n Proto RPC specification defines an Exception type in rpc.capnp:
struct Exception {
reason @0 :Text;
type @3 :Type;
enum Type {
failed @0; # deterministic failure, retrying won't help
overloaded @1; # temporary resource exhaustion, retry with backoff
disconnected @2; # connection to a required capability was lost
unimplemented @3; # method not supported by this server
}
trace @4 :Text;
}
These four types describe client response strategy, not error semantics.
The capnp Rust crate maps them to capnp::ErrorKind::{Failed, Overloaded, Disconnected, Unimplemented}.
Cap’n Proto’s official philosophy (from KJ library and Kenton Varda’s writings): exceptions are for infrastructure failures, not application semantics. Application-level errors should be modeled as unions in method return types.
Cloudflare Workers RPC and Spritely/OCapN CapTP reinforce the network-boundary
rule: remote promise breakage and error values are diagnostic material, not
authority inputs, and debug details such as traces or internal paths can leak
sensitive information. Future Workers RPC, Cap’n Web, CapTP, or OCapN-style
adapters must deliberately map remote errors into CapException or schema
result unions and strip or seal debug detail at the boundary. See
Cloudflare, Cap’n Proto, Workers RPC, and Cap’n Web
and
Spritely, OCapN, and CapTP.
Capability OS Error Models
| System | Transport errors | Application errors |
|---|---|---|
| seL4 | seL4_Error enum (11 values) from syscall return | In-band via IPC message payload (user-defined) |
| Zircon | zx_status_t (signed i32, ~30 values) from syscall | FIDL per-method error type (union in return) |
| EROS/Coyotos | Kernel-generated invocation exceptions | OPR0.ex flag + exception code in reply payload |
| Plan 9 (9P) | Connection loss (no in-band transport error) | Rerror message with UTF-8 error string |
| Genode | Ipc_error exception | Declared C++ exceptions via GENODE_RPC_THROW |
Common pattern: a small kernel error code set for transport failures, combined with service-specific typed errors for application failures.
POSIX errno: Why Not
POSIX errno is a global flat namespace of ~100 integers that conflates
transport errors (EBADF) with application errors (ENOENT). In a
capability system:
EACCES/EPERMdon’t apply – if you have the capability, you have permission; if you don’t, you can’t even name the resource.- A global error namespace conflicts with typed interfaces where errors should be scoped to the interface.
- No room for structured information (which argument was invalid, how much memory was needed).
- Not composable across trust boundaries – a callee’s errno has no meaning in the caller’s address space without explicit serialization.
Design
Principle: Two Levels, One Wire Format
Level 1 – Transport errors are returned in the syscall return value.
These indicate that the capability invocation mechanism itself failed before
the target CapObject was reached. No result buffer is written.
Level 2 – Application errors are returned as capnp-serialized messages in the result buffer. The capability was found and dispatched; the implementation returned a structured error. The syscall return value distinguishes this from a successful result.
Both levels use Cap’n Proto serialization for the error payload (level 2 always, level 1 when there’s a result buffer available). This keeps one parsing path in userspace.
Syscall Return Convention
The cap_call syscall (number=2) currently returns:
0..N– success, N bytes written to result bufferu64::MAX– error (undifferentiated)
New convention:
| Return value | Meaning |
|---|---|
0..=(u64::MAX - 256) | Success. Value = number of bytes written to result buffer. |
u64::MAX | Transport error: invalid capability ID or stale generation. |
u64::MAX - 1 | Transport error: invalid user buffer (bad pointer, unmapped, not writable). |
u64::MAX - 2 | Transport error: params too large (exceeds MAX_CAP_CALL_PARAMS). |
u64::MAX - 3 | Application error: the capability returned an error. A CapException message has been written to the result buffer. The message length is encoded in the low 32 bits of the value at result_ptr (the capnp message itself). |
u64::MAX - 4 | Application error, but the result buffer was too small or NULL. The error detail is lost; the caller should retry with a larger buffer or treat it as an opaque failure. |
The transport error codes are a small closed set (like seL4’s 11 values). New transport errors can be added, but the set should remain small and stable.
CapException Schema
Added to schema/capos.capnp:
enum ExceptionType {
failed @0;
overloaded @1;
disconnected @2;
unimplemented @3;
invalidArgument @4;
}
struct CapException {
type @0 :ExceptionType;
message @1 :Text;
}
This mirrors Cap’n Proto RPC’s Exception struct, plus a capOS-only
invalidArgument variant added with the Scheduler Phase D Task 1
schema slice (commit cb8c58b1, 2026-05-07). Capnp’s upstream Exception.Type
remains a closed four-value set; capOS extends CapException because a
capability boundary that validates arguments needs a typed signal
distinct from failed. The five types describe client response
strategy:
- failed – deterministic failure on the callee side, retrying
won’t help. Covers invariant violations, deserialization errors, and
any
capnp::ErrorKindvariant not in the other categories. As of the Phase D Task 1 slice, callee-side argument rejection no longer maps here – new caps that validate inputs at the cap boundary should returninvalidArgumentinstead. - overloaded – temporary resource exhaustion (out of frames, table full). Client may retry with backoff.
- invalidArgument – the request was syntactically a well-formed
capnp message but a parameter value violated the cap’s documented
contract (e.g.
SchedulingPolicyCap.setWeightrejectingweight = 0or values outside[MIN_WEIGHT, MAX_WEIGHT]). The kernel does not silently clamp; the caller is expected to fix its input and retry, not back off. Today this is signalled by kernel cap modules through a small sentinel-prefix channel inkernel/src/cap/ring.rs(INVALID_ARGUMENT_SENTINEL) because capnp 0.25 has noErrorKind::InvalidArgumentand the enum is#[non_exhaustive]. The dispatcher strips the sentinel before serializing theCapExceptionso the wire form is identical to the four upstream-aligned variants. - disconnected – the capability’s backing resource is gone (device removed, process exited). Client should re-acquire the capability.
- unimplemented – unknown method ID for this interface. Client should not retry.
The message field is a human-readable string for diagnostics/logging.
It must not contain security-sensitive information (internal pointers, kernel
addresses) since it crosses the kernel-user boundary.
Application-Level Errors in Interface Schemas
Following Cap’n Proto’s philosophy, expected error conditions that a caller should handle programmatically belong in the method return type, not in the exception mechanism.
Example – FrameAllocator can legitimately run out of memory:
struct AllocResult {
union {
ok @0 :UInt16; # result-cap handle index for a MemoryObject
outOfMemory @1 :Void;
}
}
interface FrameAllocator {
allocFrame @0 () -> (result :AllocResult);
allocContiguous @1 (count :UInt32) -> (result :AllocResult);
}
The caller can pattern-match on the result union without parsing an exception. This is the Zircon/FIDL model: transport errors at the syscall layer, application errors as typed return values.
When to use each:
| Situation | Mechanism |
|---|---|
| Bad cap ID, stale generation, bad buffer | Transport error (syscall return code) |
| Deserialization failure, unknown method | CapException with failed/unimplemented |
| Temporary resource exhaustion in dispatch | CapException with overloaded |
| Expected domain-specific error | Union in method return type |
| Bug in capability implementation | CapException with failed |
Kernel Implementation
CapObject trait change
The ring SQE does not carry a caller-supplied interface ID. The trait shape below keeps interface selection out of capability implementations because each capability entry owns one public interface:
#![allow(unused)]
fn main() {
pub trait CapObject: Send + Sync {
fn interface_id(&self) -> u64;
fn label(&self) -> &str;
fn call(
&self,
method_id: u16,
params: &[u8],
result: &mut [u8],
reply_scratch: &mut dyn ReplyScratch,
) -> capnp::Result<CapInvokeResult>;
}
}
Implementations serialize directly into the caller’s result buffer and return
a completion containing the number of bytes written, or Pending for async
endpoint calls. Dispatch uses the interface assigned to the target capability
entry; normal CALL SQEs do not need to repeat that interface ID. capnp::Error
carries ErrorKind with the four RPC exception types. The kernel’s dispatch
handler converts Err(capnp::Error) into a serialized CapException message
and writes it to the result buffer.
Syscall handler changes
In cap_call(), the error path changes from:
#![allow(unused)]
fn main() {
Err(e) => {
kprintln!("cap_call: ... error: {}", e);
u64::MAX
}
}
to:
#![allow(unused)]
fn main() {
Err(CapError::NotFound) => ECAP_NOT_FOUND,
Err(CapError::StaleGeneration) => ECAP_NOT_FOUND,
Err(CapError::InvokeError(e)) => {
// Serialize CapException to result buffer
let exception_bytes = serialize_cap_exception(&e);
if result_ptr != 0 && result_capacity >= exception_bytes.len() {
copy_to_user(result_ptr, &exception_bytes);
ECAP_APPLICATION_ERROR
} else {
ECAP_APPLICATION_ERROR_NO_BUFFER
}
}
}
The serialize_cap_exception function maps capnp::ErrorKind to
ExceptionType:
capnp::ErrorKind | ExceptionType |
|---|---|
Failed | failed |
Overloaded | overloaded |
Disconnected | disconnected |
Unimplemented | unimplemented |
| All other variants (deserialization, validation) | failed |
This matches how capnp-rpc maps exceptions to the wire format.
Userspace API
The init crate (and future userspace libraries) wraps cap_call in a
helper that interprets the return value:
#![allow(unused)]
fn main() {
pub enum CapCallResult {
Ok(Vec<u8>),
Exception(ExceptionType, String),
TransportError(TransportError),
}
pub enum TransportError {
InvalidCapability,
InvalidBuffer,
ParamsTooLarge,
}
pub fn cap_call(
cap_id: u32,
method_id: u16,
params: &[u8],
result_buf: &mut [u8],
) -> CapCallResult {
let ret = sys_cap_call(cap_id, method_id, params, result_buf);
match ret {
ECAP_NOT_FOUND => CapCallResult::TransportError(TransportError::InvalidCapability),
ECAP_BAD_BUFFER => CapCallResult::TransportError(TransportError::InvalidBuffer),
ECAP_PARAMS_TOO_LARGE => CapCallResult::TransportError(TransportError::ParamsTooLarge),
ECAP_APPLICATION_ERROR => {
let (typ, msg) = deserialize_cap_exception(result_buf);
CapCallResult::Exception(typ, msg)
}
ECAP_APPLICATION_ERROR_NO_BUFFER => {
CapCallResult::Exception(ExceptionType::Failed, String::new())
}
n => CapCallResult::Ok(result_buf[..n as usize].to_vec()),
}
}
}
Future: Batched Calls
When capOS adds batched capability invocations (async rings, pipelining), each request in the batch gets its own result status. The same two-level model applies per-request:
- Transport error for the batch envelope (invalid ring descriptor, bad capability table) fails the whole batch.
- Per-request transport errors (individual bad cap_id) fail that request.
- Application errors are per-request, written to each request’s result slot.
This matches how NFS compound operations and JSON-RPC batch requests work: a transport error on the batch vs per-operation results.
What This Does NOT Cover
- Error logging/tracing infrastructure. How errors get collected,
aggregated, or displayed is a separate concern, owned by
docs/proposals/system-monitoring-proposal.md. The kernel currently prints to serial; a futureErrorLog/ audit-log capability captures structured error streams there. - Retry policy. The
ExceptionTypehints at retry strategy (overloaded -> retry, failed -> don’t, invalidArgument -> fix input and retry), but the retry logic itself belongs in userspace libraries, not the kernel. - Error propagation across capability chains. When capability A calls
capability B which calls capability C, and C fails – how does the error
propagate back through A? The single-hop transport-vs-application split is
defined here; the cross-process spawn and endpoint-return surface that
exercises it end-to-end is owned by
docs/proposals/service-architecture-proposal.mdtogether with theCAP_SQE_RETURN_APPLICATION_EXCEPTIONshape incapos-config/src/ring.rs. - Result-buffer sizing. Truncation of serialized
CapExceptionpayloads when callers under-size their result buffer is tracked as R15 indocs/design-risks-register.md. The per-processringScratchLimitBytesandreplyScratchLimitBytesresource-profile fields now bound the reply scratch used at both serialization call sites, eliminating spurious TRUNCATED results for constrained processes. Each cap contract should still document its expected result-buffer capacity rather than relying on truncation behavior. - Deferred release vs revocation. Owned-handle Drop in
capos-rtenqueuesCAP_OP_RELEASErather than running synchronously; resource- pressure or revocation-sensitive flows that depend on aDisconnectedsurface must follow R6 indocs/design-risks-register.mdand preferCapabilityManager.revokeor epoch revocation rather than relying on Drop ordering. - Transactional semantics. Whether a failed operation has side effects
(partial writes, allocated-but-not-returned frames) is per-capability
semantics, not a kernel-level concern. The transfer-rollback boundary
carried by
CAP_ERR_TRANSFER_ABORTEDis the only transport-level all-or-nothing guarantee.
Migration Path
Phase 1: Transport error codes (minimal, no schema changes)
Change cap_call to return distinct error codes instead of u64::MAX for
all failures. Update the init crate to interpret them. No new schema types
needed – application errors still use u64::MAX - 3 but without a structured
payload (treated as opaque failure).
This is backward-compatible: existing userspace code that checks == u64::MAX
sees different values for different errors, but any >= u64::MAX - 255 check
catches all errors.
Phase 2: CapException serialization
Add ExceptionType and CapException to the schema. Implement
serialize_cap_exception in the kernel. Update init to deserialize and
display errors. Now userspace gets the exception type and message string.
Phase 3: Per-interface application errors
As interfaces mature, add typed error unions to method return types for
expected error conditions. FrameAllocator::allocFrame returns
AllocResult instead of bare UInt64. The exception mechanism remains for
unexpected failures.
Design Rationale
Why mirror capnp RPC’s Exception type instead of inventing our own?
Cap’n Proto already defines a well-thought-out exception taxonomy. The four
types (failed, overloaded, disconnected, unimplemented) map directly to
capnp::ErrorKind in Rust. Using the same vocabulary means capOS capabilities
can eventually participate in capnp RPC networks without translation. It also
means the Rust compiler enforces exhaustive matching on ErrorKind variants
that matter.
Why not put error codes in the syscall return value only (like seL4)?
seL4’s 11 error codes work because seL4 kernel objects are simple and
fixed-function. capOS capabilities are arbitrary typed interfaces – a file
system, a network stack, a GPU driver. The error vocabulary is open-ended.
Encoding all possible errors as syscall return values would either require an
ever-growing enum (fragile) or lose information (back to errno’s problems).
The capnp-serialized CapException in the result buffer gives unbounded
expressiveness without changing the syscall ABI.
Why not use capnp exceptions for everything (skip the transport error codes)?
Because transport errors happen before the capability is reached. There’s
no CapObject to serialize an exception. The kernel would have to synthesize
a capnp message on behalf of a non-existent capability, which is wasteful and
semantically wrong. A small integer return code is cheaper and more honest
about what happened.
Why not define a generic Result(Ok) wrapper in the schema?
Cap’n Proto generics only bind to pointer types (Text, Data, structs, lists,
interfaces), not to primitives (UInt32, Bool). A Result(UInt64) for
allocFrame wouldn’t work. Per-method result structs with unions are more
flexible and don’t hit this limitation. The cost is a bit more schema
boilerplate, which is acceptable given that capOS has a small number of
interfaces.
Why string-based messages (like Plan 9) instead of structured error fields?
String messages are adequate for diagnostics and logging. Structured error
data belongs in the typed return unions (Phase 3), where the schema enforces
what fields exist. Putting structured data in CapException would duplicate
the schema’s job and encourage using exceptions for flow control, which
Cap’n Proto explicitly warns against.
Security Review and Formal Verification Proposal
How to reason about the correctness and security of the capOS kernel and its
trust boundaries in a way that fits a research OS – pragmatic tooling now,
targeted verification where it pays off, no aspirational seL4-style full-
kernel proofs. The docs/research/sel4.md survey already concluded that
Isabelle/HOL-over-C verification does not transfer to Rust and that the
design constraints matter more than the proof artefact. This proposal
codifies that conclusion into a concrete tooling and process plan.
This proposal uses CWE for concrete vulnerability classes, CAPEC for attacker patterns, Rust language rules / unsafe-code guidance for low-level coding rules, Common Criteria protection-profile concepts for OS security functions, ITU-T X.800/X.805 security-services taxonomy as a completeness checklist, and capability-kernel practice (seL4/EROS-style invariants) for authority, IPC, object lifetime, and scheduler properties. Web-application checklists are not the baseline for OS design review.
Grounding sources:
- MITRE CWE for root-cause weakness labels: CWE-20 explicitly covers raw data, metadata, sizes, indexes, offsets, syntax, type, consistency, and domain rules; CWE also marks broad classes such as CWE-20 and CWE-400 as discouraged for final vulnerability mapping when a more precise child fits.
- MITRE CAPEC for attacker behavior, especially input manipulation (CAPEC-153), command injection (CAPEC-248), race exploitation (CAPEC-26 / CAPEC-29), and flooding/resource pressure (CAPEC-125).
- Rust Reference
and
Rust 2024 Edition Guide
for unsafe-block and
unsafe_op_in_unsafe_fnobligations. - seL4 MCS and the existing capOS research notes for capability-authorized access to kernel objects and CPU time.
- Common Criteria General Purpose Operating System Protection Profile for OS access-control, security-function, trusted-channel/path, and user-data protection concepts. capOS is not trying to certify against it; the PP is a vocabulary check for what an OS security review should not omit.
- ITU-T Rec. X.800 (03/91) Security architecture for OSI and X.805 (10/03) Security architecture for systems providing end-to-end communications for the layered security-services taxonomy: authentication, access control, non-repudiation, data confidentiality, data integrity, availability, privacy × infrastructure/services/ applications planes × end-user/control/management planes. Used as a completeness matrix: if a proposal claims to cover security but leaves one cell unaddressed (e.g. “we have confidentiality but no non-repudiation story for the management plane”), review should flag the gap. Also ITU-T X.810-X.816 for the individual framework breakdowns — authentication (X.811), access control (X.812), non-repudiation (X.813), confidentiality (X.814), integrity (X.815), audit and alarms (X.816).
1. Philosophy and Scope
capOS is explicitly a research OS whose design principle is “schema-first typed capabilities, minimal kernel, reuse the Rust ecosystem.” Three consequences shape this proposal:
- The schema is part of the TCB. A bug in the
.capnpschema, or in the way generated code is patched forno_std, is exactly as dangerous as a bug in the kernel. The schema, thecapnpcbuild pipeline, and the generated code all need review attention – not only hand-written kernel code. - The kernel should stay small. “Everything else is a capability” means the TCB is naturally bounded. Verification effort scales with TCB size, so resisting kernel bloat is itself a security property.
- The interface is the permission. Access control lives in capnp method
definitions and in userspace cap wrappers (a narrow cap is a different
CapObject), not in kernel rights bitmasks. Review must confirm that the kernel never short-circuits this: no ambient authority, no method that bypassesCapObject::call, no syscall that exposes an object without a capability handle.
Non-goals:
- Full functional-correctness proof of the kernel à la seL4. Infeasible in Rust today, and the payoff is low for a research system whose surface area is still changing.
- Proving information-flow / confidentiality properties end-to-end.
- Certifying a specific configuration for external deployment.
2. Trust Boundaries and Threat Model
Enumerating the boundaries forces every future review to ask “which boundary does this change touch?” and picks out the code paths that matter.
TCB Statement
Current demo/proof TCB is broader than the target production TCB. Security claims must name which one they rely on.
Current demo/proof TCB:
- kernel, including scheduler, memory management, capability dispatch, endpoint IPC, in-kernel networking, smoltcp runtime, line discipline, Telnet IAC filtering, PCI/virtio-net smoke code, and kernel-owned DMA buffers;
capos-config, schema/codegen output, manifest validator, and checked-in generated bindings;capos-rtruntime transport, userspace entry/panic/allocator glue, and typed handle release behavior;- standalone
init,AuthorityBroker,SessionManager,CredentialStore, shell launcher, restricted launcher, and demo services used by the active manifest; - focused QEMU manifests, host harnesses, and build tools used to construct and validate each proof image;
- QEMU virtio devices and host-local loopback forwarding for networking proofs.
Target production TCB:
- kernel primitives that enforce address-space isolation, capability tables, generation/epoch checks, ring transport validation, scheduler/thread safety, interrupt/timer correctness, and explicit DMA/IOMMU policy;
- schema definitions, generated-code owner, shared ABI constants, and the build/signature path for production boot images;
- minimal init/supervisor authority needed to assemble the service graph, grant narrowed caps, restart services, and expose scoped status/audit;
- credential, session, broker, key-vault, audit, and remote-ingress services that directly decide authentication, authorization, disclosure, and key use;
- production device managers, network stack, and storage services only to the extent they hold the corresponding device, network, or persistence authority.
Target non-TCB components should include ordinary applications, untrusted service binaries, domain libraries without privileged caps, shell children, and network peers. The target is not reached while default networking runs in the kernel TCB, the focused Telnet terminal-hosting fixture still relies on kernel TCP terminal handoff, SSH uses fixture/dev key material, or remote shells share pre-auth and post-auth process authority.
Current boundaries
| Boundary | Who trusts whom | Code that enforces it |
|---|---|---|
| Ring 0 ↔ Ring 3 | kernel trusts nothing from user | kernel/src/mem/paging.rs, kernel/src/mem/validate.rs, arch/x86_64/syscall.rs; exercised by init/ and demos/* |
| Kernel ↔ user pointer | kernel validates address + PTE perms under the process VM lock | AddressSpace::validate_user_buffer, copy_from_user, copy_to_user, and legacy validate_user_buffer for current-CR3 diagnostics |
| Manifest ↔ kernel | kernel parses capnp manifest at boot | capos-config::manifest, called from kmain |
| Build inputs ↔ TCB | kernel trusts schema/codegen/build artifacts | schema/capos.capnp, build.rs, Cargo.lock, Makefile |
| Host tools ↔ filesystem/process | tools must not let manifest/config input escape intended host boundaries | tools/mkmanifest, generators, CI scripts |
| ELF bytes ↔ kernel | kernel parses user ELF to map segments | capos-lib::elf |
| User ring ↔ kernel dispatch | kernel trusts no SQ state | kernel/src/cap/ring.rs |
CapObject::call wire format | kernel trusts no params bytes | generated capnp decoders + impls |
| Process ↔ process IPC | kernel routes calls between mutually isolated address spaces and trusts neither side’s buffers | kernel/src/cap/endpoint.rs, kernel/src/cap/ring.rs, kernel/src/sched.rs |
| Device DMA ↔ physical memory | kernel and device-manager trust no userspace driver-supplied device address, stale DMA handle, or stale interrupt route | kernel/src/dma_backend.rs, kernel/src/device_dma.rs, kernel/src/device_manager/, and the DDF cap objects select a DMA backend at boot, expose manager-owned bounce-buffer handles when no trusted remapping domain exists, hide host physical addresses/IOVAs from userspace providers, and bind DeviceMmio/DMAPool/Interrupt lifecycle to generation-checked ownership ledgers. The QEMU Intel path has bounded per-device remapping evidence; current no-IOMMU cloud/GCE paths are brokered bounce-buffer authority and still do not claim hostile bus-master isolation. |
| WASI host adapter sandbox | userspace wasm-host runs untrusted Preview 1 payloads inside the vendored wasmi interpreter; capOS trusts no wasm import beyond the explicit grant set on HostState | capos-wasm/src/wasi/preview1.rs translates wasm calls into typed Console/Timer/BootPackage/EntropySource invocations; per-instance argv text grants and random_get against the kernel EntropySource cap honor manifest-declared scope. Ungranted Preview 1 calls return ERRNO_NOSYS rather than fabricating authority. The boundary surface today covers W.1-W.4 (substrate, stdout-only stubs, argv grant, random_get production wiring); wasi_args and entropy fills are bounded by WASI_ARGS_MAX_* and RANDOM_GET_MAX_BYTES. Filesystem, environment beyond argv, full clocks, and remaining Preview 1 surface remain un-implemented refusals. |
| POSIX adapter v0 substrate | libcapos-posix exposes a narrow fork-for-exec / pipe / socket / clock surface to C code; capOS trusts that the recording-shim window stays scoped to the synthetic child branch and that explicit grants pass through ProcessSpawner.spawn | libcapos-posix per-process static fd table, single-thread errno cell, kernel UdpSocket/Timer/Pipe clients, and the recording-shim Move-grant stdio_<N> path. The pseudo-child branch never calls _exit() on execve() failure; surface remains research/v0, not a full POSIX TCB. |
| Persistent config overlay ↔ init | init trusts no bytes from the on-disk system/config/overlay.bin; it validates the overlay version, SHA-256 content hash, the base manifest’s declared extension points (allowed service caps, max additional services, minOverlayEpoch, settings allowances), and base-pin non-collision before composing, and rejects the whole overlay (booting the base manifest floor) on any violation | capos-config::manifest (SystemConfigOverlay::from_capnp_bytes + compose_onto), init/src/main.rs apply_config_overlay; proof make run-installable-overlay |
| Hardware cap teardown audit | kernel must record every acquire, release, rollback-detach, Drop-detach-failure, explicit driver-crash trigger, reset/disable trigger, interrupt-waiter trigger, and explicit bounded proof-buffer free for the DDF caps so post-mortem review can correlate device-manager state with cap lifecycle | kernel/src/cap/hardware_audit.rs emit helper invoked from device_mmio.rs, interrupt.rs, dma_pool.rs, dma_buffer.rs, plus the devicemmio_grant_source.rs, dmapool_grant_source.rs, and interrupt_grant_source.rs userspace-grant rollback paths. The bounded DMAPool grant source emits DmaPool acquire for its manifest grant; DMAPool.allocateBuffer can mint one manager-attached proof DMABuffer result cap with its own acquire/free-buffer/release-after-free audit, while duplicate proof-buffer allocation and real DMA allocation remain blocked. Parent-first DMAPool release records a pending parent detach and completes after typed DMABuffer.freeBuffer frees the proof page, after cap release frees the proof page, or after successful DMABuffer driver-crash/reset-disable cleanup frees that page, preserving the one final DmaPool release audit. The real driver-crash teardown trigger entry points on DeviceMmioCap, InterruptCap, DmaPoolCap, and DmaBufferCap (device_manager::trigger_driver_crash_for_* plus each cap’s on_driver_crash) emit event=driver-crash exactly once per successful detach; stale rerun stays silent. The reset/disable trigger entry points on all four cap types (trigger_reset_disable_for_* plus each cap’s on_reset_disable) mirror that single-emit policy with event=reset-disable. The first cap-specific interrupt-waiter trigger on InterruptCap (trigger_interrupt_waiter_for_interrupt plus InterruptCap::on_interrupt_waiter) mirrors the same policy with exactly one event=interrupt-waiter audit record for the first successful detach. DMABuffer.freeBuffer emits exactly one event=free-buffer record on the successful explicit proof-buffer free, invalidates later DMABuffer.info, and leaves the later cap release as a no-op detach. The DMA pool reset path keeps the zero-live/quiesced/scrubbed evidence precondition, and the DMA buffer reset path reuses the bounded FreeBuffer page cleanup path before evidence-gated parent-pool cleanup. All explicit-trigger impls use the load-bearing exhaustive match outcome.detach_label() { "ok" => emit event, "noop" => silent, label => kprintln + DropDetachFailed } shape, so any future non-"ok"/non-"noop" outcome label still surfaces as a DropDetachFailed audit rather than being silently dropped. Each event is a cap-audit: key=value line on COM1 carrying the cap tag (interface id), event class, BDF, owner, and the relevant generation fields, and emit_cap_audit also appends it to a bounded volatile ring. HardwareAuditLog.snapshot exposes the latest retained records to userspace with drop-oldest retention while reporting volatile-only persistence, unsigned signatures, manifest-granted read-only snapshot access, production subscriber admission policy not implemented, and the volatile snapshot truncation contract; a QEMU-only local-ring proof asserts all four truncation labels without mutating the live ring. Durable storage, signing, and production subscriber admission remain future work. The legacy hardware-cap-release: line is retained alongside the audit line. |
Attacker model
- Untrusted service binaries. Today’s services are checked into the repo, but the manifest pipeline is meant to load arbitrary binaries eventually. Assume every byte of a service’s SQEs, params buffers, result buffer pointers, and return addresses is attacker-controlled.
- Untrusted manifest. Once manifests are produced outside the repo (e.g. generated from CUE fragments, passed in as a Limine module), the manifest parser must reject every malformed input without panicking.
- Resource exhaustion. Once multiple mutually-untrusting services run, a service can attack by filling rings, endpoint queues, capability tables, frame pools, scratch arenas, logs, or CPU time. Boundedness and accounting are security properties, not performance polish.
- Build input drift. The schema/codegen path is already part of the TCB. External build inputs such as the bootloader checkout, Rust dependencies, capnp code generation, and generated-code patching must be reproducible enough that review can tell what changed.
- Host tooling input. Build tools and generators run with developer/CI filesystem access. Treat manifest/config-derived paths and command arguments as untrusted until bounded to the intended directory and execution context.
- Residual state and disclosure. Kernel logs, returned buffers, recycled frames, endpoint scratch space, and generated artifacts must not expose kernel pointers, stale bytes from another process, secrets, or build-system paths that increase attacker leverage.
- Hostile interrupts / preemption. The scheduler preempts at arbitrary points. Any kernel invariant that is only transiently true must be held under the right lock or with interrupts disabled.
- Out of scope (for now): physical attacks, speculative-execution side channels, malicious hardware, IOMMU bypass from DMA devices. These become in-scope once the driver stack lands; revisit the threat model then.
Threat Actor Matrix
| Actor | Current scope | Current treatment | Production gate |
|---|---|---|---|
| Local physical attacker | Out of scope. | The prototype does not claim protection against physical memory access, bus probing, evil-maid boot replacement, cold boot, firmware compromise, or direct console access. | Secure/measured boot, sealed storage keys, physical console policy, and hardware-rooted attestation before production claims. |
| Malicious DMA device | Out of scope for hostile hardware; in scope only as confused userspace around cooperative QEMU virtio. | The virtio-net smoke assumes QEMU-provided cooperative virtio hardware and kernel-owned bounce buffers. Without an IOMMU, a bus-mastering device can DMA arbitrary RAM. | IOMMU-backed DMA domains or a documented hardware policy that forbids untrusted bus-mastering devices before userspace drivers or production hardware claims. |
| Malicious boot manifest | Partially in scope. | Manifest decoding/validation must fail closed and not panic. A manifest accepted by the kernel/init is still trusted to define the initial service graph and bootstrap grants. | Signed/authorized manifest policy, boot-package integrity, and review-visible payload hashes before accepting manifests from outside the repo or operator-controlled build path. |
| Compromised init/supervisor | Partially out of scope for current proofs. | Current demo TCB includes init and manifest-declared trusted services. If init is compromised, it can misgrant authority within the bootstrap service graph. | Minimize init, split supervisors, require narrow grant construction, audit graph changes, and make restart/update authority explicit. |
| Compromised service with narrow caps | In scope. | Address-space isolation, cap-table lookup, generation checks, ring validation, transfer checks, and resource ledgers should constrain it to granted authority. | Complete hostile smokes for transfer modes, resource exhaustion, panic surfaces, and revoke/epoch behavior per service class. |
| Hostile network peer | In scope only for loopback demo robustness, not production remote access. | Telnet is plaintext loopback-only. SSH gateway work is fixture/prototype status without complete encrypted transport, durable key/account storage, full OpenSSH userauth/channel handling, or complete audit gates. | Non-loopback remote shells stay blocked until SSH transport/auth/key/audit/storage gates pass and pre-auth/post-auth authority is isolated or otherwise proven constrained. |
| Hostile local web client of the remote-session-ui bridge | In scope. | Today’s bridge shares one upstream capOS session across all loopback HTTP clients with no per-browser session, missing-Origin short-circuit, and non-constant-time secret comparison. | Per-browser BrowserSession cookie, CSRF/Host/Content-Type guards, strict CSP with the matching inline-script/style refactor, constant-time comparators, rate limiting, and the carry-over Tauri capability-allowlist minimization, all per remote-session-ui-security-proposal.md. |
| Malicious build dependency or tool | Partially in scope. | Lockfiles, generated-code checks, pinned Cap’n Proto/Limine/docs tools, and dependency-policy checks make drift review-visible, but Rust nightly, QEMU/xorriso/OVMF, and final image hashes are not fully pinned. | Date/hash-pinned toolchains, recorded host tool versions, image/payload hashes, and reproducible production build path. |
ITU-T X.800 security-services completeness matrix
X.800 enumerates five security services; X.805 extends the list with availability and privacy. Each review of a proposal or kernel change should be able to say which service it touches, or that it touches none. The point is not to implement every cell — capOS explicitly defers some (end-to-end non-repudiation, for example) — it is to make gaps explicit.
| X.800/X.805 service | capOS surface that provides it |
|---|---|
| Authentication (peer entity, data origin) | user-identity-and-policy-proposal.md X.1254 LoA tiers; passkey + password credentials in boot-to-shell-proposal.md; certificate-based peer auth in certificates-and-tls-proposal.md (mTLS); future attestation in cryptography-and-key-management-proposal.md AttestationKeySource. |
| Access control | Structural: the capability model itself. The interface is the permission; wrapper caps attenuate; CapTable cannot be bypassed. Policy layer: AuthorityBroker (X.812 ADF) over CapObject::call (X.812 AEF). |
| Data confidentiality | Transport: certificates-and-tls-proposal.md TlsSocket. At rest: volume-encryption-proposal.md. In memory: address-space isolation + SMAP + SMEP. |
| Data integrity | Transport: TLS AEAD. At rest: authenticated block encryption (SymmetricAlgorithm.aes256GcmSiv etc.). Manifest/boot: signed manifests (storage-and-naming-proposal.md Open Q #5). In-transit schema: Cap’n Proto wire format + bounds-checked decoders. |
| Non-repudiation (origin, delivery) | Partial. Signed audit records (system-monitoring-proposal.md + cryptography-and-key-management-proposal.md audit key purpose). End-to-end non-repudiation for user actions is deferred until signed sessions exist. |
| Availability (X.805) | Resource ledgers, bounded rings, CAP_OP_RELEASE, supervisor restart policy, rate limiters on monitoring ingestion. DoS resistance is a review dimension, not a separate subsystem. |
| Privacy (X.805) | Principal pseudonymity (user-identity-and-policy-proposal.md pseudonymous profile), audit-record redaction, monitoring “payload capture is exceptional” default. |
The matrix is a checklist, not a claim of completeness: individual proposals remain authoritative about what they do and don’t provide.
3. Tiered Approach
Four tiers, cheapest first. Each tier is independently useful, and later tiers assume earlier ones are in place.
Tier 1 – Hygiene and CI (cheap, high value)
These are the controls that make every other tier work. The only checked-in
GitHub Actions workflow is .github/workflows/ci.yml; it runs formatting,
host tests, cargo build --features qemu, make capos-rt-check,
make generated-code-check, make dependency-policy-check, and
make workflow-check. The QEMU smoke job installs its own boot tools and
runs make plus make run, but remains non-blocking, so it is not yet a
required boot assertion. No separate clippy, miri, fuzz, or Kani workflow
files exist yet – those are scheduled per the track table below.
- Continuous integration via GitHub Actions (or equivalent). Current
baseline:
make fmt-check,cargo test-config,cargo test-ring-loom,cargo test-lib,cargo test-mkmanifest,cargo build --features qemu,make capos-rt-check,make generated-code-check,make dependency-policy-check, andmake workflow-check. Remaining CI work: treat QEMU boot as a required CI gate once runtime flakiness is acceptable, then add the security policy jobs below. cargo clippy --all-targets -- -D warningsacross workspace members, with a curated set ofclippy::pedantic/clippy::nurserylints that pay off for kernel code (clippy::undocumented_unsafe_blocks,clippy::missing_safety_doc,clippy::cast_possible_truncation, etc.). Do NOT enable all of pedantic blindly – review each lint and either enable it or add a rationale comment.cargo-denyfor license and advisory gating;cargo-auditfor the RustSec advisory DB againstCargo.lock. Dependencies includecapnp,spin,x86_64,limine,linked_list_allocator– all externally maintained.cargo-geigerreport of unsafe surface area per crate, checked in as a snapshot and diffed in CI so growth is visible in PRs.- Deny
unsafe_op_in_unsafe_fn(already required by edition 2024; make sure it stays on) andmissing_docson public kernel items where it is not already the case. - Dependency review discipline: every new dep needs a one-line
rationale in the commit message and a check that it is
no_std-capable, maintained, and does not pull in a surprise async runtime or heavy transitive graph. - No-std dependency rubric: kernel/no_std additions require an explicit
compatibility check that
core/allocpaths do not regress tostdthrough default feature drift, and class ownership is recorded againstdocs/trusted-build-inputs.md. - Boot/build input pinning: pin external bootloader/tool downloads to an auditable revision or checksum. Branch names are not enough for TCB inputs. CI should fail when generated capnp bindings or no-std patching change outside an intentional schema/codegen update.
- Untrusted-path panic audit:
panic!,assert!,.unwrap(), and.expect()are acceptable during bring-up, but every path reachable from manifest bytes, ELF bytes, SQEs, params buffers, result buffers, and future IPC messages needs either a fail-closed error or a documented halt policy. - Hardware protection smoke tests: boot under QEMU with SMEP/SMAP-capable CPU flags and assert CR4.SMEP/CR4.SMAP once paging is initialized. Every explicit user-memory dereference must be wrapped in a short STAC/CLAC window once SMAP is enabled.
Tier 2 – Targeted dynamic analysis
Aimed at the host-testable pure-logic crates (capos-lib, capos-config)
where the Rust toolchain just works. No kernel changes required.
- Miri on the
cargo test-libandcargo test-configsuites. Catches UB in pure-logic code: invalid pointer arithmetic, uninitialized reads, bad provenance, unsoundunsafe. The FrameBitmap and CapTable tests in particular push against slot indexing, generation counters, and raw&mut [u8]handling – exactly what miri is good at. proptest(orquickcheck) on:capos-lib::elf::parse– random bytes / random perturbations of a valid header must never panic and must refuse anything that isn’t a correctly formed user-half ELF64.capos-lib::frame_bitmap– interleaved sequences ofalloc,alloc_contiguous,free,mark_usedpreserve the invariantfree_count == popcount(bitmap == 0)and never double-free.capos-lib::cap_table– insert/remove/lookup sequences preserve “every returned id resolves to its insertion-time object, and stale ids are rejected.”capos-config::manifestencode/decode round trip on arbitrary manifests.- Schema round-trip tests in
capos-config/tests/: todayremote_capnp_rpc_dto_roundtrip.rspins the remote capnp-rpc DTO wire shape, andremote_paperclips_dto_roundtrip.rs(10 tests) pins the Remote Session Paperclips DTO wire shape ahead of the future gateway/worker/browser bridge that will marshal traffic through them. New shared-DTO families should land alongside similar round-trip coverage so schema drift is review-visible.
cargo fuzzharnesses (libFuzzer). The currentfuzz/fuzz_targets/set is seven targets:elf_parse.rs,manifest_capnp.rs,mkmanifest_json.rs,sqe_validation.rs(ring SQE wire validator viacapos_config::ring::sqe_wire_validation_error),telnet_filter.rs,telnet_filter_roundtrip.rs, andline_discipline.rs. The Telnet round-trip oracle exists alongside the structural Telnet filter target because the round-trip variant found a real EXOPL parsing bug (docs/changelog.md). These run outside CI (they never terminate) but have seed corpora underfuzz/corpus/and can be exercised in fixed budgets viamake fuzz-buildandmake fuzz-smoke.- Sanitizers on host tests:
make sanitizer-host-testsruns AddressSanitizer over thecapos-libandcapos-confighost suites under the repo-pinned nightly (zero findings to date). ASan is indeed cheap – it needs no-Zbuild-std. ThreadSanitizer (make sanitizer-host-tests-tsan) is wired but currently blocked by an upstream cargo-Zbuild-std+ build-script limitation when the sanitizer target equals the host triple; see Track S.17 for the recorded reproduction.
Tier 3 – Concurrency model checking
The capability ring is a lock-free single-producer / single-consumer protocol using volatile reads, release/acquire fences, and a shared head/ tail pair. It is the most likely source of subtle memory-ordering bugs and is also the most isolated – a perfect fit for model checking.
- Loom on a host-buildable wrapper of the ring protocol. Extract the
producer/consumer state machine from
capos-config::ringinto a form where atomics can be swapped forloom::sync::atomic, and write Loom tests that enumerate all interleavings of producer/consumer for small ring sizes (2–4 slots). Properties to check:- No CQE is lost.
- No CQE is double-delivered.
- The
sq_head/sq_tailandcq_head/cq_tailpointers never observe a state that impliestail - head > SQ_ENTRIES. - The userspace ring “corrupted producer state” fail-closed policy from prior review-finding task records holds under adversarial interleavings.
- Shuttle as a lighter alternative for regression-style tests once the specific bugs are known; cheaper per run, randomised rather than exhaustive. Good for long-running overnight jobs.
Loom coverage here is disproportionately valuable: it substitutes for the SMP-hardness work the project has explicitly deferred, and it exercises exactly the ordering that TOCTOU-style bugs hide in.
Tier 4 – Bounded verification of specific invariants
Not a full-kernel proof. Targeted, property-specific, one-module-at-a-time.
- Kani (bounded model checking for Rust, via CBMC). Good fit for
small, heap-free, arithmetic-heavy functions. Candidate modules:
capos-lib::cap_table– prove that for allinsert; remove; insert'sequences under au8generation counter, a stale CapId never resolves. Bound: table size ≤ 4, generation window ≤ 256.capos-lib::frame_bitmap– prove that for all bitmap sizes up to N bytes,alloc_framefollowed byfree_frameof the same frame restores the original bitmap andfree_count.capos-lib::elf::parsebounds checks: prove that every index into the program header table is< len, given the validatedphentsizeandphnum.
- Verus (SMT-based Rust verifier, active development at MSR) for invariants that Kani can’t handle ergonomically, particularly those involving loops and ghost state. Worth tracking but don’t commit to it yet – the proof-engineering cost is real, and the tool is still young. Revisit once IPC lands and the kernel has stable public APIs.
- Creusot / Prusti are alternatives in the same space. Do not
invest in more than one SMT-based verifier; pick whichever has the best
story for
no_std + alloccode when Tier 4 starts.
Deliberately out of scope: Isabelle/HOL, Coq proofs, Frama-C. They would require re-encoding Rust in a foreign semantic framework with no established Rust front-end mature enough for kernel code.
4. Security Review Process
REVIEW.md is the rules document and docs/tasks/** is the open remediation
and review-finding ledger. REVIEW.md contains the common security checklist
that applies across kernel, userspace, host tooling, generators, and CI. The
per-boundary prompts below are an expansion of that common checklist for
OS-specific code paths.
CWE/CAPEC tagging policy
Security findings should carry CWE metadata when the mapping is specific enough to help a reviewer or future audit. Do not force a CWE into every title.
- Prefer Base/Variant CWE IDs when the root cause is known: CWE-770 for unbounded allocation, CWE-88 for argument injection, CWE-367 for a concrete validation-to-use race, CWE-416 for a real use-after-free.
- Use Class IDs as temporary or umbrella labels: CWE-20 for “input was not validated enough” before the missing property is known; CWE-400 for general resource exhaustion only when the enabling mistake is not more precise.
- Use capability-kernel invariants instead of weak CWE mappings for design properties such as “no ambient authority”, “cap transfer happens exactly once”, “revocation cannot leave stale authority”, and “scheduling context donation cannot fabricate CPU authority”. Cite CWE-862/CWE-863 only when the issue is actually a missing or incorrect authorization check.
- Use CAPEC for the attacker pattern when useful: input manipulation, command injection, race exploitation, flooding, or path/file manipulation. CAPEC is not a substitute for the CWE root-cause tag.
Current checklist coverage:
| Area | Primary tags | Review intent |
|---|---|---|
| Structured input validation | CWE-20, CWE-1284–CWE-1289 when precise | Validate syntax, type, range, length, indexes, offsets, and cross-field consistency before privileged use |
| Filesystem paths | CWE-22, CWE-23, CWE-59 | Keep host-tool paths inside intended roots across absolute paths, traversal, symlinks, and file-type confusion |
| Commands/processes | CWE-78, CWE-88 | Avoid shell interpolation; constrain binaries and arguments |
| Numeric/buffer bounds | CWE-190, CWE-125, CWE-787 | Check arithmetic before pointer, slice, copy, ELF segment, and page-table use |
| Resource exhaustion | CWE-770 preferred; CWE-400 broad | Bound queues, allocations, retries, spin loops, frames, scratch arenas, cap slots, and CPU budget |
| Exceptional paths | CWE-703, CWE-754, CWE-755; CWE-248 only for uncaught exceptions | Fail closed on malformed or adversarial input; avoid trust-boundary panic/abort |
| Authorization/cap authority | CWE-862, CWE-863 plus capOS invariants | Verify capability ownership, generation, object identity, address-space ownership, and transfer policy |
| Concurrency/TOCTOU | CWE-362, CWE-367, CWE-667 | Preserve lock ordering, interrupt masking, page-table stability, and validation-to-use assumptions |
| Lifetime/reuse | CWE-416, CWE-664, CWE-672 | Prevent stale caps, stale kernel stacks, stale frames, and expired IPC state from being used |
| Disclosure/residual data | CWE-200, CWE-226 | Prevent logs, result buffers, frames, scratch arenas, and generated artifacts from leaking stale or sensitive data |
| Supply chain / generated TCB | capOS TCB invariant; use CWE only for concrete bug | Pin or review-visible drift for bootloader, dependencies, schema/codegen, generated code, and patching |
Per-boundary review checklist
- Syscall surface change (
arch/x86_64/syscall.rs):- Every register-passed argument is treated as attacker-controlled.
- No user pointer is dereferenced without an
AddressSpace-locked copy/read helper or an explicitly documented equivalent stability guarantee. - Numeric conversions, copy lengths, and pointer arithmetic are checked before constructing slices or entering any direct user-access scope.
- Kernel stack pointer and TSS.RSP0 invariants are preserved.
- The syscall count stays bounded; a new syscall has an SQE-opcode alternative considered and explicitly rejected with rationale.
- Ring dispatch change (
kernel/src/cap/ring.rs):- SQ bounds check and per-dispatch SQE limit still enforced.
- Corrupted SQ state fails closed (never re-processes the same bad state on the next tick).
- No allocation in the interrupt-driven path beyond what the owning task record or panic-surface inventory explicitly accepts.
- Result buffers and endpoint scratch buffers cannot leak stale bytes beyond the returned completion length.
- User buffer validation change (
kernel/src/mem/paging.rs,kernel/src/mem/validate.rs):- Address range check precedes PTE walk.
- PTE flags checked: present, user, and write (if the buffer is written).
- For process-owned buffers, validation and copy/read hold the process
AddressSpacemutex. Any current-CR3 validator caller must document its own page-table stability guarantee.
- ELF loader change (
capos-lib::elf):- Every field bounded before use (phentsize, phnum, p_offset, p_filesz, p_memsz, p_vaddr).
- Segments confined to the user half.
- Overlap check preserved.
- Integer arithmetic uses checked add/subtract before deriving mapped addresses, file slices, or zero-fill ranges.
- Manifest change (
capos-config::manifest):- Every optional field is either present or the service is rejected.
- Name / binary / cap source strings are length-bounded.
- Unknown / unsupported numbers in CUE input fail-closed with a path- specific error.
- Capability grants are checked as an authority graph before any rejected graph can start a service.
- Schema change (
schema/capos.capnp):- Backward-compatible with existing wire format, or migration documented.
- Every new method has an explicit capability-granting story (who mints the cap that lets this method be called?).
- Generated code
no_stdpatching still applies.
- Host tool or generator change (
tools/*,build.rs, CI scripts):- Manifest/config-derived paths cannot escape intended directories through absolute paths, traversal, symlinks, or file-type confusion.
- External command execution uses explicit binaries and argument APIs, not shell interpolation of untrusted strings.
- Generated outputs are review-visible and fail closed on malformed inputs.
- Generated files and diagnostics do not disclose secrets, absolute paths, or stale build outputs beyond what the developer intentionally requested.
- Unsafe block added or expanded: Tier 1 clippy lints plus REVIEW.md §“Unsafe Usage” checklist already cover this; the review should cite the specific invariant being maintained in the commit message.
Threat-model refresh
On every stage completion (Stage 6 IPC, Stage 7 SMP, first driver landing, first time a manifest comes from outside the repo), re-run §2 of this document and update it. The list of trust boundaries grows over time; the proposal decays if it doesn’t grow with the code.
Periodic full audit
Once per stage, schedule a focused audit pass:
- Re-verify every boundary’s code is still enforced at its documented entry point (no new bypass path).
- Re-run all Tier 2/3 jobs with the latest toolchain (catches tool-upgrade regressions).
- Walk through open review-finding task records and confirm each is still correctly classified (still open, fixed, explicitly accepted, blocked, or on-hold).
- Record the audit date and outcome in the relevant task records or a focused closeout task, matching the repository timestamp convention.
5. Concrete Verification Targets
Ordered by value and feasibility. Each one is a specific, bounded piece of work a contributor can pick up without needing to redesign the kernel.
| # | Target | Tier | Property | Blocker |
|---|---|---|---|---|
| 1 | capos-lib::cap_table | 4 (Kani) | Stale CapId never resolves after slot reuse within the generation window | None |
| 2 | capos-lib::frame_bitmap | 4 (Kani) | alloc/free preserve free_count invariant; no double-alloc | None |
| 3 | capos-lib::elf::parse | 2 (proptest + fuzz) | No panic on arbitrary input; only well-formed user-half ELF64 accepted | None |
| 4 | capos-config::manifest | 2 (proptest + fuzz) | Decode/encode round-trip; malformed input rejected without panic | None |
| 5 | Ring SPSC protocol | 3 (Loom) | No lost/doubled CQEs; fail-closed on corruption under all interleavings | Extract protocol into Loom-testable wrapper |
| 6 | AddressSpace user-buffer helpers | 4 (Kani) | Every accepted buffer lies entirely in user half with correct PTE flags, and validation/use happens under the address-space lock | Formalise PTE and locking model |
| 7 | Ring dispatch path | 3 (Loom + proptest) | SQE poll is bounded per tick; no allocation on the dispatch path | Initial alloc-free synchronous path landed; async transfer/release paths still need coverage |
| 8 | IPC routing | 3 | Capabilities transferred exactly once; no duplication under direct-switch | Capability transfer |
| 9 | Direct-switch IPC handoff | 2 + 3 | Scheduler invariants preserved when a blocked receiver bypasses normal run-queue order | Loom-testable scheduler/ring model |
| 10 | SMEP/SMAP + user access windows | 1 + QEMU integration | Kernel cannot execute user pages; direct user-memory touches either use audited access windows or the AddressSpace/HHDM copy path | Wire existing x86_64 helper into init path |
| 11 | Manifest authority graph | 2 (property tests) | Every granted cap source resolves, every export is unique, and no service starts after a rejected graph | Manifest executor path |
| 12 | Resource accounting | 2 + 3 | Rings, endpoints, cap tables, scratch arenas, frames, and CPU budget fail closed under exhaustion | Security Verification Track S.9 design complete; implementation hooks pending |
| 13 | Build/codegen TCB | 1 | Bootloader/deps/codegen inputs are pinned and generated output changes are review-visible | CI bootstrap |
| 14 | Device DMA boundary (future) | 1 + design review | No driver or device can DMA outside explicitly granted buffers | PCI/device work; IOMMU or bounce-buffer decision |
Targets 1–4 are feasible today and should be the first batch of work. Target 10 is the security gate before treating Stage 6 services as untrusted. Targets 11–12 should be designed before capability transfer lands, otherwise the first IPC implementation will bake in ambient resource authority. Target 14 gates user-mode or semi-trusted drivers.
Current status as of 2026-05-16:
- Targets 1–2 are part of the completed Verified Core visible milestone:
commit
d43b691at2026-04-23 22:09 UTCmademake kani-libthe bounded local/GitHub proof gate for cap-table and frame-bitmap invariants, and commitc5968eeat2026-04-23 22:12 UTCrecorded the high-memorymake kani-lib-fullCloud Build gate. - Target 3 has arbitrary-input proptest coverage and a cargo-fuzz target for ELF bytes. The current Kani harness still only proves the short-input early-reject path because fully symbolic ELF parsing reaches allocator and sort internals before there is a sharper proof obligation.
- Target 4 has cargo-fuzz coverage for manifest decoding/roundtrip and mkmanifest exported-JSON conversion.
- Target 5 has a feature-gated Loom model for the shared ring protocol.
- Target 13 has an initial CI baseline plus generated-code drift checking,
dependency audit/deny gates, and required QEMU boot still open. Remaining
supply-chain provenance work is tracked by
docs/tasks/trusted-build-inputs-pr-blocking-provenance.md; panic-surface hardening remains tracked by its owning task records across IPC/scheduler guarded unwraps, rollback restoration, stale queues, blocking waits, process/thread exit, endpoint cancellation, TLB shootdown send failures, and scheduler hot-path expects. Scheduler hot-path panic surface fully closed (2026-05-17, REVIEW_FINDINGS commit1b295cb3): all.expect()/.unwrap()inblock_current_on_cap_enter,next_start_context,schedule,exit_current,exit_current_thread,capos_block_current_syscall, andretain_endpoint_queuehardened per the established let-else + log + drop-lock +hcf()/return None/breakpattern (per-function closures at7f86796f/777e0b3a/0af439d4/7d93aea4/b04d6d65/2bea189c). - Out-of-band scheduler/runtime hazards tracked in review-finding task records
but not yet expressed as Concrete Verification Targets above: current
post-AP kernel upper-half page-table mutation through the MMIO/firmware
helper path is closed by kernel-wide TLB shootdown plus preseed/fail-closed
PML4-slot handling
(
../tasks/done/2026-06-07/kernel-upper-half-pml4-propagation-hardening.md); future helper windows or allocator-growth paths that need a new kernel-half PML4 slot still require boot preseed or synchronized live-root propagation. ParkSpace unmap/reuse cleanup still owes shared park-word cleanup and address-space generation cleanup; resource quota fields for scratch bytes, outstanding calls, endpoint queues, and in-flight calls need real wiring or removal. Each is owned by its respective subsystem proposal; the consolidated routing index lives indocs/design-risks-register.md.
6. Security Verification Track Registry
The S.x labels are registry identifiers for this proposal’s
security-verification track. They are not product stages and should be expanded
as “Security Verification Track S.x” when cited outside this proposal.
| Track | Name | Status | Primary document or evidence |
|---|---|---|---|
| S.1 | CI bootstrap | Landed 2026-04-21 | .github/workflows/ci.yml |
| S.2 | Miri + proptest on capos-lib | Landed 2026-04-21 | cargo test-lib, cargo miri-lib |
| S.3 | Manifest + mkmanifest fuzzing | Landed 2026-04-21 | fuzz/ manifest and mkmanifest targets |
| S.4 | Ring Loom harness | Landed 2026-04-21 | capos-config/tests/ring_loom.rs |
| S.5 | Kani on capos-lib | Initial landed 2026-04-21, expanded bounded gate landed 2026-04-23 | make kani-lib |
| S.6 | Security review docs stay aligned | Ongoing | REVIEW.md, CLAUDE.md |
| S.7 | Stage-6-aware refresh | Planned/ongoing | Trust-boundary inventory after Stage 6 changes |
| S.8 | Untrusted-service hardening gate | Planned | SMEP/SMAP, user access windows, hostile-userspace tests |
| S.9 | Authority graph and resource accounting | Landed 2026-04-21 | docs/authority-accounting-transfer-design.md |
| S.10 | Supply-chain and generated-code TCB | Partially landed | docs/trusted-build-inputs.md |
| S.11 | Device/DMA isolation gate | Design accepted; brokered-bounce DDF production authority gates landed for the current local/GCE path, while direct-remapping and hostile-hardware claims remain future | docs/dma-isolation-design.md |
| S.12 | Kani harness bounds refresh | Planned | Future transfer/accounting/user-buffer proof obligations |
| S.13 | ELF parser arbitrary-input coverage | Landed | capos-lib::elf::parse, fuzz/fuzz_targets/elf_parse.rs |
| S.14 | Telnet IAC filter fuzz coverage | Landed 2026-04-27 16:33 EEST | capos-lib::telnet, fuzz/fuzz_targets/telnet_filter.rs |
| S.15 | Telnet differential round-trip + line-discipline extraction | Landed 2026-04-27 17:18 EEST | capos-lib::line_discipline, Telnet round-trip fuzz target |
| S.16 | Ring SQE wire-validation extraction + fuzz target | Landed 2026-04-27 19:42 EEST | capos_config::ring::sqe_wire_validation_error, fuzz/fuzz_targets/sqe_validation.rs |
| S.17 | Sanitizers on host tests | ASan landed (zero findings); TSan blocked upstream | make sanitizer-host-tests / make sanitizer-host-tests-tsan |
Track Details
This slots into docs/tasks/README.md as a cross-cutting track rather than a phase – items are independent of Stage 6 IPC and can proceed in parallel.
Subtracks are scoped identifiers under their parent track:
| Subtrack | Parent | Name | Primary document or evidence |
|---|---|---|---|
| S.10.0 | S.10 | Trusted build input inventory | docs/trusted-build-inputs.md |
| S.10.2 | S.10 | Generated-code drift check | make generated-code-check |
| S.10.3 | S.10 | Dependency policy and no_std review gate | make dependency-policy-check, deny.toml |
| S.11.1 | S.11 | DMA capability invariants | docs/dma-isolation-design.md |
| S.11.2 | S.11 | Userspace-driver ownership-transition gate | docs/dma-isolation-design.md |
Security Verification Track S.11.2 defines checklist rows S.11.2.0 through
S.11.2.9 in docs/dma-isolation-design.md; those row labels are local
acceptance criteria for the userspace-driver transition, not independent
registry tracks.
- Track S.1 – CI bootstrap – landed 2026-04-21
.github/workflows/ci.yml: fmt-check, test-config, test-ring-loom, test-lib, test-mkmanifest,cargo build --features qemu,make capos-rt-check, generated-code drift checking, and dependency policy checking.- QEMU smoke installs
build-essential,capnproto,qemu-system-x86,xorriso, andcuev0.16.0 before runningmakeandmake run; it remains optional/non-blocking until boot runtime is stable enough to make it a required gate. - Clippy-with-deny and cargo-geiger remain future hardening jobs.
- Track S.2 – Miri + proptest on capos-lib – landed 2026-04-21
- Add
proptestdev-dependency tocapos-lib. - Host properties for
capos-lib::cap_tableandcapos-lib::frame_bitmap; ELF arbitrary-input coverage is tracked separately under landed Security Verification Track S.13. cargo test-libruns the native host suite;cargo miri-libruns the same crate under Miri.
- Add
- Track S.3 – Manifest + mkmanifest fuzzing – landed 2026-04-21
fuzz/crate with harnesses formanifest::decodeandtools/mkmanifestCUE → capnp pipeline. Seed corpus checked in.
- Track S.4 – Ring Loom harness – landed 2026-04-21
- Extract the SPSC protocol from
capos-config::ringinto a test-only wrapper where atomics are swappable. - Loom tests covering corruption, overflow, and ordering.
- Doubles as regression coverage for Phase 1.5 in docs/tasks/README.md.
- Extract the SPSC protocol from
- Track S.5 – Kani on capos-lib – initial harnesses landed 2026-04-21,
expanded bounded gate landed 2026-04-23
- CapTable generation/index/stale-reference invariants.
- FrameBitmap fail-closed free-error behavior plus a concrete bounded contiguous-allocation proof.
- Transfer/resource-accounting fail-closed invariants for cap-slot
preflight, frame-grant reservation, invalid transfer-origin rejection,
move-reservation rollback after revocation, source visibility/accounting
after the real
prepare_copy_transferpath, and provisional destination cap-slot/frame-grant ledger restoration. - Propagation of real prepared transfer metadata into a provisional
destination slot is reserved for
make kani-lib-full; Google Cloud Build run95b49620-06a5-49f4-85e6-782adb82d11cpassed this high-memory gate on 2026-04-23. - ELF parser short-input early-reject panic-freedom exists as a targeted Kani harness but is not part of the mandatory bounded gate.
- The current bounds are intentionally conservative so
make kani-libremains a practical local/GitHub CI gate; broader symbolic ELF and contiguous-allocation proofs should wait for more specific invariants or high-memory runners.
- Track S.6 – Security review docs stay aligned
- Keep REVIEW.md’s common security checklist aligned with §4’s boundary prompts as new boundaries land.
- Add a “threat model refresh” step to the stage-completion workflow in CLAUDE.md.
- Track S.7 – Stage-6-aware refresh
- Re-run §2 trust-boundary inventory after capability transfer/release semantics land.
- Plan Loom coverage for cross-process routing and direct-switch IPC.
- Carry the inventory through the active scheduler-evolution phases
(Phase D WFQ, Phase E
SchedulingContext, Phase F one-SQ-consumer and nohz telemetry) and the WASI host-adapter surface (Phase W.4 entropy production wiring + per-instance argv text grant) so each new boundary is reflected in §2 before it can be relied on. The WASI host adapter is a userspace trust boundary – wasmi sandbox around untrusted Preview 1 payloads with per-instanceEntropySource/ argv grants – that the Tier 2/3 plan should explicitly cover as new harness targets emerge (seedocs/proposals/wasi-host-adapter-proposal.md). - The Phase 1 monitoring log surface (
LogSink/LogReader,kernel/src/cap/log.rs) is a new kernel boundary: aLogSinkaccepts bounded userspace-supplied records (decoded, length-truncated, severity- filtered againstSystemConfig.logLevel) into a bounded drop-oldest ring, and a scopedLogReaderserves cursor/filtered snapshots. It confers no transfer/grant authority beyond the scoped sink/reader and adds no ambient log namespace. Carry it in §2 before downstream services rely on it; per-process log token-bucket backpressure remains future work (docs/proposals/system-monitoring-proposal.md).
- Track S.8 – Untrusted-service hardening gate
- Wire SMEP/SMAP enablement into x86_64 init after paging is live.
- Replace raw user-slice construction in syscall/ring paths with checked copy/access helpers that bracket the actual access with STAC/CLAC.
- Add QEMU hostile-userspace tests for bad pointers, kernel-half pointers, invalid caps, corrupted rings, and services without Console authority.
- Audit untrusted-input paths for panics before Stage 6 endpoints run mutually-untrusting processes.
- Track S.9 – Authority graph and resource accounting – landed 2026-04-21
- Concrete design is captured in
docs/authority-accounting-transfer-design.md. - Defines authority graph invariants, per-process quota ledger
(
cap slots,endpoint queue,outstanding calls,scratch,frame grants,log volume,CPU budget), diagnostic aggregation, and exactly-once transfer/rollback semantics. - Establishes acceptance criteria that gate capability transfer and
ProcessSpawner implementation. Current follow-up items live in
docs/backlog/stage-6-capability-semantics.md.
- Concrete design is captured in
- Track S.10 – Supply-chain and generated-code TCB
- Pin Limine and other external build inputs by revision/checksum rather than branch name.
- Make capnp generated-code changes review-visible in CI, including the no-std patching step.
- Consider
cargo-vetonly aftercargo-deny/cargo-auditare in place; vetting too early is process theater. - Security Verification Track S.10.3 adds a concrete dependency policy:
no_std additions are accepted only with class attribution,
cargo deny+cargo audit, and explicit lockfile intent. - Security Verification Track S.10.3 enforcement is
make dependency-policy-check, backed bydeny.tomland pinned CI installs ofcargo-deny 0.19.4andcargo-audit 0.22.1.
- Track S.11 – Device/DMA isolation gate
- The DMA isolation story is now runtime-selected and fail-closed: guest-programmable remapping only when capOS can discover, program, and validate it; otherwise labeled brokered bounce buffers or unsupported.
DMAPool,DeviceMmio, andInterruptinvariants are represented by done task evidence for bounded physical/device-visible ranges, explicit interrupt ownership, reset/release teardown, generation checks, and no raw host-physical grants to untrusted drivers.- The current GCP/no-IOMMU userspace-provider path is brokered bounce-buffer authority. It supports the proved virtio-net and NVMe provider chains without claiming direct DMA, IOVA export, hostile bus-master isolation, or device-autonomous MSI-X delivery.
- The DDF production-authority closeout closes the retained review finding for the current brokered-bounce provider path. Security Verification Track S.11.2 remains the canonical matrix for future direct-remapping/vIOMMU, hostile-hardware isolation, and broader device-owner claims.
- Track S.12 – Kani harness bounds refresh
- Revisit Kani bounds and harness shape once capability transfer,
resource-accounting, or
AddressSpaceuser-buffer helpers expose concrete proof obligations. - Prefer actionably narrow properties over arbitrary symbolic parser exploration that spends verifier time in allocator or sort internals.
- Revisit Kani bounds and harness shape once capability transfer,
resource-accounting, or
- Track S.13 – ELF parser arbitrary-input coverage – landed
capos-lib::elf::parsehas proptest coverage for arbitrary bytes and valid-header perturbations.fuzz/fuzz_targets/elf_parse.rsexercises ELF bytes through cargo-fuzz.
- Track S.14 – Telnet IAC filter fuzz coverage – landed 2026-04-27 16:33 EEST
- Extract the kernel’s
TelnetFilterbyte-stream parser intocapos-lib::telnetso it is host-fuzzable and survives the Phase C move of Telnet framing into userspace perdocs/proposals/networking-proposal.md. - Add
fuzz/fuzz_targets/telnet_filter.rswith structural assertions (Normal must pass non-IAC bytes through unchanged; AfterIac is the only state allowed to emit a 0xFF; emitted byte count never exceeds input length). - Wired into
make fuzz-buildandmake fuzz-smoke.
- Extract the kernel’s
- Track S.15 – Telnet differential round-trip + line-discipline extraction – landed 2026-04-27 17:18 EEST
- Add
fuzz/fuzz_targets/telnet_filter_roundtrip.rs: synthesize arbitrary RFC 854 event streams from fuzzer bytes, encode to wire, run throughTelnetFilter, assert output equals the concatenation ofData(_)payloads. Found a real EXOPL handling bug – the option byte right afterIAC SBwas being mis-parsed as the start of anIAC IACescape when its value was 0xFF, leaving the filter stuck in subnegotiation and silently dropping all subsequent data. Fixed via a newAfterSbstate that consumes the option byte unconditionally; pinned by a regression test incapos-lib::telnet. - Extract the cooked-mode line discipline from
kernel::cap::networkintocapos_lib::line_discipline::LineDiscipline, returningLineStep { outcome, echo }so all socket I/O stays at the caller. Addfuzz/fuzz_targets/line_discipline.rswith structural invariants (line_len <= max_bytes; ±1 line_len delta per Pending step; Cancelled clears; Echo::Byte/Backspace iff buffer grew/shrank by exactly one). - Future follow-up: differential against an external Telnet library (libtelnet C or Rust port) to catch RFC conformance bugs the structural targets cannot express.
- Add
- Track S.16 – Ring SQE wire-validation extraction + fuzz target – landed 2026-04-27 19:42 EEST
- Closes the original three-parser fuzz plan (
elf::parse,manifest::decode, ring SQE decoder). Lifts the per-opcode*_sqe_has_unsupported_fieldspredicates fromkernel/src/cap/ring.rsintocapos_config::ring, exposes a unifiedsqe_wire_validation_error(&CapSqe) -> Result<(), i32>entry point, and reroutes the kernel through the shared functions so the kernel-host pair has one source of truth for ABI rules. - Add
fuzz/fuzz_targets/sqe_validation.rs: cast arbitrary 64 bytes toCapSqe, runsqe_wire_validation_errorand the matching per-opcode predicate, assert determinism, opcode-classification consistency (CAP_OP_FINISH->CAP_ERR_UNSUPPORTED_OPCODE, unknown opcodes ->CAP_ERR_INVALID_REQUEST), and that the unified validator never disagrees with the predicate it dispatches to. Wired intomake fuzz-build/fuzz-smoke. - Add 12 host unit tests in
capos_config::ringcovering the classification rules each opcode imposes (THREAD_OWNED + call_id pairing on CALL/PARK, RETURN’s APPLICATION_EXCEPTION flag, CANCEL’s required pipeline_dep target, NOP’s reserved-fields-zero rule, PARK_BENCH’s required addr). - The structural fuzz target pins arbitrary-byte behavior. The follow-up
well-formed SQE generator oracle landed on 2026-06-06: the test/fuzz-only
sqe-validation-oraclefeature exposescapos_config::ring::sqe_oracle, which generates validator-accepted SQEs for each accepted opcode and one-field rejecting mutations, andfuzz/fuzz_targets/sqe_validation.rsruns that oracle on each input. This is a shared wire-validator oracle only; it does not claim cap-table lookup, userspace pointer mapping, transfer-descriptor loading, or full kernel ring semantic coverage. A future differential against an independent reference predicate remains a possible stronger disagreement oracle.
- Closes the original three-parser fuzz plan (
- Track S.17 – Sanitizers on host tests – ASan landed; TSan
blocked upstream
make sanitizer-host-testsrunsRUSTFLAGS=-Zsanitizer=addressover thecapos-libandcapos-confighost suites (crate set / features mirror thetest-lib/test-configaliases) on the repo-pinned nightly + host target. It is a focused gate, not part ofmake check, mirroringdependency-policy-check/sdk-publish-dry-run. Outcome so far: zero findings; both suites pass clean, including the namedunsafesuspects (FrameBitmap slot indexing, CapTable generation counters,lazy_bufferraw&mut [u8]). The §Tier 2 “cheap to add” claim holds for ASan, which needs no-Zbuild-std.make sanitizer-host-tests-tsanis wired but currently blocked by an upstream cargo limitation, not a capOS defect. TSan changes the crate ABI, so rustc refuses to link sanitized code against the uninstrumented precompiled std; instrumenting std needs-Zbuild-std, which fails with duplicatecorelang items for build-script-bearing dependencies (typenum / libc / cfg-if / subtle) when the sanitizer target equals the host triple. The exact reproduction (four attempted workarounds) is recorded indocs/backlog/security-verification.mdTrack S.17. Concurrency invariants are meanwhile covered by the dedicated Loom model (cargo test-ring-loom).- Done means: the ASan gate exists, runs under nightly, and any
findings either land as fixes or get a documented disposition; the
TSan target starts passing once the upstream
-Zbuild-std+ build-script issue is fixed.
Security Verification Tracks S.1 through S.5 have initial coverage. Track S.6 is ongoing doc hygiene and should move with review-process changes. Track S.8 must land before Stage 6 runs mutually-untrusting services. Track S.9 design is complete and now gates concrete implementation work in 3.6/5.2. Track S.11 gates device-driver work. Track S.12 should not expand bounds for their own sake; it is a refresh point when new kernel invariants make better proof targets available. Track S.13 closes the remaining target-3 gap from the table above.
7. What This Proposal Does Not Promise
- No claim that capOS will be “secure” at the end. It will be harder to write a silently wrong change to the code paths the tooling covers, and it will be easier to find the ones that are still wrong.
- No proof obligation on every PR. Kani and Loom are expensive to run on every push; CI runs them on a reduced schedule (e.g. nightly, or on PRs that touch the covered crates).
- Userspace and host-tool bugs are in scope, but their impact is classified by boundary. A userspace bug should not compromise kernel isolation; a host-tool bug can still compromise the build TCB or developer/CI filesystem.
- No claim that confidentiality is handled beyond architectural isolation. Timing channels, cache side channels, device side channels, and covert channels through shared services remain explicit research topics, not current implementation goals.
8. Relation to Other Docs
docs/research/sel4.md§1 and §6.1 already make the case that full verification is not the right goal. This proposal is the operational answer.REVIEW.mdis the reviewer’s rulebook. This proposal explains the security and verification rationale behind its common checklist and per-boundary prompts.docs/tasks/**is the open-issue ledger. This proposal feeds it – every bug found by Tier 2/3/4 tooling gets a task record unless fixed in the same change.docs/roadmap.mdowns the stages; this proposal does not add stages, only a cross-cutting track that runs alongside them.- Task records under
docs/tasks/own concrete ordering; Security Verification Tracks S.1–S.17 above are mirrored there when they are actionable slices. docs/design-risks-register.mdis the consolidated index of long-horizon design risks and open architectural questions; consult it when this proposal’s open gaps reference a hazard whose primary owner lives in a subsystem proposal, backlog, or design file rather than here.
DMA Assurance Model
Current DMA authority and isolation design authority lives in DMA Isolation. This proposal defines the accepted evidence model and is retained as the grounding record for DMA proof obligations.
The DMA assurance model is the evidence scaffold for moving capOS from bounded QEMU-local provider proofs toward cloud and production device-driver claims. It does not select a cloud DMA backend. It defines the claims that a backend must prove, the model objects those claims refer to, and the tools that should check each claim before a driver slice can cite it.
The immediate use is the cloud DMA backend decision: direct DMA through a reviewed remapping domain, labeled bounce buffers, or unsupported. The binding choice and any per-VM-shape safety claim remain attended decisions.
Claim Boundary
The model is about DMA authority, not whole-kernel correctness.
In scope:
- ownership of
Device,DMAPool,DMABuffer,Page,IommuDomain,Iova, descriptor, completion, and interrupt-route state; - lifecycle transitions from allocation through mapping, publication, completion, revocation, invalidation, scrub, and reuse;
- stale handle, stale completion, revoke/reset race, teardown-under-DMA, no-host-physical-exposure, and cross-domain aliasing claims;
- the evidence split between IOMMU-backed direct DMA and labeled bounce-buffer fallback.
Out of scope:
- proving all kernel behavior;
- proving cloud-provider hardware facts without attended evidence;
- treating QEMU Intel VT-d evidence as general hardware evidence;
- creating a new prover or proof kernel.
The capOS-specific layer may become a DSL later, but it must emit to mature checkers or proof assistants. A self-authenticating capOS prover would increase the trusted base and is not part of this plan.
Model Objects
The abstract model uses these terms consistently across docs, model files, and future proof harnesses:
| Object | Meaning |
|---|---|
Device | PCI function or provider device that can issue DMA. |
IommuDomain | Device-manager-owned translation context or trusted sharing group. |
DMAPool | Capability-scoped allocation authority for DMA buffers. |
DMABuffer | Live buffer handle with owner, slot, generation, and mapping state. |
Page | Physical backing page owned by the device manager or held fail-closed. |
Iova | Device-visible address meaningful only inside one domain. |
Descriptor | Device-visible command referencing a live buffer generation. |
Completion | Device or software observation that a descriptor finished. |
IrqRoute | Interrupt source, route generation, waiter, mask, and ack state. |
The first model files live under models/dma/. They are
small by design: reviewers should be able to read the whole state machine and
tell whether it matches the DMA design before any checker is involved.
Required Invariants
| Invariant | Required meaning |
|---|---|
| No host-physical exposure | Result caps, diagnostics, audit, and cloud evidence never expose a host physical address to a driver. IOMMU-backed paths may expose only a domain-scoped IOVA labeled with its domain. |
| Mapping before publication | A descriptor cannot become device-visible until the backing buffer is live, owned by the device manager, and either mapped in the selected IOMMU domain or copied through the selected bounce-buffer path. |
| No page reuse before teardown | A DMA page cannot return to the general free pool until submissions are stopped, in-flight descriptors are drained or invalidated, mappings are removed, required invalidations complete, and the page is scrubbed. |
| Stale handles fail closed | A stale pool, buffer, slot, page, source, route, or generation cannot create a new side effect. |
| Stale completions fail closed | A completion whose descriptor, buffer, slot, page, owner, or generation no longer matches cannot publish CQ state, ack IRQ state, free pages, or reuse buffers. |
| Domain-scoped aliasing only | The same IOVA may be reused in different domains, but one domain cannot map the same IOVA to two pages unless an explicit trusted sharing group model permits it. |
| Fail-closed leaks are bounded | If teardown cannot prove that hardware can no longer reach a page, the page or pool may be held, but that hold must be accounted, bounded, and surfaced as a remediation item. |
| Backend evidence is explicit | Direct DMA requires remapping-domain evidence. Bounce-buffer fallback must stay labeled as not hostile-hardware isolation. Unsupported devices stay disabled. |
Tool Mapping
The assurance model intentionally uses several narrow tools instead of one large proof.
| Tool | capOS use |
|---|---|
| TLA+ / TLC | Model lifecycle ordering and races: allocate, map, publish, complete, revoke, flush, scrub, reuse, reset, and fail-closed hold. The v0 skeleton is models/dma/dma_authority.tla. |
| Alloy | Model the relational authority graph: device, domain, IOVA, page, owner, and alias constraints. The v0 skeleton is models/dma/dma_authority.als. |
| Kani | Prove pure Rust validators and accounting helpers once they are extracted into host-checkable code: generation matching, budget arithmetic, stale rejection, and fail-closed transitions. |
| Loom | Cover concurrency-sensitive state that depends on atomics, queues, or multi-CPU ordering. The first target was the DeferredCompletionQueue / TLB-shootdown model gap now recorded in docs/tasks/done/2026-06-04/dma-assurance-model-deferred-completion-loom.md. |
| Verus | Candidate later tool for small critical Rust cores that need unbounded functional contracts and are stable enough to justify annotation cost. |
| HAMR / Microkit | Reference architecture for static component contracts and traceability, not a replacement runtime for capOS. Useful for comparing device-manager and driver partitioning assumptions. |
Do not claim a checked model result merely because the files exist. A checked claim requires recording the exact tool, version, configuration, model bounds, and command output in the task evidence.
V0 Gate
dma-assurance-model-v0 is complete when:
- this proposal defines the model objects, invariants, tool mapping, and claim boundaries;
models/dma/contains inspectable TLA+ and Alloy skeletons for the lifecycle and authority graph;- the cloud DMA backend draft task depends on this model before it can be promoted beyond proposal text;
- the verification workflow names these model files as planned design evidence, while making clear that no required checker gate exists yet;
- docs workflow and diff hygiene pass.
Future slices should add actual checker commands only after the repo has pinned
tool installation and run targets. Suggested future targets are
make model-dma-tla, make model-dma-alloy, make kani-dma-authority, and a
focused Loom target for DeferredCompletionQueue.
V1 Operationalization
dma-assurance-model-operationalization (2026-06-04) reconciles the v0
skeletons with the DMA authority code that landed after them and emits the
checker tracks as concrete task records, so the work cannot be silently parked
again. The reconciliation gap table — which invariants the skeletons already
capture and which landed-since invariants are MISSING — is recorded in
models/dma/README.md and grounded against named
symbols in kernel/src/device_dma.rs, kernel/src/cap/dma_buffer.rs,
kernel/src/device_manager/stub.rs,
kernel/src/cap/virtio_net_userspace_rx_dma_proof.rs, and
kernel/src/arch/x86_64/tlb.rs.
Landed-since invariants MISSING from the v0 skeletons: ownership-generation bump
on recycle, map-record-before-PTE-install ordering, drive-pin/quarantine, the
queue-enable epoch fence, and the deferred-EOI / completion-queue concurrency.
Each is owned by an emitted checker slice (each names its make target, pinned
tool + version, model bounds, and the exact invariant it checks, and each must
record checked output per the anti-overclaim rule above):
| Track | make target | Tool | Slice |
|---|---|---|---|
| Lifecycle ordering + generation + stale-completion | make model-dma-tla | TLA+/TLC (pinned; TLC-pin owner shared with the scheduler/IRQ model tracks) | dma-assurance-model-tla-checked-gate (done 2026-06-04, checked clean at 2/2/2/2, gen 0..1) |
| Device/domain/IOVA/page/alias authority graph + generation | make model-dma-alloy | Alloy (pinned 6.2.0; Alloy-pin owner) | dma-assurance-model-alloy-checked-gate (done 2026-06-04, checked for 4) |
| Extracted pure ownership-generation / stale-handle / no-re-expose core | make kani-dma-authority | Kani (pinned 0.67.0, kani-lib style) | dma-assurance-model-kani-authority-core (done 2026-06-04, 3 harnesses checked over capos_lib::dma_authority) |
| Deferred-EOI / completion-queue concurrency | new Loom target | Loom (test-ring-loom sibling) | dma-assurance-model-deferred-completion-loom |
CI wiring (make check / GitHub gate) + cite checked evidence | (wiring only) | — | dma-assurance-model-ci-wiring (done 2026-06-05) |
Cloud Backend Use
The cloud backend draft must cite this model and fill an evidence matrix for each backend candidate:
| Candidate | Required evidence before sign-off |
|---|---|
| Direct remapping domain | Cloud VM shape exposes guest-programmable remapping hardware; capOS can discover and program it; descriptor publication is ordered after mapping; teardown removes mappings and observes required invalidations before page reuse; hostile stale-DMA and stale-completion smokes cover the selected path. |
| Labeled bounce-buffer fallback | Direct DMA remains blocked; all device-visible addresses are manager-owned bounce pages; no host physical address is exposed; stale handle/completion/teardown evidence covers the selected fallback; documentation states that hostile bus-mastering hardware isolation is not claimed. |
| Unsupported | Device remains disabled or unbound; no driver-visible DMA, MMIO doorbell, interrupt ownership, or storage/network readiness claim is made. |
The matrix must distinguish provider-side isolation facts from guest-controlled isolation facts. SR-IOV, virtual NIC, GPU, accelerator, or local NVMe support is evidence that a VM exposes DMA-capable device surfaces, but it is not direct remapping evidence unless the guest also exposes an IOMMU or equivalent translation authority that capOS can program. Each VM-shape row should record the provider, region or zone, instance type, image and kernel, provider API or documentation source and date, live guest probe output, visible PCI/device drivers, visible IOMMU tables or groups, maintenance/revocation behavior, and the resulting backend classification.
The matrix is a support-policy input, not a hardcoded boot oracle. capOS should
infer the safest available backend at runtime from the device inventory,
remapping authority it can actually program, driver self-tests, and fail-closed
probe results. Unknown or contradictory observations select Unsupported, not
direct DMA. Provider evidence remains necessary for VM shapes the project wants
to advertise as supported, because a guest probe cannot fully prove host-side
provider isolation or maintenance behavior.
The matrix is an input to attended sign-off. It is not itself the sign-off.
Design Grounding
- DMA Isolation
- IOMMU Remapping Grounding
- seL4
- seL4 HAMR
docs/tasks/done/2026-05-23/ddf-iommu-remapping-production-closeout.mddocs/tasks/done/2026-05-23/ddf-provider-virtio-net-driver-closeout.md- TLA+ TLC model checker documentation: https://docs.tlapl.us/using%3Atlc%3Astart
- Alloy analyzer documentation: https://alloytools.org/faq/what_kind_of_analysis_does_the_alloy_analyzer_do.html
- Kani Rust verifier documentation: https://model-checking.github.io/kani/
- Loom crate documentation: https://docs.rs/loom/latest/loom/
- Verus guide: https://verus-lang.github.io/verus/guide/
Proposal: Device Manager Refactor
Before the current module split, kernel/src/device_manager.rs was the
convergence point for production device authority, transitional userspace cap
surfaces, QEMU proof harnesses, audit labels, DMA/MMIO/IRQ policy, and
serialization checks. That shape was useful while the Device Driver Foundation
gate was evolving, but it hid the target userspace-driver model behind a
single large file.
The refactor should keep the kernel device manager as the authoritative
ownership ledger for claimed devices while separating proof scaffolding and
domain-specific record logic into clearer modules. It must not weaken the
single ownership transaction across DMAPool, DMABuffer, DeviceMmio, and
Interrupt.
Implementation Status
The first mechanical proof split landed at 99c37592
(refactor(kernel): split device-manager proof scaffolding). Current main
keeps public proof wrapper functions in kernel/src/device_manager/mod.rs for
existing virtio.rs call sites, with the moved proof scaffolding in
kernel/src/device_manager/proofs.rs.
Later mechanical slices split handles/errors and domain record helpers into the
current kernel/src/device_manager/ module tree. The transaction-helper
cleanup also landed at 98dddb72 (device_manager: share authority admission helpers), with the aggregate PciDeviceRecord still serving as the single
claimed-device ledger and existing proof/audit labels preserved.
Design Grounding
- DMA Isolation Design requires one device-manager ledger of record for each claimed device before userspace NIC or block drivers receive hardware authority.
- Service Architecture makes init the
holder of
DeviceManagerandProcessSpawner, with child hardware drivers receiving only scoped device caps. - Networking defines the target NIC split: a
userspace NIC driver holds
DeviceMmio,Interrupt, andDMAPool, then exports aNiccap to a separate network stack. - Device Driver Foundation is the
active implementation track for the hardware authority gates that make this
refactor useful. The plan explicitly schedules this refactor as
high-priority DDF risk reduction subordinate to behavior-moving authority
slices, and any further split slice must be mechanical, behavior-preserving,
and reduce review risk for upcoming
DeviceMmio,Interrupt, orDMAPoolauthority work. - Pass AttachedDmaPoolRecord by reference
is the ready DDF prerequisite that converts the device-manager ledger record
from by-value to by-reference threading through the proof emission paths.
The current by-value layout exhausted the BSP boot stack when the inline
AttachedDmaPoolRecord::proof_buffersslot count was grown past three; the by-reference conversion unlocks further provider-TX descriptor concurrency without expanding the per-frame footprint of nested proof emissions.
No external prior-art report is required for the initial split: this is a repo-local maintainability refactor that preserves the existing accepted authority model rather than selecting a new OS design.
Module Shape
The accepted current shape has converted kernel/src/device_manager.rs into a
kernel/src/device_manager/ module tree:
kernel/src/device_manager/
mod.rs public API, re-exports, lock-order notes
handles.rs BDF, owner/state, and handle structs
error.rs production DeviceManagerError and display helpers
mmio.rs DeviceMmio records and map/unmap/read/write admission
dma_pool.rs DMAPool records, accounting, budget, teardown evidence
dma_buffer.rs DMABuffer records, map/free/submit/complete admission
interrupt.rs interrupt records and route/wait/ack/mask/unmask admission
proofs.rs transitional QEMU proof entry points and proof logs
Future cleanup is limited to optional registry, ledger, or proof-internal
splits if they reduce review risk for upcoming DDF work. The current accepted
proof split is proofs.rs, not a proofs/ directory.
PciDeviceRecord should remain the aggregate owner of a claimed device’s
ledger. The split should move record-specific logic behind modules, not create
independent managers that can diverge during teardown.
Follow-Up Risk Reduction
Two adjacent tracks reduce review risk for further DDF proof growth without disturbing the accepted module shape:
- The by-reference ledger-record threading prerequisite tracked in
docs/tasks/ddf-attached-dmapool-record-by-ref.md
converts
AttachedDmaPoolRecordfrom by-value to by-reference through the affected proof emission paths so that growing inline proof slot counts no longer multiplies cumulative stack frames across nested proof calls. - The scheduler off-stack release work that landed under
d322a78f(sched: make thread stack drops off-stack explicit) and9b94ea7f(sched: release qemu proof stacks off-stack) already pulls QEMU release-proof process kernel stacks off the dropping thread, which removes one stack-pressure axis on the BSP boot path that previously interacted with the device-manager proof emissions.
Refactor Strategy
-
Split proof code first. Landed at
99c37592.prove_qemu_*, proof log structs, proof-only error enums, and bounded proof helper functions moved into a proof module. Wrapper functions remain inmod.rsso currentvirtio.rscall sites do not churn. -
Split handles and errors. Landed at
734383f9.PciBdf, owner/state enums, handle structs,DeviceMmioRegion, andDeviceManagerErrormoved into dedicated modules. -
Split record domains. Landed at
af539f6c. MMIO, DMA pool, DMA buffer, and interrupt attached-record logic moved into domain modules whilePciDeviceRecordremains the aggregate ledger owner. -
Preserve one authoritative ledger. Every operation that creates, consumes, or releases device-visible authority must still update the claimed-device ledger as part of the same ownership transaction that changes device-manager state.
-
Improve internal APIs after the split. Landed at
98dddb72. Narrow transaction helpers and typed admission contexts now remove repeated stale-handle, owner, generation, state, and attached-record checks while preserving the single aggregate ledger.
Constraints
- Preserve the existing lock order:
PCI_DEVICE_MANAGERbeforeDEVICE_INTERRUPT_ROUTES. - Preserve cap semantics, audit labels, proof labels, and QEMU smoke output during the initial split.
- Keep userspace-driver authority blocked until the Device Driver Foundation gates still marked open are closed.
- Avoid broad call-site churn. Compatibility wrappers are acceptable during the mechanical phase.
- Do not move authority decisions into userspace. Userspace drivers receive scoped caps, but the kernel remains the ledger and enforcement point.
- Keep proof code available until userspace-driver production gates have equivalent coverage.
Validation
For mechanical file movement, run:
make fmt-checkcargo build --features qemumake workflow-check
When a slice moves code that emits or validates device proof labels, also run the affected QEMU gates:
make run-netmake run-ddf-provider-consumermake run-devicemmio-grantmake run-dmapool-grantmake run-interrupt-grant- relevant
make run-hardware-audit*targets when audit or proof labels move
Choose those gates from the moved authority surface, not from the file move alone:
- proof-log or proof-label movement needs the QEMU target that asserts those exact proof lines;
- grant-source or cap-object movement needs the matching
run-*-granttarget plus any parent lifecycle target it depends on; - audit emission, snapshot decode, or audit-label movement needs the matching
run-hardware-audit*target; DMABuffer,DeviceMmio,Interrupt, selected provider TX, proof labels, or schema-comment movement for those surfaces needsmake run-ddf-provider-consumer;- MMIO, DMA, IRQ, or teardown transaction movement needs the focused grant target and the broader device proof target that exercises stale handle, revoke/reset, or release behavior;
- pure type, handle, or error-module movement may stop at
make fmt-check,cargo build --features qemu, andmake workflow-checkonly when the public diff leaves proof labels, grant behavior, and authority transactions unchanged.
Success Criteria
device_manageris a module tree rather than one monolithic source file.- Production authority paths are visibly separated from QEMU proof scaffolding.
- Public behavior and existing proof/audit labels are unchanged by the initial split.
- The module boundaries match the target userspace-driver design: kernel code owns claim, revoke, teardown, MMIO, DMA, and IRQ authority; userspace drivers consume only scoped capabilities.
Cloud Driver Foundation: Gap Analysis
Premise Correction
A prior framing held that “capOS has no userspace device-driver foundation.” That is wrong. The userspace virtio driver foundation exists and is proven in QEMU across a month of landed DDF work. This document establishes precisely what the foundation covers and reduces each blocked cloud-driver task to its narrow real remaining gap, so no one re-implements a foundation that already exists.
What The Foundation Already Provides (proven, in docs/tasks/done/)
- Device-agnostic virtio DMA/notify seam + relocated queue/discovery
(
ddf-virtio-driver-foundation-boundary, 2026-05-25). The split-ringVirtqueueanddiscover_modern_transportlive inkernel/src/virtio.rs mod transport, driven through theVirtqueueDmaseam (preflight/register/allocate/free/record-submission/record-completion over thedevice_dmaledger). virtio-net is one caller of the seam, not the only possible caller – a non-net virtio device (e.g. virtio-blk) can drive the same bounded ledger semantics. Proofs:make run-net,make run-ddf-provider-consumer. - Userspace provider owns the selected virtio-net TX queue end-to-end
(
ddf-provider-virtio-net-driver-closeout, 2026-05-23). A userspace process publishes real selected-queue TX descriptors, rings the doorbell through aDeviceMmionotify-write claim, consumes the TX used-ring completion, and exposes CQ identity – all through user-modeDMAPool/DMABuffer/DeviceMmio/Interruptauthority, with no silent fallback to the in-kernel virtio-net TX helper while the provider owns TX. RX is bounded synthetic-token CQ identity (kernel RX cohabitation explicit). DMA backend is manager-owned bounce buffers. - Manager-granted provider/consumer authority lifecycle
(
ddf-userspace-driver-provider-consumer, 2026-05-11). A userspace provider consumes manifest-granted DMAPool/DeviceMmio/Interrupt authority; stale-authority rejection, revoke, and release/reset/driver-death teardown are proven. - GCP virtio-net function bound through the gate locally in QEMU
(
cloud-gcp-virtio-net-local-qemu-binding, 2026-05-26). The enumerated/bound function matches the documented GCP 1st/2nd-gen virtio-net surface (vendor0x1af4), the resolved DMA backend is the labeled bounce-buffer path, proven bymake run-netandmake run-ddf-provider-consumer. - DMA backend selection (
cloud-dma-backend-selection, 2026-05-24): boot probe -> fail-closed select -> manifest override; GCE resolves to bounce-buffer. - Production IOMMU remapping closeout (
ddf-iommu-remapping-production-closeout, 2026-05-23): the direct-remapping domain path for IOMMU shapes (make run-iommu-remapping). - First
BlockDeviceCapObject (ddf-blockdevice-boundary-virtio-blk-smoke, 2026-05-25): a bounded sector write/read-back over virtio-blk (make run-virtio-blk). Note: thisBlockDeviceis kernel-side, over manager-owned bounce buffers – it is not a userspace storage provider.
Boundary Of The Foundation (where userspace ownership stops today)
- NIC: userspace owns virtio-net TX; RX is synthetic/cohabited. No live hardware RX used-ring ownership, no direct DMA/IOMMU on the provider path, no cloud enumeration.
- Storage: there is no userspace storage provider of any device class. The
BlockDevicecap is kernel-side; NVMe is metadata-only (kernel/src/pci.rsenumerates the controller and emits ano-authority/ no-driver ... controller_init=not-startedline, no register/queue/IDENTIFY/IO code). The NIC userspace driver does not transfer to storage: NVMe is a different device class (admin/IO submission+completion queue pairs, doorbells, PRP/SGL), and even userspace virtio-blk/virtio-scsi has no provider driver – the foundation seam makes it possible, but no slice has built it. - Production grant sources stage an arbitrary function through one
device-agnostic entry point (done 2026-05-30). The non-
qemu{dmapool,devicemmio,interrupt}_grant_source_prodstatics previously inferred their candidate function from a hardcoded selection rule narrowed by#[cfg(feature = "cloud_*")]blocks scattered through eachpick_candidatebody.cloud-prod-grant-source-despecializationreplaced that with onestage_with_classentry point per source that takes an explicitProdGrantClassdevice-class descriptor (cap::prod_grant_source_class):AnyFunction(plain BAR / first usable function),DmaCapable(virtio or NVMe), orNvmeController(NVMe only); the DeviceMmio source additionally takes the explicit mapped-window length (one page for the plain/virtio-net notify family, two pages for the NVMeCC/admin-register selected-write region). The no-arginit()wrappers select the build’s descriptor and delegate, so a non-virtio-net function is staged by passing the matching descriptor rather than by reaching virtio-net-specific code. The transitional in-kernelqemu-path grant sources still carry the per-functioninit_*_for_device/init_provider_*variants; those follow the virtio transport into userspace under Phase C of the networking proposal rather than through this slice.
Per-Task Gap (the narrow real Y)
cloud-gcp-virtio-net-nic-driver -> runnable-now claim is superseded
The 2026-05-27 version of this document concluded that the GCP virtio-net live
driver task was runnable as a cloud-evidence slice. That conclusion is now
stale. The local production cloudboot bind markers have landed, but
cloud-prod-provider-nic-bound-local-proof deliberately settled its completion
boundary with a kernel-side dispatch-slot proxy because the production
userspace-provider grant/waiter surface is still not available in the
non-qemu cloudboot build.
The current local production chain is therefore still implementation work, not
just billable evidence capture:
The cloud-prod-provider-devicemmio-grant-source-local-proof,
cloud-prod-provider-dmapool-grant-source-local-proof, and
cloud-prod-provider-interrupt-grant-source-local-proof children are done
(2026-05-28): the non-qemu cloudboot kernel can deliver DeviceMmio,
DMAPool, and Interrupt grants to small userspace provider services through
manifest/process-spawner delivery, each with its own local-QEMU proof and
bounded caveats. The aggregate docs-status closeout
cloud-prod-provider-grant-surface-local-proof is also done (2026-05-28):
it records those landed children as one provider grant-surface boundary
without adding new behavior. The remaining local production work is
cloud-prod-provider-cap-waiter-local-proof, then
cloud-prod-virtio-net-userspace-provider-local-proof (and the brokered NVMe
sibling). Only after those local production userspace-provider tasks land does
the live-GCE NIC task reduce to a cloud evidence/harness run.
The access and spend corrections still stand: GCE access is provisioned and the operator authorized billable runs on 2026-05-27. The blocker is local production userspace-provider authority, not cloud access.
Storage tasks -> gap is a userspace NVMe-class storage provider
cloud-gcp-storage-driver, cloud-gcp-storage-local-qemu-binding,
cloud-aws-nvme-storage-driver, cloud-azure-disk-storage-driver all reduce to
the same genuine missing piece: a userspace storage provider driver. virtio-net
TX ownership does not carry to storage. Two real sub-gaps:
-
No userspace storage provider driver. Either (a) a userspace virtio-blk/ virtio-scsi provider over the existing virtio seam (the kernel
BlockDeviceis kernel-side and does not satisfy the “no hidden kernel DMA ownership” acceptance), or (b) a userspace NVMe-class driver (controller bring-up + admin/ IO queue pairs + doorbells + PRP DMA) over the bounce-buffer/IOMMU backend. NVMe is the strategic target: GCP 3rd-gen+, AWS Nitro EBS, and Azure Boost are all NVMe, so one NVMe foundation unblocks all three providers’ storage legs. -
The no-IOMMU
run-pci-nvmeproof gate and the DMA-address ownership model. A real provider-driven NVMe completion + “no hidden kernel DMA ownership” + “no host-physical exposure” must all hold under the no-IOMMU bounce-buffer shape. The 2026-05-27 Model B override (provider writes queue-base/PRP addresses, kernel validates on notify) does not satisfy those constraints on the current no-IOMMU gate: device-visible equals host physical, and reviewed IOVA export discipline intentionally returns no usable device address to userspace.The correction is to split the lanes. Model B remains valid for a verified direct-remapping/vIOMMU gate, or a future synthetic address namespace translated by trusted code. The GCP/no-IOMMU lane must use brokered bounce: the provider owns NVMe protocol state and buffer/command capabilities, while the kernel or device manager materializes
ASQ/ACQ, I/O queue-base, and PRP/SGL device-visible fields from the liveDMAPoolledger. That is the only current path that preserves no-host-physical-exposure on GCP.
The ordered NVMe work therefore splits into:
- no-IOMMU brokered lane:
nvme-no-iommu-brokered-controller-enable(landed 2026-05-27 21:38 UTC, commit11b86568) ->nvme-admin-queue-identify(landed 2026-05-27 22:34 UTC, commitcede5257) ->nvme-admin-interrupt-delivery(landed 2026-05-27 23:07 UTC, commit18fd25c7) ->nvme-io-queue-and-read(ready brokered I/O/read); - direct-remapping lane:
nvme-doorbell-dma-validator(landed mechanism) -> provider-written enable/admin/I/O slices on a verified IOMMU/vIOMMU gate.
Those are the real storage Y for the NVMe path; the virtio-scsi path is an alternative userspace provider of comparable size. None of this is “build a foundation” – it is “build a storage device-class provider on the existing foundation.”
AWS / Azure storage -> consume the GCP NVMe foundation + provider delta
cloud-aws-nvme-storage-driver and cloud-azure-disk-storage-driver already
re-scope themselves to a small provider delta once the shared NVMe foundation
lands. No new driver decomposition; their blocked-until is the GCP NVMe child
chain. Their AWS/Azure NIC siblings (ENA, MANA) are vendor-custom and out of GCP-first scope.
What This Document Changes
- Supersedes the
cloud-gcp-virtio-net-nic-driverrunnable-now claim. The QEMU userspace virtio foundation remains useful grounding, but the live GCP NIC task stays blocked until the local production userspace-provider grant-source, waiter, and userspace virtio-net provider chain lands. - Decomposes the storage gap GCP-first into a no-IOMMU brokered-bounce userspace NVMe lane for GCP and a separate direct-remapping Model B lane for IOMMU/vIOMMU proofs.
- Re-points AWS/Azure storage at the GCP NVMe child chain.
Design Grounding
docs/tasks/done/2026-05-25/ddf-virtio-driver-foundation-boundary.mddocs/tasks/done/2026-05-23/ddf-provider-virtio-net-driver-closeout.mddocs/tasks/done/2026-05-11/ddf-userspace-driver-provider-consumer.mddocs/tasks/done/2026-05-26/cloud-gcp-virtio-net-local-qemu-binding.mddocs/tasks/done/2026-05-25/ddf-blockdevice-boundary-virtio-blk-smoke.mddocs/tasks/done/2026-05-24/cloud-dma-backend-selection.mddocs/tasks/done/2026-05-23/ddf-iommu-remapping-production-closeout.mddocs/proposals/nvme-model-b-doorbell-dma-validator.md(conditional Model B validator for direct-remapping/synthetic-address lanes)docs/research/dma-userspace-driver-isolation.mddocs/dma-isolation-design.md(Cloud DMA Backend; IOVA export discipline)kernel/src/virtio.rs(transport::VirtqueueDma,transport::Virtqueue),kernel/src/cap/{dma_pool,dma_buffer,device_mmio,interrupt,block_device}.rs,kernel/src/device_dma.rs,kernel/src/device_interrupt.rs,kernel/src/pci.rsdocs/proposals/cloud-deployment-proposal.md,docs/backlog/hardware-boot-storage.md#cloud-device-tracks
NVMe Userspace Provider: Conditional Model B Doorbell/Notify DMA Validator
Operator Decision (2026-05-27)
The userspace NVMe-class storage provider
(docs/proposals/cloud-driver-foundation-gap-analysis.md, NVMe child chain)
selected Model B: provider-writes-everything, kernel-validates-on-notify for
the direct-remapping userspace-driver lane. This was intended to override the
kernel-mints-the-address model (Model A) that the gap analysis originally
recommended for the storage chain and that the landed virtio-net TX provider
uses.
The operator’s stated reason: capOS wants the genuine userspace-driver model, where the driver process — not the kernel — owns and writes the device-visible addresses it programs into the controller. Model A keeps device-address minting inside the kernel, which is safe but is not a real userspace driver: the provider only places a value the kernel already chose. Model B makes the provider a first-class driver and moves the kernel from address-author to address-validator.
Correction recorded later on 2026-05-27: Model B cannot be used on the current
no-IOMMU run-pci-nvme or probed GCP bounce-buffer path without exporting host
physical addresses to userspace. It remains valid for a verified
direct-remapping/vIOMMU lane, or for a future synthetic device-address namespace
that the manager translates before hardware sees it. The GCP/no-IOMMU path must
use brokered bounce address publication instead.
This is a design-and-task slice only. The landed nvme-doorbell-dma-validator
mechanism remains the direct-remapping/synthetic-address validator component;
the no-IOMMU controller-enable work is re-planned as a brokered-bounce slice.
Model A vs Model B
| Dimension | Brokered address publication (kernel/device-manager materializes) | Model B (provider-writes, kernel-validates) |
|---|---|---|
| Who writes the device-visible address | Kernel or device manager writes queue-base/PRP/SGL values from live buffer authority. Provider submits typed requests or places opaque kernel-authored values only when that is safe. | Provider writes the device-visible address itself into ASQ/ACQ/SQ/CQ bases and PRP/SGL entries. |
| Kernel role | Author of every device address; trivially correct by construction; no scan needed. | Validator: on each doorbell/notify, scan the submitted descriptors/queue-base registers and reject any address outside the owner’s granted DMA window. |
| New kernel component | None. | A ring/queue-scan on-notify DMA validator (this proposal). |
| Driver authenticity | Provider owns protocol choices but not raw device-address authorship. This is required when device-visible equals host physical. | Provider is a real driver that owns its addresses. |
| Where it applies | No-IOMMU brokered-bounce paths, including probed GCP shapes and the current no-IOMMU run-pci-nvme gate. | Verified direct-remapping/vIOMMU paths, or a future synthetic address namespace. |
The two models coexist. The existing virtio-net TX path keeps brokered/kernel-
authored device addresses. The NVMe validator is retained for lanes where
provider-written addresses are not host physical addresses. A DeviceMmio
doorbell claim must declare which model is active; no-IOMMU claims must not
accept provider-authored raw device addresses.
What Model B Requires: the On-Notify DMA Validator
The validator is a kernel component invoked on the doorbell/notify path of
the NVMe provider’s DeviceMmio selected-write claim. Before the doorbell write
reaches the device (i.e. before the controller can fetch the just-submitted
descriptors or act on a just-programmed queue base), the kernel scans the
device-visible addresses the provider wrote and fails closed if any address
is not inside that owner’s granted DMA window(s).
Scan targets (what the validator reads)
- Queue-base registers, scanned when the doorbell/notify that arms a queue
is rung (or on the controller-enable /
CC.ENwrite that activates the admin queue):ASQ,ACQ, and the I/OSQ/CQbase addresses the provider programmed through its selected-writeDeviceMmioclaim. - Submission-queue entries newly made visible by an SQ tail doorbell: the PRP1/PRP2 entries (and, where used, the PRP list pages and SGL descriptors) of each NVMe command between the last validated tail and the new tail. The validator follows one level of PRP-list indirection; deeper SGL/PRP-chain shapes are out of scope for the bounded proof and are rejected, not silently accepted.
The validator scans only on notify — not on every provider memory write. The provider may freely write into its own mapped DMA pages between doorbells; nothing device-reachable happens until a doorbell rings, and that is the single choke point the kernel guards. This bounds the validation cost to the descriptors a single doorbell newly publishes (one queue entry for a depth-1 admin proof, a small bounded batch otherwise), not to the whole address space.
Invariants (fail-closed on any violation)
- Bounds. Every scanned device-visible address, and the full extent of the region it names (queue size × entry size for a queue base; transfer length for a PRP/SGL data pointer), must lie wholly within a DMA window granted to the owning provider. An address at the window edge whose region runs past the window end fails closed. Unaligned queue-base or PRP addresses (NVMe requires page-aligned PRP1 for the first entry, dword-aligned queue bases) fail closed.
- Owner-scoping. The window set checked is exactly the set granted to the
provider that owns the
DeviceMmiodoorbell claim being rung. An address that is valid for another owner’s window is rejected for this owner: no aliasing into a different owner’s DMA region, no host-physical address, no out-of-any-window address. The validator resolves “owner” from the doorbell claim’s grant identity, not from the address value. - No host-physical / no out-of-window. The provider-written value must be a domain-scoped IOVA or synthetic device address, never a host physical address. On the current no-IOMMU bounce path this invariant cannot be satisfied by provider-authored queue-base/PRP values, because device-visible equals host physical and userspace export is disabled.
- Stale-completion / generation. The validator binds its accept decision to the live grant generation of the owner’s DMA window and doorbell claim. A doorbell rung after revoke/reset/regrant against a stale generation fails closed even if the byte value would have been in-window for the prior grant. Completions are accepted only against the issue/generation that was live at submission scan time, matching the existing stale-completion gate on the virtio-net path; a completion whose submission was never validated (or was validated under a now-retired generation) does not wake a waiter.
- On-notify timing. The scan completes and either accepts or rejects
before the doorbell write is allowed to take effect on the device. A
rejected scan does not write the doorbell, returns a fail-closed error to the
provider’s
DeviceMmiowrite, and records the rejection; the device never sees the descriptor batch. There is no window in which the controller can fetch an unvalidated descriptor. - Quiesce/teardown. On release/reset/driver-death, in-flight doorbell scans are quiesced, the owner’s windows are removed from the validator’s accepted set, backing pages are scrubbed before frame reuse, and any subsequently rung doorbell against the retired grant fails closed.
Where it hooks
The validator hooks the NVMe provider’s selected-write DeviceMmio doorbell
claim in the kernel capability layer — the same selected-write claim the
bring-up slice scopes to the NVMe enable/admin-queue-base/doorbell registers
(mirroring the virtio-net notify-write claim). Concretely:
- The doorbell/queue-base
DeviceMmio.write*path (kernel/src/cap/device_mmio.rs) gains a pre-write validation step for the NVMe doorbell/queue-arm register subset. - The scan reads the provider’s mapped SQ pages and queue-base register shadow
through the manager-owned DMA window records
(
kernel/src/device_dma.rs), checking containment against the owner’s granted window descriptors. It does not gain a generic memory-read authority over the provider; it reads only the descriptor/queue-base bytes the doorbell newly publishes, via the manager’s record of the owner’s DMA pages. - Generation/owner identity comes from the grant ledger
(
kernel/src/device_dma.rs/ the*_grant_sourcerecords), not from provider-supplied metadata.
This is a kernel-side, capability-scoped, on-notify check — not a new ambient syscall and not a per-write trap on all provider memory.
Performance note
The validator runs only on the notify/doorbell path, not on the data path and not on every provider write. Its cost is O(descriptors newly published by this doorbell) — one entry for the depth-1 admin/IDENTIFY proof, a small bounded batch for the I/O queue. Steady-state provider memory writes between doorbells are uninstrumented. This keeps the genuine-driver model without a per-access trap and without copying the data path through the kernel.
No-IOMMU Correction And Brokered Bounce Path
On GCE shapes without a usable guest IOMMU, and on the current no-IOMMU
make run-pci-nvme gate, the labeled bounce-buffer backend does not provide
a provider-visible IOVA namespace. The device-visible value a real NVMe
controller consumes is the host physical or bus address of a manager-owned page.
Publishing that value to userspace would violate the reviewed
no-host-physical-exposure invariant.
Therefore the no-IOMMU storage path must be brokered:
- The provider receives buffer capabilities, queue ownership handles, and typed NVMe command intent, not raw queue-base or PRP addresses.
- The kernel or device manager allocates/pins the bounce pages and writes
AQA/ASQ/ACQ, I/O queue-base, and PRP/SGL fields from the live ledger. - The selected
DeviceMmioclaim gatesCC.EN, queue-arm, and doorbell writes on the brokered ledger state, not on provider-supplied numeric addresses. - Teardown still quiesces outstanding DMA, blocks stale completions, scrubs
pages before reuse, and keeps
hostile_hardware_isolation=not-claimed.
Model B can be reintroduced for NVMe when the proof gate is a verified direct-remapping/vIOMMU shape where the provider-visible value is a domain-scoped IOVA, or after capOS implements a synthetic address namespace that is translated by trusted code before the controller observes it.
Brokered Alternative For No-IOMMU
The brokered model is no longer a rejected storage alternative for no-IOMMU targets. It is the required GCP/no-IOMMU design until a safe non-host-physical device-address namespace exists. Its tradeoff is narrower driver authenticity: userspace owns NVMe protocol state and command construction, but trusted kernel or manager code remains the author of raw device addresses.
Implementing Slices
nvme-doorbell-dma-validator(landed 2026-05-27 08:56 UTC): the kernel on-notify DMA validator mechanism (kernel/src/cap/nvme_doorbell_validator.rs,validate_doorbell_scan/completion_wakes_waiter) and its invariants, proven by the boundedcfg(qemu)hostile-scan self-test (prove_qemu_on_notify_scan_contract) thatmake run-pci-nvmeasserts: out-of-window, host-physical, cross-owner-alias, region-overrun, unaligned, deeper-PRP-chain, and stale-generation all fail closed with no doorbell write and no waiter wake. Synthetic owner windows stand in for the live grant ledger; the liveDeviceMmiodoorbell-path wiring is the bring-up slice below. This is the kernel component Model B requires; the controller bring-up slice depends on it. Provenance map: NVMe.nvme-no-iommu-brokered-controller-enable(landed 2026-05-27 21:38 UTC, commit11b86568): no-IOMMU replacement for the blocked provider-written enable task; brokered admin queue-base materialization with no host-physical export.nvme-userspace-bind-and-controller-bringup: remains blocked unless re-scoped to an IOMMU/vIOMMU proof lane or replaced by the brokered no-IOMMU slice above.nvme-admin-queue-identify(landed 2026-05-27 22:34 UTC, commitcede5257) closes the no-IOMMU admin command.nvme-admin-interrupt-delivery(landed 2026-05-27 23:07 UTC, commit18fd25c7) closes the admin completion wake.nvme-io-queue-and-readis the ready brokered I/O/read continuation. It inherits the same split: provider-written PRPs require direct remapping or a synthetic namespace; no-IOMMU GCP planning requires brokered PRP materialization.
Design Grounding
docs/proposals/cloud-driver-foundation-gap-analysis.md(the foundation map and the original Model A recommendation this overrides for storage)docs/dma-isolation-design.md(Cloud DMA Backend; bounce-buffer fallback; IOVA/window discipline; teardown/scrub ordering)docs/proposals/dma-assurance-model-proposal.mddocs/tasks/done/2026-05-23/ddf-provider-virtio-net-driver-closeout.md(the Model A virtio-net TX provider that this leaves unchanged)kernel/src/cap/device_mmio.rs(the selected-write claim the validator hooks),kernel/src/device_dma.rs(owner DMA window records / grant generation),kernel/src/cap/{dma_pool,dma_buffer,interrupt}_grant_source.rs,kernel/src/pci.rs(NVMe enumeration today)
Remote Session UI Security Proposal
The current Linux remote-session-ui bridge in
tools/remote-session-client/src/bin/remote_session_ui.rs is a trusted
local web bridge: a loopback HTTP listener whose Rust backend owns the
TCP connection to the capOS gateway and the upstream session, while
browser JavaScript receives only DTOs (view models, call results,
denial diagnostics, and redacted transcript rows). This document
describes the web-security posture required before that bridge ships
beyond research use, and how the Tauri desktop wrapper inherits the
controls. It also records which browser-facing controls carry over to the
capOS-served remote-session-web-ui service and which public-origin controls
belong to the selected GCE provider-terminated HTTPS policy, without
authorizing public exposure. It is cross-linked from
docs/proposals/security-and-verification-proposal.md,
docs/proposals/remote-session-capset-client-proposal.md (the parent
proposal that defines the remote session CapSet wire and host-client
shape this bridge instantiates), and the design risks register entry
R17 – Remote-session UI bridge and Tauri wrapper are research-only,
which routes long-horizon residual risk (distributable packaging,
desktop automation, non-loopback exposure) back to this proposal.
Threat Model
The bridge holds the operator’s authority to drive the capOS gateway. Anything that can issue HTTP requests to the loopback listener inherits that authority. The original bridge shape had:
- A single shared
Arc<Mutex<AppState>>constructed once inrun()(around line 1606) and cloned to every accepted connection. - No per-browser session cookie, no per-tab token, no per-origin isolation, no proof-of-possession of the original operator login.
- An origin allow-list that returns
truewhen theOriginheader is absent (origin_allowed, line 2163-2169), which lets non-browser POSTs bypass the only state-change guard. - Plain
http://127.0.0.1:<port>/transport.
Already closed:
- The previous non-constant-time
!=comparison on the automation token has been replaced withconstant_time_eqinautomation_reportandset_automation_report(seetools/remote-session-client/src/bin/remote_session_ui.rs:1378and:1392). Future secret comparisons must use the same comparator. - The loopback bridge now mints per-browser
BrowserSessioncookies, requires CSRF tokens on state-changing/api/*routes, validatesHost/Origin/ JSON content type before route work, and enforces first-wins bridge ownership through an atomic tentative reservation. - The local HTTP parser now bounds request-line length, header-line length, header count, aggregate header bytes, body size, slow reads, and concurrent handler threads before gateway or authentication work.
Gateway-host redirect scope. POST /api/config is intentionally
operator-controlled: it allows an authenticated operator to point the
bridge at a different gateway_host. This is bounded by the
operator-console trust boundary — only a caller who has already passed
the BrowserSession cookie guard and the CSRF double-submit check
(i.e., the bridge-owning operator session) can invoke it. The
capability model provides the deeper guarantee: the bridge holds a
single capOS gateway connection at a time; redirecting to an arbitrary
host replaces that connection but does not grant new capability
authority that wasn’t already present in an authenticated operator
session. No arbitrary-host proxy to untrusted endpoints is possible
without an authenticated operator action.
Treating 127.0.0.1 as a trust boundary repeats the failure pattern of
historic Docker, Jupyter, and Electron loopback CVEs: any local user,
another OS account, a malicious browser extension, a locally-running
package install script, or any other process that can connect(2) to
the listener can drive the upstream capOS gateway with the operator’s
authority. Two browsers today silently share one upstream session;
there is no way for an operator or audit log to distinguish them.
Required Posture
Per-browser BrowserSession
Mint a high-entropy opaque session id at the first browser hit and
store it server-side as a BrowserSession record distinct from the
upstream capOS session. The cookie is the only thing the browser
holds; everything else stays in AppState. Two browsers must end up
with two BrowserSession records.
Cookie attribute target:
HttpOnlySameSite=Strict(the loopback bridge has no cross-site sign-in redirect, soStrictis unconditional here; the capOS-servedremote-session-web-uibehind public ingress selects the posture from the boot manifest instead –Strictby default,Laxonly when an IAP-fronted deployment manifest grants theiap_fronted_ingressmarker, per the selected policy incloud-deployment-proposal.md– and applies it uniformly to the session, CSRF, and clear-cookie headers)Path=/- Host-only: no
Domainattribute. __Host-name prefix when transport allows it (requiresSecureandPath=/and forbidsDomain).Max-Age=...plus an absolute upper bound enforced server-side.
Secure cookie attribute over plaintext loopback is browser- and
version-specific. Modern browsers do treat 127.0.0.1 and ::1 as
potentially trustworthy origins for some Secure-Context APIs, but
acceptance and sending of Secure-flagged cookies over plaintext
loopback is not uniform across vendors and versions. Two acceptable
deployment paths:
- Move the bridge to HTTPS or to the Tauri custom-scheme secure
origin before requiring
Secureand__Host-. - Run on plaintext loopback as an interim with
HttpOnly; SameSite=Strict; Path=/; Max-Age=...and noSecure/__Host-, with a documented support matrix and a test that proves browsers retain and resend the cookie across the supported range.
Decision: option 2 (plaintext loopback, no Secure, no
__Host-). This matches the current research-stage operator-bridge
deployment, which only listens on 127.0.0.1 and is not reachable
from the network. Cookie attributes are therefore HttpOnly; SameSite=Strict; Path=/; Max-Age=<absolute-timeout-secs> exactly –
no Secure and no __Host- prefix. The follow-on Tauri / HTTPS
track will switch to option 1 (with Secure and __Host-) before
shipping beyond research use; the cookie-emit code carves out one
place to flip both attributes when transport changes.
Browser support matrix verified for option 2 (cookies retained and
resent across loopback HTTP without Secure):
| Browser | Min version | Notes |
|---|---|---|
| Chromium | 96+ | 127.0.0.1 is a potentially-trustworthy origin |
| Firefox | 96+ | same; SameSite=Strict enforced for loopback |
| Safari | 15.4+ | macOS 12.3+ / iOS 15.4+ |
| Edge | 96+ | matches Chromium |
The verification host test in iter7 round-trips a cookie through a synthetic loopback request to assert browsers within this matrix retain and resend it. Older browsers (pre-Same-Site-Strict enforcement on loopback) are not supported.
The design must not silently rely on a Secure flag that some target
browsers drop.
Server-side requirements:
- High-entropy opaque ids; never derived from user-controlled input.
- Server-side rotation: regenerate the
BrowserSessionid on successful login and on privilege transitions, and invalidate the prior anonymous/pre-auth record. (Session-fixation defense.) - Server-side invalidation on logout, idle timeout, absolute timeout,
and explicit revoke; wipe the record from
AppState. - Cookie value must never be logged, never written to the transcript, and never included in any DTO returned to the browser.
Multi-browser policy
Pick one and document it here:
- (a) Independent logins. Each
BrowserSessioncarries its own upstream capOS session; logging in from a second browser opens a second upstream session. - (b) First-wins exclusivity. The first authenticated
BrowserSessionowns the upstream session; subsequent browsers see an explicit “session already in use” denial DTO rather than silent piggy-backing.
Either is acceptable if explicit and audit-logged. Silent shared state is not.
Decision: option (b), first-wins exclusivity. The bridge today
holds exactly one upstream capOS session per process, and the
research-stage operator boot does not have a clean way to multiplex
two operator-authority sessions through a single capOS gateway
connection. First-wins is also the auditor-friendlier path: every
denied “session already in use” carries the active
BrowserSession’s timestamped lineage, so the operator and audit log
see the rejection rather than silently sharing state. Concretely:
- The first browser that starts
/api/login/password,/api/login/anonymous, or/api/login/guestafter passing local request guards reserves the owner slot before upstream gateway authentication. Successful login rotates the BrowserSession id, marks that slot authenticated, and keeps the upstream capOS session handle inAppState. - Failed local login validation, bad credentials, and gateway denials release the tentative reservation. An already authenticated owner is not released by a later bad retry from the same browser session.
- Subsequent
BrowserSessions authenticating against the same bridge get a typedsessionAlreadyInUsedenial DTO rather than an upstream login attempt, including while the first session is still authenticating upstream. The denial includes the owner’s claim or authentication timestamp so the second operator sees when the bridge was claimed. - Logout / idle-timeout / absolute-timeout on the owner releases the
upstream session and clears
owner_session_id; the next authenticator wins. - Every transition (claim / denial / release) emits a structured audit event into the same stream as upstream capOS session events so an operator looking back can see the bridge contention pattern.
The Tauri wrapper inherits this rule per-window unless the wrapper introduces an explicit multi-window upstream-fanout authority the loopback bridge does not have.
CSRF and origin discipline
- Require a valid
BrowserSessioncookie on every/api/*route, not only state-changing routes. Today’sGET /api/state,GET /api/transcript/redacted, andGET /api/automation/reportexpose state, transcript, and automation surfaces and must not rely on SOP/loopback assumptions alone. - Reject state-changing requests when
Originis missing. The currentorigin_allowedshort-circuit on missingOrigin(line 2164) must be removed for state-changing methods. ValidateOriginagainst the listener’s expected loopback origin set, and validateRefereras a fallback only whenOriginis absent on legacy paths. - Add a double-submit CSRF token bound to the
BrowserSessioncookie and required on every state-changing POST.SameSite=Strictis not sufficient defense in depth on its own. - Defense-in-depth via Fetch Metadata: reject browser POSTs whose
Sec-Fetch-Siteiscross-siteor whoseSec-Fetch-Modeis not in the expected set for the route. This is not a replacement for CSRF/Origin, but adds another layer.
DNS-rebinding hardening
Validate the Host header against the loopback set
{127.0.0.1:<port>, localhost:<port>, [::1]:<port>}. Without this,
DNS-rebinding from a malicious public site can use the victim’s
browser as a proxy into the loopback bridge.
Content-Type enforcement
Reject POSTs whose Content-Type is not application/json (or the
specific expected type for the route). This blocks text/plain /
form-urlencoded cross-origin form submits that bypass preflight.
Implemented on both surfaces. The capOS-served remote-session-web-ui
normalizes the header (casing and ;-parameters stripped) and requires
the application/json media type on every state-changing /api/* POST
class – login-family and authenticated – before route work, with a
typed 415 denial (missingContentType / unsupportedContentType).
The more specific Host/Origin denials keep precedence, and the fixed
non-JSON routes (/healthz, bundle assets, the scoped ACME http-01
challenge path) are unaffected. make run-cloud-prod-remote-session-web-ui-l4 proves the negative matrix
(missing, text/plain, form-encoded, multipart, malformed, mixed-case
parameterized non-JSON) and the parameterized/mixed-case JSON positives
over the real ingress path. This is local request-shape hardening only;
it is not public ingress or TLS readiness.
Local HTTP request and handler bounds
The host bridge remains a trusted local development bridge. These bounds reduce local resource-exhaustion and confused-client failure modes; they do not make the UI a public network service.
The HTTP parser must reject overlong request lines, overlong header lines, too many headers, excessive aggregate header bytes, and overlarge bodies before route dispatch, JSON parsing, authentication, or gateway I/O. Incomplete or slow request lines, headers, and bodies must time out under a fixed read deadline. The accept loop must also cap concurrent request handler threads and fail closed with a typed local denial rather than spawning one thread per accepted connection without bound.
CORS stance
Emit no Access-Control-Allow-Origin by default. If a future route
ever needs CORS, allow only the exact same-origin echo of the listener
URL. Refuse wildcards. Refuse Access-Control-Allow-Credentials: true
combined with permissive origins. Document the rule in code so future
contributors do not accidentally widen it.
Security response headers
Implemented in the in-guest remote-session-web-ui service
(SECURITY_RESPONSE_HEADERS / CONTENT_SECURITY_POLICY in
demos/remote-session-web-ui/src/main.rs), emitted on every response
class – HTML, static assets, JSON API, /healthz, the ACME http-01
challenge route, and every denial:
X-Frame-Options: DENY(anti-clickjacking).X-Content-Type-Options: nosniff.Referrer-Policy: no-referrer.Cross-Origin-Opener-Policy: same-origin.Cross-Origin-Embedder-Policy: require-corp.Cross-Origin-Resource-Policy: same-origin.Cache-Control: no-store.
The implemented shape applies Cross-Origin-Resource-Policy: same-origin and Cache-Control: no-store to every response, not only
API responses: every asset is consumed same-origin by the operator app,
non-browser consumers (provider health checkers, ACME validators)
ignore browser embedding policy, and serving the fixed boot-resource
bundle uncached is acceptable for the operator UI. Relaxing caching for
static assets would be a deliberate future change, not a default.
The implemented Content-Security-Policy meets the no-unsafe-inline
target for both script-src and style-src:
default-src 'none'; script-src 'self'; style-src 'self';
img-src 'self' data:; connect-src 'self'; base-uri 'none';
form-action 'self'; frame-ancestors 'none'
img-src allows data: in addition to 'self' because the committed
stylesheet’s hacker-theme dashed border is a data:image/svg+xml
background image; a data: image cannot execute script under this
policy, and folding it into the pinned bundle as a file asset would be
a separate reviewed bundle change. The earlier inline feature-flag
script and inline style="..." attributes in
tools/remote-session-client/ui/index.html were moved into static
bundle assets (/feature-flags.js, the stylesheet) before the CSP
landed, so the strict policy serves the fixed bundle without nonces or
hashes. The local QEMU proof
(make run-cloud-prod-remote-session-web-ui-l4) asserts the header set
and CSP on every response class over the real ingress, boots the served
root document in a real browser under the strict CSP with zero
securitypolicyviolation events, and asserts no Access-Control-*
header is emitted on any probed route.
Constant-time secret comparison
The automation-token check has been migrated to constant_time_eq
(automation_report and set_automation_report in
tools/remote-session-client/src/bin/remote_session_ui.rs). Apply
the same comparator to the future BrowserSession cookie value
lookup, the CSRF token check, and any future bearer/HMAC validations.
Auth-endpoint rate limiting and lockout
Add per-BrowserSession and per-listener rate limits to
/api/login/password and any future credential-handling routes.
Exponential backoff on failure. Audit-logged lockout. Wire into the
same audit stream as upstream session events so the operator sees
failed attempts.
Idle and absolute timeouts
Independent of the upstream capOS session expiry, expire
BrowserSession cookies on idle and on absolute lifetime. Force
re-auth on resume. Rotate the cookie id on re-auth.
Log injection / transcript safety
Sanitize browser-supplied strings routed into the transcript or stderr for CRLF, ANSI escape sequences, and control bytes so a hostile client cannot forge transcript rows or terminal control on operator stderr.
DTO-only-to-webview discipline
Keep the existing *Vm DTO boundary in
tools/remote-session-client/src/bin/remote_session_ui.rs (lines
~199-382). The browser must never receive raw cap handles, raw
interface ids, or unredacted session ids. The CapVm.interface_id
field is already #[serde(skip_serializing)]; preserve that pattern
for any new fields.
Self-Served And Public-Origin Carry-Over
The host-local remote-session-ui bridge and the capOS-served
remote-session-web-ui service are different deployment surfaces. The host
bridge is a trusted Linux loopback development tool whose backend owns the TCP
gateway connection. The self-served service is a capOS userspace HTTP service
that owns its TcpListenAuthority, session-manager login flow, authority-broker
bundle, and remote CapSet/proxy state inside the guest. The host bridge is not
the self-served service moved into the guest.
The authority boundary is the shared rule. Browser JavaScript receives only
view models, typed commands, typed results, denials, redacted transcript/status
rows, and fixed UI assets. It must not receive raw capOS capabilities, raw cap
ids, endpoint-owner authority, ProcessSpawner, socket factories,
NetworkManager, TcpListenAuthority, TcpListener, TcpSocket, key
material, remote CapSet handles, result-cap slots, process handles, host
usernames, host paths, host environment markers, or QEMU-forwarding identity
hints. These exclusions match the self-served Gate 1B boundary in
Remote Session CapSet Client and
the implementation proof records under
remote-session-self-served-full-ui-bundle.
Forbidden browser-visible surface matrix:
| Forbidden browser-visible class | Trusted owner or denial boundary | Proof / denial expectation |
|---|---|---|
| Raw capOS capabilities, raw cap handles, raw interface ids, and local cap ids | Held only by the remote-session-web-ui backend, its server-side proxy state, or the upstream gateway connection. | Browser envelopes, DOM state, diagnostics, transcripts, and JSON contain only DTO names and redacted labels; any browser request that tries to name a cap id fails before backend dispatch. |
| Endpoint-owner authority and arbitrary endpoint creation | Owned by the backend service runner and AuthorityBroker policy, not by browser state. | Browser launch forms name only approved service descriptors; denied launches return typed denial DTOs without endpoint-owner tokens or creation handles. |
Process handles, raw ProcessSpawner, and shell launcher authority | Kept behind AuthorityBroker-approved remote-client bundle policy. | Status and transcript rows expose only redacted process/service state; process handles and spawner markers are absent from browser-visible data. |
NetworkManager and TcpListenAuthority | remote-session-web-ui owns only the manifest-scoped UI listener for the selected proof target; the open cloudboot L4 task must source that listener through the Phase C userspace network path rather than browser or raw manager authority. | Listener/source metadata is service-derived from the accepted socket plus a service event id; browser requests cannot supply trusted source, route, or listener authority. |
TcpListener, TcpSocket, and socket factories | The HTTP accept loop owns accepted sockets and per-connection state server-side. | Browser JavaScript uses ordinary same-origin HTTP commands only; socket factory names, accepted-socket handles, and backend connection handles never appear in DTOs. |
| Key material, TLS private keys, certificates, public IPs, and firewall rules | Public-origin TLS and ingress remain in the on-hold provider-terminated HTTPS task; local and private proofs do not hold these secrets in the browser or capOS Web UI. | Local self-served and cloudboot proofs must not emit TLS key/certificate material, provider resource ids, public addresses, or firewall rule names as browser-readable state. |
| Remote CapSet handles, backend cap holders, session-global ids, and result-cap slots | Stored in server-side remote-session proxy tables and invalidated through backend logout/stale-call rules. | Browser commands reference typed route/request ids only; stale calls and unauthorized result access fail closed without leaking slot numbers or remote handles. |
| Host paths, host usernames, host environment markers, and QEMU-forwarding identity hints | Limited to development harness/operator context and not part of the capOS-served browser contract. | DOM state, JSON responses, diagnostics, and transcripts use redacted service labels; source metadata is backend-derived and cannot be replayed from browser-supplied fields. |
The matrix is a review checklist, not the enforcement mechanism. The browser boundary is acceptable only when the backend also rejects stale, unauthorized, or client-supplied authority selectors before any capability dispatch.
The carry-over controls are backend-held session state, server-side
BrowserSession records, CSRF tokens on state-changing JSON routes,
Host/Origin/Referer/content-type validation, no wildcard CORS, security
response headers, request and handler bounds, per-session rate and resource
limits, idle and absolute lifetime enforcement, logout that drops server-side
authority, transcript sanitization, constant-time comparisons for secrets, and
audit-visible denials. Those controls are required for the capOS-served service
as well as the loopback bridge, but their concrete transport assumptions differ.
On the capOS-served remote-session-web-ui, the browser-boundary baseline is
implemented and locally proven on
make run-cloud-prod-remote-session-web-ui-l4: server-side session hardening
(unpredictable rotated session ids, a domain-separated double-submit CSRF
token, Host/Origin validation, and idle/absolute lifetime enforcement),
GFE-range-pinned forwarded-scheme trust, the manifest-selected single public
origin, the IAP-aware SameSite cookie posture, JSON content-type rejection on
state-changing /api/* POSTs, the uniform security response headers with the
strict no-unsafe-inline CSP, in-guest login peer-gating with failure
backoff, and the public /healthz health-check contract. All of that evidence
is local QEMU/cloudboot proof only; none of it claims private GCE
reachability, public ingress, TLS custody, or operator exposure.
Two browser-boundary local proofs remain open as dispatchable task records
under docs/tasks/, not done: a public-deployment loopback gate that rejects
loopback Host/Origin/Referer acceptance and loopback-shaped source hints
when the public-origin load-balancer posture is configured (the landed local
proofs intentionally preserve the QEMU loopback posture), and a consolidated
browser-visible forbidden-marker matrix proof that scans every response class
– success, denial, health, manual, and error bodies – for the forbidden
surface above and proves hostile browser-supplied authority fields fail closed
before backend-held capability dispatch.
Loopback-only decisions do not carry to a public origin. The plaintext
http://127.0.0.1 cookie exception above is only for the trusted local bridge.
A public operator endpoint must use the selected policy in
Cloud Deployment:
one HTTPS origin at a GCP external Application Load Balancer, no wildcard CORS
or cross-origin credentialed requests, provider-terminated TLS with no capOS or
harness private-key custody for the bootstrap proof, capOS serving only
plain HTTP/1.1 on the backend port, no public IP on the VM, and firewall-bounded
trust in the load balancer’s forwarded-scheme headers. Public sessions use
Secure/HttpOnly/SameSite cookies, HSTS at the HTTPS edge, CSRF
Origin/Referer checks against the known public origin, bounded idle and
absolute lifetimes, and server-side logout.
The forwarded-scheme half of that trust boundary is already implemented and
locally proven on the capOS-served service: remote-session-web-ui honors
X-Forwarded-Proto only from the recorded GCP front-end source ranges
(130.211.0.0/22, 35.191.0.0/16) and treats the header from any other peer
– or any unknown peer-address format – as absent, so a direct client cannot
forge secure-context cookie posture. make run-cloud-prod-remote-session-web-ui-l4 drives both the forged-header negative
over the real ingress path and the trusted-forwarder fixture positive.
The single-public-origin half is also implemented and locally proven:
remote-session-web-ui reads exactly one public_origin.<host> manifest
marker cap (fail-closed on a second marker, a malformed, loopback-named, or
IP-literal-shaped host, or any unrecognized extra grant) and accepts the
configured
https://<host> origin in its Host/Origin/Referer gates only for
requests arriving through the trusted forwarded-scheme HTTPS path.
Cross-origin, mixed-scheme, wildcard, and missing-origin state changes fail
closed before backend-held capability dispatch, browser-supplied
principal/source hint headers are rejected on the public-origin path, no CORS
headers are ever emitted, and the loopback proof posture is unchanged. The
same proof drives a direct-client forged public Host/Origin negative over the
real ingress and the trusted-forwarder fixture positive in-process. This is
local public-origin readiness only – no DNS name, load balancer, TLS
endpoint, or live public exposure is claimed.
Keep the proof classes separate. The landed local/QEMU self-served UI bundle
proof does not prove local cloudboot L4 over the Phase C userspace network
stack. The local cloudboot L4 proof does not prove private GCE reachability.
The private GCE proof does not authorize public IPs, firewall exposure, DNS,
TLS certificates, or operator browser exposure from the internet. The later
cloud-gce-public-self-hosted-webui-ingress-tls
task remains on hold for explicit public-ingress/TLS authorization and must
build against the selected provider-terminated HTTPS policy rather than raw
public HTTP.
Tauri Wrapper
The repository now contains a check/dev Tauri wrapper scaffold under
tools/remote-session-client/src-tauri/. It does not introduce a new
remote-session authority boundary: make remote-session-tauri checks
the wrapper and host Tauri prerequisites by default, and
CAPOS_REMOTE_SESSION_TAURI_MODE=dev make remote-session-tauri
launches cargo tauri dev. The webview loads
http://127.0.0.1:3337/ from the existing remote-session-ui Rust
backend, so the backend still owns the gateway TCP connection, remote
session state, remote caps, and worker proxies. Webview JavaScript
receives only the same view models, user events, typed results,
denials, and redacted transcript rows as the trusted local web bridge.
The wrapper command also has a policy-only preflight:
CAPOS_REMOTE_SESSION_TAURI_MODE=policy tools/remote-session-tauri.sh.
That preflight runs before Tauri dependency/build checks in the normal
check path and does not require Tauri Linux packages or a desktop
session. It fails closed if the reviewed scaffold drifts: bundling must
stay disabled, both the Tauri devUrl and the single main window
URL must remain http://127.0.0.1:3337, the default capability must
grant only core:default to the main window, and the wrapper must
not add app-specific Tauri commands, invoke handlers, generate
handlers, or tauri-plugin-* dependencies/uses. This is a guardrail
over the current check/dev scaffold only; it is not evidence that
distributable packaging or desktop automation is reviewed.
The current check/dev wrapper therefore inherits the loopback HTTP bridge threat model:
- Loopback HTTP controls apply. Host validation, Origin checks,
CSRF tokens, per-
BrowserSessioncookies, request bounds, first-wins ownership, rate limiting, transcript sanitization, and DTO-only-to-webview discipline apply to the Tauri webview path unchanged because the webview talks to the same loopback backend. - No custom Tauri invoke authority. The current scaffold has no
app-specific
invokecommands for remote-session actions. Do not add Tauri commands that expose raw caps, cap ids, process handles, endpoint owner caps, result slots, host usernames, host paths, or gateway connection internals to the webview. - Distributable packaging is still residual. Bundling is disabled
until the backend lifecycle is reviewed. A future packaged wrapper
may keep a reviewed loopback sidecar or migrate to Tauri command IPC
/ custom-protocol assets, but that change must update this proposal
and re-evaluate which loopback controls still apply. The wrapper’s
packagemode is intentionally blocked until that review is done. - Webview content is the attacker. If any non-trusted asset can
ever load (remote frame, broken integrity check, mis-scoped asset
protocol), webview JavaScript becomes the attacker. CSP, asset
scope discipline, no remote frames, no
eval-style hatches still apply. - Capability/allowlist minimization. Lock the Tauri capability
manifest tightly. Every
invokecommand and every core API (fs, shell, http, dialog, process, window, clipboard, …) the frontend may call must be enumerated and minimized before distributable packaging is enabled. Misconfigured Tauri allowlists are the dominant Tauri CVE pattern; prefer per-window capability scoping over global allow. - Per-window BrowserSession isolation. If multiple windows are
spawned over a shared Rust state, keep per-window
BrowserSessionisolation matching the loopback design. - Carry-over controls. Constant-time secret comparison, rate-limiting, idle/absolute timeouts, transcript-injection sanitization, DTO-only-to-webview discipline, and audit logging apply to the Tauri wrapper unchanged.
- Desktop automation remains unreviewed. The wrapper’s
automationmode is intentionally blocked until screenshot/input authority, automation-token handling, UI-smoke oracle scope, desktop session isolation, and fail-closed teardown have a reviewed design.
Verification
Before the corresponding review-finding task is closed:
- Host tests cover each control above (cookie attributes, CSRF guard, Origin/Host validation, Content-Type rejection, CSP surface, header set, constant-time compare, rate limit, timeouts, log injection sanitization).
- The CSP refactor of
tools/remote-session-client/ui/index.htmlships in the same change set as the CSP header. - The cookie-transport choice (HTTPS/secure-origin vs. interim plaintext-loopback no-Secure) is recorded in this proposal and the matching browser support matrix is documented.
- The multi-browser policy choice is recorded in this proposal and reflected in audit logs and DTO denial diagnostics.
- The Tauri wrapper check/dev scaffold keeps the existing loopback
bridge controls in force, has no app-specific remote-session
invokecommands, leaves distributable packaging disabled until the sidecar/custom-protocol/backend lifecycle is reviewed, and keeps the policy preflight passing as a narrow guardrail over that scaffold.
Proposal: capOS Repository Harness Engineering
This proposal applies OpenAI-style harness engineering to the capOS repository itself. The goal is not to add agent features to the operating system. The goal is to make this repository a better, safer work environment for long-running agents and human reviewers.
The related capOS-Hosted Agent Swarms proposal describes capOS as a future host for OpenClaw-like agent services. This proposal describes the repository infrastructure needed so agents can work on capOS without repeatedly rediscovering project state, extending superseded designs, choosing the wrong QEMU proof, or silently drifting documentation.
Why This Proposal Exists
The capOS repo is already heavily agent-shaped:
AGENTS.mdandCLAUDE.mddefine workflow rules.docs/tasks/state.tomlselects the current milestone, and task records underdocs/tasks/define immediate gates.docs/tasks/**records open remediation and review-finding work.docs/proposals/,docs/backlog/, anddocs/research/hold design context.docs/topics.md,docs/SUMMARY.md, and proposal indexes make docs navigable.- Make targets and QEMU harnesses prove behavior.
- CUE manifests define focused system configurations.
That is enough for a careful agent to work, but it is not yet a complete harness. Too much project state still requires fragile human-style inference: which document is authoritative, which proposal is stale, which run target proves which behavior, which open finding blocks a task, and which design pivot explains why old text should not be extended.
OpenAI’s harness engineering lesson is direct: what an agent cannot inspect in its working context effectively does not exist. capOS should therefore compile its project state into repo-local, versioned, mechanically checked artifacts.
Two existing tracker documents already shape the harness contract this proposal builds on, and the artifacts below must stay consistent with them rather than re-derive their state:
- Trusted Build Inputs inventories the toolchain, generated bindings, dependency policy, Limine pin, QEMU/OVMF observation, and host-tool surface the repo currently trusts. Any run-target, proof, or generated-code claim the harness exposes to agents must point back to that inventory rather than restate pinning or drift status independently.
- Design Risks and Open Questions Register is the consolidated index of long-horizon design risks (including the supply-chain risk R13, the harness-coverage gaps, and the open-question pointers for proposal/backlog/design ownership). Harness artifacts that claim a risk is “tracked” should cite the register row, and new risks surfaced by harness checks should be filed there rather than buried in this proposal.
Scope
In scope:
- agent-facing repository map;
- task-selection and milestone state;
- proposal/research/status consistency checks;
- run-target and QEMU proof inventory;
- machine-readable design relationships;
- agent-maintained but reviewed knowledge compilation;
- deterministic evals for future coding agents;
- active-work and shared-resource visibility;
- review and security handoff artifacts.
Out of scope:
- capOS-hosted agent runtime implementation;
- model provider selection;
- browser, MCP, or A2A runtime integration;
- replacing human review;
- changing the current mandatory worktree workflow.
Design Principles
-
Repository-local context wins. Important design and workflow state should live in tracked files, not in chat history or operator memory.
-
Indexes are harness inputs.
docs/topics.md,docs/SUMMARY.md, proposal indexes, backlog pointers, and run-target tables are not cosmetic; they are how agents find the right context. -
Status must be checkable. Proposal status, supersession, implementation status, selected milestone, and review findings should fail checks when they drift.
-
Proofs need names and ownership. A QEMU harness target should say what it proves, which manifest it uses, which proposal/backlog owns it, and what transcript shape is expected.
-
Compiled knowledge is non-authoritative until reviewed. Agent-generated wiki pages can help navigation, but proposals, architecture docs, schemas, code, and review findings remain authoritative.
-
Prefer generation over duplicate hand-maintained state. When possible, sidecars and indexes should be generated from front matter, Makefile metadata, manifests, or explicit source files.
-
Expose replacement paths. If a proposal is superseded, an agent should see the replacement before acting on stale text.
-
Make unsafe shortcuts hard. The harness should steer agents away from main-worktree edits, stale branches, missing review, unverified QEMU claims, and undocumented design pivots.
-
Agents must know when they are not alone. Shared resources such as git branches, worktrees, docs indexes, task lists, generated files, and review queues need visible ownership, lease, and version state before agents mutate them.
Proposed Artifacts
docs/agent-harness.md
A concise entry point for future agents. It should answer:
- where current project state lives;
- how to choose a task;
- how to create a compliant worktree;
- how to find relevant proposals, backlog, research, and review findings;
- how to choose checks;
- how to handle docs/status updates;
- how to hand off verification and review.
This file should link to authoritative docs rather than duplicate them. It is a map, not a new policy source.
docs/run-targets.md
Generated or maintained inventory of run/check targets:
| Target | Manifest | Purpose | Expected proof | Owner |
|---|---|---|---|---|
make run-session-context | system-session-context.cue | one immutable session context proof | hostile second-session attempts fail closed | session-bound invocation context |
make run-chat | system-chat.cue | resident chat service proof | session-scoped chat transcript | chat/shared-service proposal |
The table should cover make run-*, make qemu-*, docs checks,
generated-code checks, and security checks. Agents should not infer target
meaning from target names alone.
Active Work Registry
Add a small generated or reviewed active-work registry for concurrent agents. It should be derived from git worktrees where possible and supplemented by task metadata:
| Task | Branch | Worktree | Claimed resources | Mode | Expires | Status |
|---|---|---|---|---|---|---|
| example-session-model | feat/session-model-proof | <worktree-root>/session-model-proof | src/capos/service.rs, docs/proposals/session-context.md | exclusive source, shared docs | 2026-05-01 | checking |
The registry is not a replacement for git or human review. It is a harness surface for “another agent is already touching this shared resource.” The row above is synthetic sample data, not live project state.
The same registry should also feed the daily development-performance report defined in capOS Agentic Development Experiment. Git can explain what merged, but the registry explains live ownership, intended role, claimed resource surface, and whether a task was implementation, review, verification, recovery, or metrics processing.
Minimum fields:
- task or issue id;
- owner identity or runner id;
- actor class when known:
claude,codex,human/manual,mixed, orunknown; - role: implementation, review, planning/design, verification, recovery/integration, or recap/metrics processing;
- attribution confidence: direct, corroborated, inferred, or unknown;
- branch and worktree path;
- claimed paths, subsystems, generated outputs, todo items, or review queues;
- exclusive/shared mode;
- observed base revision;
- lease expiry and renewal time;
- status: planning, editing, checking, review, merge, blocked, abandoned.
Rows should keep attribution confidence explicit. A direct session id, commit
trailer, or operator-created row is higher-confidence than timestamp overlap.
Low-confidence rows should stay unknown or mixed rather than assigning work
to a specific tool.
For the current repo workflow, this would make the existing worktree policy
queryable. For a future capOS-hosted swarm, the same shape becomes a
SharedResource/ResourceLease service: git repos, shared todo items, wiki
pages, generated docs, and merge queues all get visible claims and versioned
writes.
Proposal Relationship Metadata
Add or standardize front matter fields:
status: "Future design. No implementation."
last_reviewed: "2026-04-28 00:00 UTC"
supersedes:
- old-proposal.md
superseded_by: new-proposal.md
implemented_by:
- commit-or-target
owned_backlog: docs/backlog/example.md
proof_targets:
- make run-example
The exact schema can be narrower at first. The important requirement is that replacement and proof relationships become queryable.
Design Pivot Records
Add short ADR-style files under docs/decisions/ for high-impact pivots:
- endpoint badges as service identity rejected;
- service-object capabilities superseded by session-bound invocation context;
- SSH work paused behind session-bound invocation context;
- hosted agents split from shell agent mode.
Each record should state context, decision, consequences, superseded docs, and current replacement docs.
docs/agent-wiki/
A generated or agent-maintained compiled knowledge tree:
index.md: current topic map;capability-model.md: current “interface is permission” model;session-model.md: implemented session-bound invocation context summary;shell-and-remote-access.md: shell, Telnet, SSH, WebShellGateway status;qemu-proofs.md: proof target summaries;open-findings.md: current review findings summarized with links.
This tree must be clearly labeled as compiled navigation, not authority. It can be hidden from public docs until reviewed.
Agent Evals
Add deterministic repository-workflow evals:
- identify selected milestone from
docs/tasks/state.toml; - find the relevant backlog and proposal;
- reject editing the main worktree;
- detect another active worker claiming the same exclusive path or generated output;
- choose a non-overlapping task or wait when a shared resource is already leased;
- identify required checks for a doc-only proposal change;
- detect a superseded proposal and follow replacement;
- update proposal index and summary when adding a proposal;
- avoid claiming full tests passed when only docs built;
- surface open review-finding task records before unrelated feature work.
These evals can start as scripted fixtures. They do not need live model calls.
Mechanical Checks
Extend existing documentation tooling to check:
- every proposal in
docs/proposals/is present indocs/proposals/index.mdor an explicit archive section; - every proposal linked in
docs/SUMMARY.mdexists; - every proposal with topics appears in
docs/topics.mdafter generation; superseded_bypoints to an existing file;- superseded proposals display a replacement link near the top;
- selected milestone in
docs/tasks/state.tomlhas matchingdocs/tasks/README.md/ backlog orientation; - run-target inventory entries point to existing Make targets and manifests;
- research-backed proposals link at least one
docs/research/*.mdnote; - external source snapshots in research notes include a review date;
- QEMU proof claims name a target;
- active-work registry entries point to existing branches/worktrees when local;
- no two active registry entries claim the same exclusive resource unless one is marked blocked, abandoned, or waiting for merge;
- daily metrics rows that cite an active-work entry use a known actor class, role, and confidence label.
These checks should start warning-only if needed, then become required once the metadata is in place.
The harness checks above stop at proposal/index/run-target/active-work hygiene. They are deliberately not a substitute for the security review process. Trust-boundary review, threat-model refresh, per-boundary CWE/CAPEC tagging, tiered tooling (CI hygiene, miri/proptest/fuzzing, Loom, Kani), and the Security Verification Track registry live in Security Review and Formal Verification. When a harness check (for example “proof claim names a target” or “active-work registry attributes a generated output”) touches trust-boundary or supply-chain authority, it must route the finding to the matching security verification track or design-risks register row rather than absorb the authority claim into agent-facing harness metadata.
Workflow Impact
For agents:
- start at
docs/agent-harness.md; - read selected milestone state through stable headings or generated sidecar;
- inspect active-work/resource claims before choosing or mutating shared files;
- follow proposal relationship metadata to avoid stale design;
- choose checks from run-target inventory;
- update docs/status through mechanically checked indexes;
- hand off with proof target names and transcript artifacts.
For humans:
- less repeated explanation of repo rules;
- easier review of whether an agent chose the right context;
- clearer detection of stale docs;
- explicit locations for “why did we change direction?” records.
Implementation Phases
Phase 1 - Map and Inventory
- Add
docs/agent-harness.md. - Add initial
docs/run-targets.mdby hand for major run targets. - Link both from
docs/SUMMARY.md,docs/topics.md, andREADME.md. - Add a short section in
docs/tasks/README.mdpointing future agents to the harness map.
Phase 2 - Metadata and Checks
- Standardize front matter for proposals and research notes.
- Extend mdBook metadata tooling to validate proposal index, topic membership, summary links, status fields, and supersession links.
- Add run-target inventory validation against Makefile and manifest paths.
Phase 3 - Decision Records
- Add
docs/decisions/and initial pivot records for the session-bound invocation context change and hosted-agent split. - Link decisions from affected proposals and backlog files.
Phase 4 - Compiled Agent Wiki
- Create a reviewed
docs/agent-wiki/seed for the current selected milestone. - Add lint for stale links, missing citations, and “compiled, not authority” labels.
- Decide whether generated wiki pages are published in mdBook or kept as repo-internal harness files.
Phase 5 - Agent Workflow Evals
- Add fixtures and scripts for repository-workflow evals.
- Run them in a docs/check target.
- Use failures to improve
docs/agent-harness.md, metadata, and run-target inventory.
Open Questions
- Should proposal relationship metadata live only in front matter, or should there be a generated JSON sidecar for fast agent/tool consumption?
- Should
docs/agent-wiki/be generated on demand or checked in after review? - How much QEMU transcript output should be retained as proof artifacts without bloating the repository?
- Should run-target metadata live in Makefile comments, a CUE file, or
docs/run-targets.mdfront matter blocks? - How strict should the first status linter be, given existing historical docs?
- Should agent evals be part of
make docs, a separatemake agent-harness-check, or a broadermake check?
Relationship to Existing Documents
- Hosted agent harnesses research records the external harness research and the initial checklist.
- capOS-Hosted Agent Swarms uses this repo harness as precedent for future capOS-hosted agents.
- mdBook Documentation Site owns public docs structure and status vocabulary; this proposal adds agent-legibility and mechanical checks on top.
- Trusted Build Inputs is the source of truth for toolchain pinning, generated-code drift, dependency policy, Limine binary pinning, observed-only QEMU/OVMF surface, and host-tool inventory. The harness run-target inventory, proof-target metadata, and generated-output active-work claims in this proposal must cite the relevant row there rather than re-derive trust status.
- Security Review and Formal Verification owns the trust-boundary model, per-boundary CWE/CAPEC checklist, tiered tooling (CI hygiene, miri/proptest/fuzzing, Loom, Kani), and the Security Verification Track registry. Harness mechanical checks must hand security- bearing findings to that proposal’s tracks rather than redefine review authority.
- Design Risks and Open Questions Register is the consolidated index of long-horizon design risks and open architectural questions. New harness-surfaced risks should be filed against existing rows there (for example R13 for supply-chain pinning gaps) or added as new rows, not buried in harness artifacts.
CLAUDE.md,AGENTS.md,docs/tasks/README.md, and the task ledger remain authoritative workflow inputs.docs/agent-harness.mdshould route to them, not replace them.
Proposal: capOS Agentic Development Experiment
This proposal treats capOS development as a longitudinal field experiment in agentic software engineering. The experiment studies whether persistent coding agents, subagents, review agents, recovery routines, and session-recap tooling can make sustained progress on a nontrivial operating-system project while preserving engineering quality, reviewability, and coordination safety.
The core question is not whether an AI can produce isolated code changes. The stronger question is whether an agentic workflow can maintain a coherent project over many sessions, interruptions, branches, reviews, and handoffs, and which process controls keep that workflow reliable.
This proposal studies the development-time workflow that produced capOS, not the in-system agent runtime that capOS itself targets. The capability-served language-model, embedder, and agent-runner surface lives in Language Models and Agent Runtime; that proposal is the authority on tool-use loops, per-tool permission modes, and how a future agentic capOS user surface holds model authority. The experiment described here uses external Claude and Codex sessions running against the repo, and records observations about their behaviour for later analysis.
Motivation
capOS is a useful setting because it is systems software with real correctness constraints: kernel behavior, capability discipline, QEMU evidence, generated schemas, docs, reviews, and integration rules all matter. It is a stronger testbed than toy programming tasks because the work has long dependency chains and observable integration gates.
The immediate practical need is session memory. Raw ~/.codex and ~/.claude
logs contain the evidence, but they are too large and operationally noisy for
routine recovery or research analysis. The recap tooling creates a derived
evidence layer: structured metadata, compact evidence packets, plain-text
summaries, parent/child session graphs, and freshness tracking.
Research Questions
- Can agentic development produce sustained, reviewable progress on capOS across many sessions and subagents?
- Which controls reduce coordination failures such as stale ownership, duplicate work, unsafe branch cleanup, live-process confusion, and review drift?
- How should parent sessions and subagent sessions be summarized so project history remains useful without recursively flooding the recap system?
- How reliable are LLM-generated factual recaps when grounded in compact evidence packets rather than full transcripts?
- What failure modes remain visible after adding stronger evidence fields, prompt examples, routing rules, and summary comparison snapshots?
Hypotheses
- Dedicated worktrees, explicit ownership rules, and mandatory review gates reduce destructive interference between concurrent agents.
- Root-session summaries plus compact child-session evidence are more useful than treating every subagent as an independent top-level recap by default.
- Small summarizer models can handle simple review sessions when given exact paths, strict output scope, and good/bad examples, but routine/recovery and child-heavy parent sessions need stronger models.
- Derived recaps can support research and operations if treated as coded observations, while raw transcripts remain the authority for audits.
- Iterative prompt and evidence changes can measurably reduce recap defects such as bootstrap-boilerplate summaries, queue-processing self-references, and “limited evidence” outputs.
Experimental Setting
The setting spans more than one development machine. Session identity therefore needs an explicit source-machine dimension: a session captured through one machine but originating on another must remain attributed to the originating machine in raw manifests and derived data.
Observed source classes:
- Claude transcripts under
~/.claude/projects/.../*.jsonl. - Claude live metadata under
~/.claude/sessions/*.json. - Codex thread metadata in
~/.codex/state_5.sqlite. - Codex parent/child relationships in
thread_spawn_edges. - Codex rollout transcripts under
~/.codex/sessions/YYYY/MM/DD/. - Git branch, worktree, commit, review, and check evidence from this repo.
Raw collection keeps source_host separate from capture_host. A central
machine may perform the capture, but the manifest records where each source
file originated.
An initial private pilot inventory found a large child-session skew: most Codex sessions were spawned subagents rather than root sessions. This motivates the default policy of indexing every session while queuing only primary/root sessions for standalone summaries.
Tooling
The repo-tracked tools live under tools/agent-session-recaps/:
maintain_recap_store.pyinventories local Claude/Codex sessions, writes script-owned metadata and evidence JSON, maintains summary queues, maps live PIDs conservatively, and ingests LLM-ownedsummary.txtfreshness metadata.archive_raw_sessions.pysnapshots raw session sources with host provenance, checksums, compression, optional project filtering, and optional upload to private object storage.
The default derived recap store remains outside the repo:
~/ai-session-recaps/index.json~/ai-session-recaps/by-session/{tool}/{session_id}/meta.json~/ai-session-recaps/by-session/{tool}/{session_id}/evidence.json~/ai-session-recaps/by-session/{tool}/{session_id}/summary.txt~/ai-session-recaps/by-session/{tool}/{session_id}/summary.meta.json~/ai-session-recaps/queue/*.json
Important design choices:
- Summary prose lives only in
summary.txt. - JSON files remain script-owned metadata/evidence/freshness files.
- The index tracks source
updated_attimestamps for staleness. - Parent/root sessions are queued by default.
- Spawned child sessions remain indexed and linked, but are not queued by default.
- Parent evidence includes compact child-session evidence so root summaries can include meaningful subagent outcomes.
- Codex
task_complete.last_agent_messageis extracted to improve final review and implementation verdicts. - Live Claude/Codex PIDs are mapped conservatively using
/proc, ClaudeprocStart, Codex wrapper/native process relationships, and explicit Codex resume evidence when available.
Data Products
The experiment distinguishes four layers:
- Raw logs: private source of truth.
- Evidence packets: compact redacted excerpts, metadata, child-session packets, and command/check summaries.
- LLM summaries: qualitative coded observations, not ground truth.
- Analysis snapshots: immutable comparison runs that evaluate prompt and evidence changes.
Daily development-performance reports are analysis snapshots. They combine git, worktree, check, review, and session evidence for a bounded reporting window. They are not raw logs and should not contain private prompts, unredacted transcripts, local credentials, or unrelated operator context.
Raw transcripts should not be committed to the public source history. Evidence packets and summaries may be committed only after redaction policy and privacy review. Tooling, schemas, prompts, synthetic examples, and methodology docs can be tracked first.
Raw Evidence Archival
The recap store is derived data; it is not enough for auditability. Raw session sources should be archived separately, with checksums and a manifest that lets a later analysis reproduce which transcript version produced each evidence packet and summary.
Preferred raw archive design:
- Use private object storage, such as a locked-down GCS bucket, as the default archive for raw session logs.
- Store compressed snapshots by capture time and source host, for example:
gs://<private-bucket>/capos-agentic-dev/raw-sessions/YYYY/MM/DD/<snapshot-id>/
manifest.json
sha256sums.txt
hosts/primary-dev/.codex/sessions/....jsonl.zst
hosts/primary-dev/.claude/projects/....jsonl.zst
hosts/portable-dev/.codex/sessions/....jsonl.zst
hosts/portable-dev/.claude/projects/....jsonl.zst
- Enable uniform bucket-level access, least-privilege IAM, lifecycle rules, and object versioning or retention if the bucket policy allows it.
- Consider customer-managed encryption if the archive will contain sensitive prompts, private operational instructions, or source excerpts.
- Store a manifest with source host, capture host, source path, archive object path, byte size, SHA-256, source mtime, capture timestamp, tool, session id when known, compression, and redaction status.
- Keep the manifest path or archive snapshot id in derived recap metadata so summaries can be audited against the exact archived source.
- Do not merge logs from one machine into another machine’s live
~/.codexor~/.claudetrees. Gather them into host-partitioned archives first, then import from that archive if the recap store is extended to multi-host analysis.
Project filtering matters on machines that contain unrelated Claude/Codex
projects. archive_raw_sessions.py --project-root <path> selects only Codex
rollouts whose threads.cwd is inside a selected project/worktree root, writes
a filtered Codex state JSON extract, and selects matching Claude project JSONL
and session metadata. Full Codex SQLite state, global history, Codex logs DB,
Claude tasks, and Claude file history are opt-in.
Git branch or Git LFS storage is useful only under tighter constraints:
- A private Git LFS dataset branch can be convenient for small, curated, redacted, or synthetic fixtures.
- Raw local session logs should not go into normal git history because they may contain private prompts, operational instructions, credentials accidentally pasted into chat, or unrelated user content.
- Even private Git LFS is awkward for raw logs if later deletion or redaction is needed, because clones and LFS object stores can retain historical content.
- If Git LFS is used, prefer a separate private data repository or an orphan data branch, never the normal capOS source branch.
Recommended split:
- Git-tracked source repo: tooling, schemas, prompts, proposal, methodology, redaction scripts, synthetic examples.
- Private object storage: exact source session JSON/JSONL and SQLite snapshots.
- Optional private Git LFS dataset: curated redacted snapshots used by paper reviewers.
- Public artifact, if any: synthetic fixtures plus aggregate metrics and selected redacted examples.
Pilot Results
An initial private pilot processed a small queue of current summaries and then reran a target set after prompt/evidence changes.
Baseline result:
- Current summaries: 53.
- Bad queue/meta/evidence self-reference markers: 0.
- “Limited evidence” summaries: 7.
- Child-heavy current summaries: 3.
First intervention:
- Added prompt good/bad examples.
- Added compact child-session evidence for parent Codex sessions.
- Dereferenced child recap-worker output files so parent summaries see summary text, not only completion paths.
Second intervention:
- Added Codex
task_complete.last_agent_messageextraction. - Reran the remaining limited-evidence summaries.
Combined candidate result:
- Candidate summaries: 53.
- Bad self-reference markers: 0.
- Limited-evidence summaries: 0.
- Average baseline summary length: 1221.6 characters.
- Average candidate summary length: 1060.6 characters.
These results support the claim that prompt examples help, but evidence shape matters more. The weak summaries were not only a prompt problem; they lacked the right final-result evidence.
Methodology
Collection
Run the recap maintainer periodically or after major work bursts. Each run should:
- refresh metadata for all sessions;
- update evidence for recent primary/root sessions;
- preserve child-session graph information;
- update live-process mappings;
- queue only stale or missing summaries;
- record immutable analysis snapshots before major prompt/evidence changes.
Summarization
Use model routing:
gpt-5.3-codex-sparkfor simple, concrete, non-routine sessions.- A stronger model for routine/recovery sessions, live sessions, and child-heavy parent sessions.
Keep small-model tasks concrete:
- one queue item or a very small batch;
- exact paths;
- no JSON output;
- no broad filesystem exploration;
- good/bad examples in the prompt;
- strict instruction to summarize target-session outcomes, not the queue-processing task.
Metrics
Hard metrics:
- session count by tool, primary/child/root role, model, and live state;
- queue size and stale/current summary count;
- child-session count per root;
- number of review findings, no-finding reviews, failed checks, and passed checks;
- branch/worktree lifecycle events: created, committed, reviewed, merged, pushed, parked, abandoned;
- recovery-session frequency and duration;
- recap quality markers: self-reference markers, limited-evidence phrases, missing final verdicts, excessive bootstrap boilerplate.
Qualitative coding:
- coordination failures;
- evidence gaps;
- useful controls;
- subagent summarization failures;
- review-loop behavior;
- human intervention points.
Daily Development Metrics
The daily report answers a narrower operational question than the recap store: what project progress happened during the reporting window, how strongly was it validated, and which agent/human channels contributed to it. It should keep project-performance metrics separate from attribution. Raw commits or lines of code are activity signals, not performance by themselves.
Use a fixed window and record it in the report. UTC calendar days are the
default for cross-machine comparison; a local workday boundary may be used only
when the report records the chosen day_start hour. The collector should
derive base and tip commits from the window and report both raw and
normalized git stats.
Normalize diff metrics by separating generated and vendored churn:
- raw commit count and non-merge commit count;
- first-parent merged task branches;
- raw file and line stats;
- authored file and line stats excluding
vendor/**andtools/generated/**; - optional secondary exclusions for lockfiles and generated demo content;
- top-level directory and subsystem breakdown;
- schema changes and generated-code regeneration as distinct rows.
Project-progress metrics:
- reviewed task slices merged;
- selected-milestone gates closed;
- task records closed under
docs/tasks/done/; - review-finding task records opened, closed, or carried forward;
- blockers retired and blockers still open;
- new capability, schema, runtime, demo, manifest, or QEMU proof surfaces;
- checks and QEMU targets recorded as passed, failed, skipped, or flaky;
- review iterations and review finding severity;
- rework after review or after merge.
Validation metrics should be evidence-first. A report may say a check was recorded only when it can point to a session evidence packet, saved log, commit message, or local check database entry. It should not convert a conversational claim into “passed” without a corroborating artifact. Flakes should be recorded separately from deterministic failures.
Attribution metrics are secondary accounting. Attribute by task slice and role,
not by raw line count. The report should allow at least these actor classes:
claude, codex, human/manual, mixed, and unknown. A commit trailer,
session evidence, or active-work registry row can support attribution, but
timestamp overlap alone is low-confidence and should remain unknown unless
corroborated.
Split roles explicitly:
- implementation;
- review;
- planning/design;
- verification/check running;
- recovery/integration;
- recap/metrics processing.
The Claude/Codex split should be reported as a matrix of actor class by role, with counts of task slices, sessions, review findings, checks, and merged commits where known. It should not rank agents by total commits or authored lines because generated code, vendored dependencies, docs refreshes, and review work distort that comparison.
Recommended daily report sections:
- Executive summary: visible progress, evidence gates closed, blockers retired, and blockers still open.
- Git metrics: raw commits, non-merge commits, merged task branches, normalized diff stats, generated/vendor churn.
- Area breakdown: kernel, schema, runtime, demos, tools, docs, and plans.
- Evidence and validation: checks, QEMU proof targets, flakes, skipped gates, and missing gates.
- Review and rework: review iterations, findings opened/closed, severity, and post-review or post-merge rework.
- Claude/Codex/human split: role-based attribution with confidence labels.
- Planning state: selected milestone, active high-priority tasks, closed plan items, stale blockers, and next credible gates.
The active-work registry proposed by capOS Repository Harness Engineering is the preferred source for live task ownership, claimed resources, and role labels. Git remains the authority for merged history; raw session archives remain the authority for auditing derived summaries.
Validation
Treat summaries as coded observations. Validate claims against raw logs, git history, and checks before using them as paper evidence. The capOS review and verification regime described in Security and Verification is the authority on what counts as a closed review gate, what counts as a deterministic check versus a flake, and how trust boundaries are documented. The recap store and daily report cite those gates rather than redefining them: a summary may record that a check passed only when the evidence packet, saved log, or commit trailer matches one of the named gates in that proposal.
Use audits:
- sample raw transcript lines for selected summaries;
- verify cited commits and branches;
- verify check outcomes in logs;
- compare parent summaries against child-session final results;
- rerun summaries after prompt/evidence changes and compare snapshots;
- compare daily report attribution against commit trailers, session evidence, and active-work registry rows;
- sample normalized diff calculations to ensure generated and vendored files are not counted as authored development volume.
Threats To Validity
- Single-project bias: capOS is one project with one workflow.
- Model/version drift: model behavior and Codex/Claude log schemas may change.
- Observer effect: improving prompts and processes changes the system being studied.
- LLM-coded summaries can omit or distort details.
- Raw logs may contain private operational data, limiting public reproducibility.
- Agent behavior is affected by local instructions, model routing, and tool availability.
Paper Outline
- Introduction: why long-running agentic development is different from single-prompt code generation.
- Background: capOS, worktrees, review gates, Codex/Claude sessions.
- System design: recap instrumentation, evidence packets, child-session graph, model routing, live-process mapping.
- Methodology: longitudinal observation, metrics, prompt/evidence interventions, audit strategy.
- Pilot findings: session scale, child-session dominance, failure modes, recap improvement loop.
- Case studies:
- recovery session after interruption;
- child-heavy device-driver-foundation work session;
- repeated review loop;
- recap prompt/evidence refinement.
- Discussion: what worked, what remained brittle, implications for agentic software engineering.
- Limitations and future work.
Immediate Next Steps
- Add schema documentation and a privacy/redaction README.
- Add repeatable analysis scripts for baseline/rerun comparison.
- Add a daily metrics collector that joins git, recap evidence, active-work rows, check artifacts, and review findings into the report sections above.
- Add a small synthetic fixture set that exercises:
- root session with children;
- recap-worker child returning only a path;
- review session with
task_complete; - recovery session with bootstrap boilerplate.
- Decide whether generated summaries should be tracked privately, exported as redacted snapshots, or kept only as local research data.
Proposal: Symmetric Multi-Processing (SMP)
How capOS goes from single-CPU execution to utilizing all available processors.
Grounding and Cross-Links
The SMP substrate is one half of capOS’s multicore story; scheduler policy above it is the other half, and they advance through coupled gates. Read this proposal together with:
- Scheduler Evolution – Phase D (per-CPU
WFQ, bounded stealing) and Phase E (
SchedulingContextbind/revoke, budget, donation/return, depletion notification) are closed; Phase F has landed the one-SQ-consumer prerequisite, nohz telemetry, housekeeping/deferred-work placement, bounded SQPOLL ring mode, the clockevent/deadline substrate, and bounded non-periodic SQPOLL producer-wake progress, the first automatic nohz activation increment closed viadocs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md, and SQPOLL-driven auto-nohz activation closed viadocs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md; timeout-based auto-revoke and ordinary-thread generic full-nohz admission are also landed; generic SQPOLL nohz for arbitrary rings and policy-service AutoNoHz issuance remain future work; Phase F.5 (full-SMP 16/32-core scalability planning) is the named gate for the milestone described below in Full-SMP Scalability Milestone and remains planning, not closed. - In-Process Threading Contract – thread-owned
execution state, generation-checked
ThreadRefqueues and wake records, per-thread ring mappings, and the recorded same-process 1-to-2 / diagnostic 1-to-4 evidence rows that this proposal’s scalability work must keep honoring. - Design Risks Register, Q9 – CPU accounting and scheduling
contexts
– partial-status answer that covers per-CPU WFQ, Phase E
SchedulingContext, and the cross-service donation / nohz activation / isolation lease / cross-principal fairness work still open. - Ring v2 For Full SMP – per-thread ring
endpoints and
cap_enter-on-thread-CQ are the dispatch contract this proposal’s scheduler-ownership milestones rely on. - SMP Phase C backlog – decomposed task list for the in-progress Phase C work tracked below.
The migrated task
kernel-upper-half-pml4-propagation-hardening
carries the Phase C residual for kernel upper-half page-table mutation after AP
startup. The retained finding is closed for the current kernel
MMIO/firmware helper path: paging::init() pre-seeds the helper’s upper-half
PML4 slot, AddressSpace::new_user clones upper-half entries from the
synchronized kernel root under the kernel page-table lock, and
map_kernel_physical_range rejects any attempt to create a previously absent
kernel-half PML4 slot after a user address space has been created. User-side
AddressSpace::{map,unmap,protect} remains shootdown-aware against resident
CPU masks; kernel upper-half edits inside pre-existing slots use the
kernel-wide shootdown path. Future helper windows or allocator-growth paths
that would require a new upper-half PML4 slot must pre-seed that slot before
user address-space creation or add synchronized active propagation into live
address spaces.
This document has three phases: a per-CPU foundation (prerequisite plumbing), AP startup (bringing secondary CPUs online), and SMP correctness (making shared state safe under concurrency).
Current status: Phase A’s BSP per-CPU foundation and Phase B AP startup are complete. Phase C has completed syscall GS migration, LAPIC/IPI, TLB shootdown, the first AP scheduler-owner handoff, temporary scheduler ownership on CPUs 0-3, per-CPU WFQ runnable queues under the shared scheduler lock, bounded stealing, and bounded idle-to-runnable wake targeting for queued and direct-IPC wakeups. The current scheduler is no longer the temporary single-global-runnable-queue shape from the 2026-05-02 collapse. Remaining SMP risks are the shared scheduler lock, temporary pinning replacement, scheduler-driven AP idle policy, broader workload classes, and higher-thread-count evidence. The next SMP product-level milestone should be full-SMP scalability evidence on a real 16/32-core environment, with QEMU kept for boot and regression coverage rather than as the primary performance source.
Implementation checkpoint: the BSP now has a concrete PerCpu object with
stable syscall-stack offsets, and syscall entry uses KernelGsBase/swapgs
to reach the per-CPU kernel RSP and saved user RSP slots. The scheduler mirrors
its current ThreadRef into the BSP record.
Second checkpoint: runtime stack switches now flow through
percpu::set_kernel_entry_stack, which updates the BSP PerCpu.kernel_rsp
slot and the BSP TSS.RSP0 together. Scheduler and interrupt paths no longer
coordinate those two updates by calling separate GDT and syscall helpers.
Third checkpoint: kernel/src/arch/x86_64/smp.rs now issues the Limine
MpRequest, enumerates non-BSP CPUs, allocates AP-local PerCpu records and
kernel/IST stack storage, and records dense capOS CPU ids separately from Limine
processor and LAPIC ids.
Fourth checkpoint: APs now start through MpInfo::bootstrap() and reach a
parked kernel idle loop. The BSP passes an AP record pointer through Limine
extra_argument, waits for a bounded online count, and remains the only CPU
that schedules userspace. Each AP loads AP-owned GDT/TSS state, the shared IDT,
KernelGsBase, and syscall MSRs, reports online, disables interrupts, and
parks in hlt. Review tightened this checkpoint so APs first switch from
Limine handoff state to the capOS kernel PML4 and AP-owned kernel stack before
any online signal.
Fifth checkpoint: syscall entry/exit now runs with kernel GS active between
entry and return. Normal returns swap back before sysretq, and blocking or
exiting syscall paths that leave through scheduler iretq restore use a
dedicated trampoline to swap GS back before restoring the next user context.
Sixth checkpoint: the BSP now enables xAPIC MMIO, maps the LAPIC page through the kernel MMIO allocator, calibrates the LAPIC timer initial count against PIT channel 2, runs scheduler ticks through LAPIC timer vector 48 with LAPIC EOI, installs the LAPIC spurious vector, and masks the legacy PIC once LAPIC ticks are active. Parked APs initialize local APIC state before reporting online. IDT vector 49 and a bounded vector-49-only fixed IPI send primitive back TLB shootdown and bounded idle-to-runnable reschedule requests.
Seventh checkpoint: user page-table map, unmap, and protect now flush the
local CPU and then route through a serialized vector-49 TLB shootdown helper
using each AddressSpace’s resident CPU mask. The helper records pending
full-TLB flush generations and sends vector-49 IPIs to online resident CPUs
other than the caller, then returns a completion token that callers wait after
dropping ring dispatch locks. Scheduler CR3 handoff points mark the selected
address space resident on the current CPU.
Eighth checkpoint: scheduler current-thread state is split into per-CPU slots,
AP PerCpu records are registered for current-thread and kernel-entry stack
updates, AP TSS.RSP0 is updated during context switches, and AP cpu=1 can enter
the scheduler from the AP idle loop when its LAPIC timer is available. The
first AP proof intentionally keeps one scheduler owner: when AP cpu=1 is online
with a programmed timer, the BSP remains in kernel idle so the process-wide
capability ring is not executed concurrently. The scheduler idle path is now a
per-CPU CPL0 (kernel-mode) idle thread; the user-mode idle process was removed
in commit e3c0df01 (2026-05-14 UTC). “Kernel idle” throughout this proposal
refers to that per-CPU CPL0 idle thread, not a user-mode idle process.
Depends on: Stage 5 (Scheduling) – needs a working timer, context switch, and run queue on the BSP before adding more CPUs.
Phase B completion: AP startup is implemented and reviewed. The private
process-buffer validate_user_buffer
TOCTOU blocker is closed for single locked copy/read paths, and Phase A now
has the BSP running through concrete per-CPU syscall-stack/current-thread
state. TLB shootdown, the first AP scheduler-owner handoff, temporary scheduler
ownership on CPUs 0-3, per-CPU WFQ runnable queues, bounded stealing, and
bounded idle-to-runnable wake targeting are implemented; shared scheduler lock
contention, temporary pinning replacement, scheduler-driven AP idle policy,
broader workload classes, higher-thread-count evidence, and shared
SharedParkSpace park key derivation remain later Stage 7 work. Shared
keys still need MemoryObject mapping provenance or object pins before they can
keep backing stable beyond one address-space-locked access.
Full-SMP Scalability Milestone
The current SMP evidence reaches four physical-core workers and one eight-logical-CPU SMT run under QEMU/KVM. That was enough to expose scheduler structure problems, but it is not the shape that should define whether capOS really uses modern multicore machines. The next SMP milestone should answer a more concrete question: can ordinary capOS workloads keep useful throughput and bounded scheduler overhead as the machine scales to 16 and 32 physical cores?
Preferred evidence environment:
- direct capOS boot on a dedicated bare-metal or cloud bare-metal/perf-runner machine with at least 16 physical cores, and a 32-core row when hardware is available;
- recorded CPU topology, SMT state, APIC mode, timer source, frequency policy, memory size, firmware/device model, source commit, toolchain, and kernel configuration;
- Linux native baselines on the same machine for comparable CPU workloads;
- QEMU/KVM rows only for boot/regression continuity or for explicitly labeled virtualized comparisons.
Workload coverage should move beyond one fixed checksum row:
- static map/reduce checksum over equal byte ranges;
- uneven dynamic task pool with deterministic task ids and result hash;
- barrier-heavy phase loop that exposes wakeup and cross-CPU coordination cost;
- same-process thread workload and independent-process workload;
- IPC/service-bound worker workload that includes capability calls outside the timed compute loop.
Each workload should report 1, 2, 4, 8, 16, and 32-worker rows when the hardware supports those counts, with SMT rows separated from physical-core rows. Each row should include both work-window time and total time, run count, warmup policy, median, variance, and verifier output. The report should show speedup and efficiency curves instead of reducing the result to one boolean threshold.
Implementation work expected before this milestone:
- replace the temporary scheduler CPU mask and static four-owner assumptions with discovered CPU topology and dynamic per-CPU scheduler structures;
- decide xAPIC versus x2APIC backend selection for larger APIC-id spaces;
- split or otherwise shrink the shared scheduler-lock critical sections that still serialize queue selection, wakeups, blocking, and cleanup;
- make placement topology-aware enough to distinguish physical cores, SMT siblings, and later NUMA/cache groups;
- keep TLB shootdown, timer, reschedule-IPI, cleanup, and accounting costs observable per CPU and per workload phase;
- keep per-thread ring ownership and SQ-consumer ownership generation-checked as CPU count rises.
This milestone belongs with scheduler evolution and benchmark planning rather than a new standalone proposal: the SMP proposal defines the CPU substrate, Scheduler Evolution Phase F.5 defines dispatch and policy work for full-SMP 16/32-core scalability, the benchmark proposal defines artifact shape, and the HPC parallel-pattern proposal defines the workload matrix. Q9 in the design risks register is the matching open-question entry: base CPU accounting and scheduling-context authority through Phase E are implemented, while cross-service donation, full nohz activation, CPU isolation leases, and cross-principal fairness are the named follow-ons that this milestone’s evidence will be evaluated against.
Current State
APs can boot into kernel idle loops, and CPUs 0-3 can temporarily own scheduler/user work when their LAPIC timers are available. Specific assumptions that Phase C must still remove:
| Component | File | Assumption |
|---|---|---|
| Syscall stack switching | kernel/src/arch/x86_64/syscall.rs, kernel/src/arch/x86_64/percpu.rs | Syscall entry/exit uses KernelGsBase/swapgs and GS-relative PerCpu stack fields on the running CPU |
| AP GDT, TSS, kernel stacks | kernel/src/arch/x86_64/gdt.rs, kernel/src/arch/x86_64/smp.rs | AP-local descriptor tables and stacks exist, and AP TSS.RSP0 updates during AP scheduler context switches |
| IDT | kernel/src/arch/x86_64/idt.rs | Single static IDT (shareable – IDT can be the same across CPUs) |
| SYSCALL MSRs | kernel/src/arch/x86_64/syscall.rs, kernel/src/arch/x86_64/smp.rs | STAR/LSTAR/SFMASK/EFER are initialized on BSP and parked APs; BSP and AP startup both publish KernelGsBase |
| Current thread and run queues | kernel/src/sched.rs, kernel/src/arch/x86_64/percpu.rs | SCHEDULER owns per-CPU current slots, per-CPU WFQ runnable queues ordered by virtual_finish_ns, bounded stealing from sibling queues, and wake placement through WakePolicy::QueueCpu; queued and direct-IPC wakeups iterate eligible idle scheduler CPUs and wake the first that accepts a fresh reschedule IPI, and CPUs 0-3 can temporarily own scheduler/user execution when their LAPIC timers are available, while shared-lock reduction, temporary pinning replacement, broader workload evidence, and higher-thread-count evidence remain deferred |
| Timer/IPI delivery | kernel/src/arch/x86_64/context.rs, kernel/src/arch/x86_64/lapic.rs, kernel/src/arch/x86_64/pic.rs, kernel/src/arch/x86_64/pit.rs, kernel/src/arch/x86_64/tlb.rs | CPUs 0-3 use PIT-calibrated LAPIC timer vector 48 with LAPIC EOI when online; vector 49 services TLB shootdown and bounded reschedule requests |
| Frame allocator | kernel/src/mem/frame.rs | Single global ALLOCATOR behind one spinlock |
| Heap allocator | kernel/src/mem/heap.rs | linked_list_allocator behind one spinlock |
The first checkpoint removed the separate syscall RSP globals and made the BSP
PerCpu layout the owner of syscall stack state. The GS checkpoint now uses
KernelGsBase/swapgs for those offsets on syscall paths. The LAPIC checkpoint
removed the PIT/PIC interrupt dependency from the normal BSP scheduler tick,
kept PIT channel 2 as the LAPIC calibration source, installed the spurious
vector, and wired the IPI vector. The TLB checkpoint added resident CPU masks,
vector-49 shootdown, pending generation counters, completion waits, and
syscall-entry plus flush-before-user-return hooks for delayed maskable interrupt
delivery. The AP scheduler-owner checkpoint added per-CPU current slots and AP
cpu=1 scheduler entry. The remaining Phase C assumptions are in concurrent
run-queue ownership and reschedule routing, not in syscall stack lookup, the
primary timer source, user page-table mutation invalidation, or AP TSS updates.
Phase A: Per-CPU Foundation
Establish per-CPU data structures on the BSP. No APs are started yet – this phase makes the BSP’s own code SMP-ready so Phase B is a clean addition.
Per-CPU Data Region
Each CPU needs a private data area accessible via the GS segment base. On
x86_64, swapgs switches between user-mode GS (usually zero) and
kernel-mode GS (pointing to per-CPU data). The kernel sets KernelGSBase
MSR on each CPU during init.
The BSP checkpoint originally reached this layout as BSP_PER_CPU+offset from
assembly. Phase C now uses the same offsets through GS after swapgs on
syscall entry.
#![allow(unused)]
fn main() {
/// Per-CPU data, one instance per processor.
/// Accessed via GS-relative addressing after swapgs.
#[repr(C)]
struct PerCpu {
/// Self-pointer for accessing the struct from GS:0.
self_ptr: *const PerCpu,
/// Kernel stack pointer for syscall entry (replaces SYSCALL_KERNEL_RSP).
kernel_rsp: u64,
/// Saved user RSP during syscall (replaces SYSCALL_USER_RSP).
user_rsp: u64,
/// Currently running thread on this CPU, if one is active.
current_thread: Option<ThreadRef>,
/// CPU index (0 = BSP).
cpu_id: u32,
/// LAPIC ID (from Limine MP info or CPUID).
lapic_id: u32,
}
}
The previous checkpointed syscall entry stub used the same offsets via the BSP symbol:
movq %rsp, BSP_PER_CPU+16(%rip) ; PerCpu.user_rsp
movq BSP_PER_CPU+8(%rip), %rsp ; PerCpu.kernel_rsp
The current syscall entry stub uses GS-relative addressing:
swapgs
movq %rsp, %gs:16 ; PerCpu.user_rsp
movq %gs:8, %rsp ; PerCpu.kernel_rsp
And symmetrically on return:
movq %gs:16, %rsp ; restore user RSP
swapgs
sysretq
Non-returning syscall paths need separate handling: exit, a blocking
cap_enter, and a terminal ThreadControl.exitThread can leave the syscall
entry path by building a CpuContext and restoring another thread with
iretq. Those paths must restore user GS ownership before iretq, even though
they never execute the normal sysretq epilogue.
Lock And Ownership Rules
PerCpu fields split by owner:
kernel_rspandTSS.RSP0are updated together throughpercpu::set_kernel_entry_stack.user_rspis written only by syscall entry assembly and read only while constructing a blocked-syscallCpuContext.current_threadmirrorsScheduler.current; the scheduler lock remains the authority for choosing and validating the current thread.cpu_idandlapic_idare immutable after CPU initialization.
Phase A keeps the global scheduler lock and process table. The PerCpu
current field is not a second scheduler authority; it is the per-CPU execution
cache that Phase B will use when multiple CPUs stop sharing one current
slot.
Per-CPU GDT, TSS, and Stacks
Each CPU needs its own:
- GDT – the TSS descriptor encodes a physical pointer to the CPU’s TSS, so each CPU needs a GDT with its own TSS entry. The segment layout (kernel CS/DS, user CS/DS) is identical across CPUs.
- TSS –
privilege_stack_table[0](kernel stack for interrupts from Ring 3) and IST entries (double-fault stack) must be per-CPU. - Kernel stack – each CPU needs its own stack for syscall/interrupt handling. Current size: 16 KB (4 pages). Same size per CPU.
- Double-fault stack – each CPU needs its own IST stack. Current size: 20 KB (5 pages).
#![allow(unused)]
fn main() {
/// Allocate and initialize per-CPU structures for one CPU.
fn init_per_cpu(cpu_id: u32, lapic_id: u32) -> &'static PerCpu {
// Allocate kernel stack (4 pages) and double-fault stack (5 pages)
let kernel_stack = alloc_stack(4);
let df_stack = alloc_stack(5);
// Create TSS with per-CPU stacks
let mut tss = TaskStateSegment::new();
tss.privilege_stack_table[0] = kernel_stack.top();
tss.interrupt_stack_table[DOUBLE_FAULT_IST_INDEX] = df_stack.top();
// Create GDT with this CPU's TSS
let (gdt, selectors) = create_gdt(&tss);
// Allocate and populate PerCpu struct
let per_cpu = Box::leak(Box::new(PerCpu {
self_ptr: core::ptr::null(), // filled below
kernel_rsp: kernel_stack.top().as_u64(),
user_rsp: 0,
current_thread: None,
cpu_id,
lapic_id,
}));
per_cpu.self_ptr = per_cpu as *const PerCpu;
per_cpu
}
}
LAPIC Initialization
Stage 5 uses the 8254 PIT (100 Hz) and 8259A PIC (IRQ0 → vector 32) for preemption on the BSP. AP startup must initialize enough local-APIC state for secondary CPUs to park in a kernel idle loop and for later IPIs. Migrating BSP preemption from PIT to LAPIC timer is still required before multi-CPU scheduling, since the PIT is a single shared device that cannot provide per-CPU timer interrupts. LAPIC work is needed for:
- Per-CPU timer – replace PIT with LAPIC timer (required for SMP)
- IPI – inter-processor interrupts for TLB shootdown and AP startup
- Spurious interrupt vector – must be configured per-CPU
2026-04-25 research decision: the immediate Phase C LAPIC/IPI foundation uses xAPIC MMIO, LAPIC timer vector 48, IPI vector 49, LAPIC EOI, AP LAPIC initialization, and PIT/PIC fallback. The grounding note x2APIC and APIC virtualization records the checked Intel and QEMU/KVM sources and keeps x2APIC as a later backend rather than a reason to rework the current LAPIC gate.
Crate Dependencies
| Crate | Purpose | no_std |
|---|---|---|
| manual xAPIC MMIO backend | current LAPIC timer, EOI, IPI, spurious vector foundation | yes |
future manual x2APIC MSR backend using x86_64 MSR access | newer/high-core systems and firmware states where xAPIC is unavailable or undesirable | yes |
The current LAPIC path uses xAPIC MMIO through the kernel MMIO mapper. The
later x2APIC backend should still be small and explicit rather than adding an
APIC abstraction crate: read the APIC ID, enable x2APIC through
IA32_APIC_BASE, program the spurious-vector register, local-vector timer,
timer divide/initial-count registers, EOI, and ICR sends through MSRs. I/O APIC
remains separate MMIO hardware discovered through ACPI MADT and belongs to the
later interrupt-infrastructure/cloud path.
Migration Path
Phase A was a refactor of existing single-CPU code, not an addition:
- Add
PerCpustruct, allocate one instance for BSP. Done for BSP static storage. - Set BSP’s
KernelGSBaseMSR, addswapgsto syscall entry/exit. Done for syscall entry/exit, including syscall-to-iretqexits. - Replace
SYSCALL_KERNEL_RSP/SYSCALL_USER_RSPglobals with per-CPU accesses. Done; syscall assembly uses GS-relativePerCpuoffsets. - Replace scheduler’s global
SCHEDULER.currentwithPerCpu.current_thread. Partially done: the BSP per-CPU record mirrorsScheduler.current; the scheduler lock remains authoritative for current-thread and queue ownership until shared scheduler metadata is split further. - Move GDT/TSS stack updates behind the per-CPU path. Done for the BSP runtime stack-update hook; AP-local GDT/TSS allocation belongs to Phase B.
- Migrate BSP from PIT to LAPIC timer (PIT initialized in Stage 5). Done for the BSP timer path, with PIT used for calibration and PIT/PIC retained as a fallback.
After Phase A, the kernel still runs user work on one CPU but the BSP per-CPU
plumbing is in place. Existing tests (make run-smoke and make run-spawn)
continue to pass.
Phase B: AP Startup
Bring Application Processors (APs) online. Each AP runs the same kernel code with its own per-CPU state.
2026-04-25 grounding checkpoint: the next implementation slice should use the
current local limine crate’s MP API, not the older SmpRequest naming used
in some protocol examples. In capOS’s pinned crate, limine::request::MpRequest
returns architecture-specific limine::mp::MpRespData; x86_64 CPU records are
limine::mp::MpInfo values with processor_id, lapic_id,
MpInfo::bootstrap(entry, extra_arg), and MpInfo::extra_argument(). The
Phase B implementation is split into two checkpoints: first enumerate CPUs,
assign dense capOS CPU ids separately from Limine’s ACPI processor_id, and
allocate AP state/stack slots; then bind each non-BSP CPU to a slot via
extra_arg, start it with bootstrap, and park it in a kernel idle loop after
local CPU initialization. Both checkpoints are implemented; APs still must not
run userspace or mutate the global scheduler.
Limine MP Request
Limine provides an MP response with per-CPU records. Each x86_64 record
contains an ACPI processor id, LAPIC ID, and an atomic boot handoff. In the
local limine crate, callers should use MpInfo::bootstrap() rather than
writing the raw goto_addr field directly.
#![allow(unused)]
fn main() {
use limine::request::MpRequest;
static MP_REQUEST: MpRequest = MpRequest::new(0);
fn start_aps() {
let mp = MP_REQUEST.response().expect("no MP response");
let mut next_cpu_id = 1;
for cpu in mp.cpus() {
if cpu.lapic_id == mp.bsp_lapic_id {
continue; // skip BSP
}
let cpu_id = next_cpu_id;
next_cpu_id += 1;
record_boot_processor_id(cpu_id, cpu.processor_id);
let ap = init_ap_record(cpu_id, cpu.processor_id, cpu.lapic_id);
cpu.bootstrap(ap_entry, ap as *const ApCpu as u64);
}
}
}
AP Entry
Each AP must:
- Switch to the capOS kernel PML4 and AP-owned kernel stack
- Enable per-CPU CR4 state used by the kernel page tables and user-access guards
- Load its per-CPU GDT and TSS
- Load the shared IDT
- Set
KernelGSBaseMSR to itsPerCpupointer - Configure SYSCALL MSRs (STAR, LSTAR, SFMASK, EFER.SCE)
- Signal “ready” to BSP (atomic flag or counter)
- Enter a parked kernel idle loop
Local APIC timer setup and IPI handling remain separate Stage 7 gates; parked APs keep interrupts disabled until that work is ready.
#![allow(unused)]
fn main() {
/// AP entry point. Called by Limine with the MP info pointer.
unsafe extern "C" fn ap_entry(info: &limine::mp::MpInfo) -> ! {
let ap_ptr = info.extra_argument() as *const ApCpu;
let ap = unsafe {
ap_ptr
.as_ref()
.expect("Limine AP extra_argument must be an ApCpu pointer")
};
let per_cpu = ap.per_cpu();
// Switch from Limine state to capOS-owned paging and AP stack.
ap.switch_to_kernel_paging_and_stack();
// Match per-CPU CR4 state after the kernel PML4 is live.
paging::enable_global_pages_on_current_cpu();
smap::init();
// Load this CPU's GDT + TSS
ap.descriptors.load();
// Shared IDT (same across all CPUs)
idt::init();
// Set GS base for swapgs
unsafe { wrmsr(IA32_KERNEL_GS_BASE, per_cpu as *const _ as u64); }
// Configure syscall MSRs (same values as BSP)
syscall::init_msrs();
// Signal ready
ap.online.store(true, Ordering::Release);
AP_READY_COUNT.fetch_add(1, Ordering::AcqRel);
// Park until a later scheduler milestone gives APs runnable work.
ap_idle_loop();
}
}
The extra_argument pointer must name an initialized, non-null ApCpu record
whose storage outlives the AP. The BSP publishes that record before calling
MpInfo::bootstrap(), and the AP treats the contained PerCpu pointer as
CPU-local state after entry.
Scheduler Boundary
Phase B does not extend the Stage 5 scheduler. The BSP remains the only CPU
that runs userspace or mutates the global scheduler. APs only run enough kernel
initialization to prove that per-CPU architectural state is valid, signal ready,
and park in a bounded hlt loop.
Per-CPU WFQ runnable queues under the shared scheduler lock, bounded stealing
that chooses the most-overdue runnable sibling candidate, bounded
idle-to-runnable wake targeting that walks eligible idle scheduler CPUs, and
address-space CPU residency tracking are the current Phase C structure. The
temporary 2026-05-02 single-global-runnable-queue collapse is historical;
Scheduler Evolution Phase D (closed 2026-05-10) reintroduced per-CPU queues
with weighted fair ordering, and Phase E closed SchedulingContext
bind/revoke, budget, donation/return, and depletion notification on top of
that. Phase F has landed the one-SQ-consumer prerequisite, nohz telemetry,
housekeeping/deferred-work placement, the bounded SQPOLL ring mode, the
clockevent/deadline substrate, and bounded non-periodic SQPOLL producer-wake
progress, the first automatic nohz activation increment closed via
docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md,
and SQPOLL-driven auto-nohz activation closed via
docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md;
timeout-based auto-revoke and ordinary-thread generic full-nohz admission are
also landed. Generic SQPOLL nohz for arbitrary rings and policy-service AutoNoHz
issuance remain future work.
CPU affinity policy, shared scheduler metadata splitting, scheduler-driven AP
idle policy, broader workload classes, higher-thread-count evidence, and the
named Phase F.5 16/32-core scalability proof remain Phase C/F follow-ups. The
first Phase C scheduler proof may continue to use the current process ring
while the runtime serializes ring consumption.
Full SMP where sibling threads from one process wait independently on different
CPUs should use the Ring v2 direction in
Ring v2 For Full SMP: cap_enter waits on the
current thread’s CQ, not on a shared process CQ.
Boot Sequence
BSP: kernel init (GDT, IDT, memory, heap, caps, scheduler)
BSP: init_per_cpu(0, bsp_lapic_id)
BSP: start_aps()
AP1: ap_entry() → switch CR3/RSP → init GDT/TSS/syscall state → idle_loop()
AP2: ap_entry() → switch CR3/RSP → init GDT/TSS/syscall state → idle_loop()
...
BSP: wait for all APs ready
BSP: load init process, schedule it
BSP: enter scheduler
Phase C: SMP Correctness
With APs parked in kernel idle loops, Phase C makes user scheduling safe on more than one CPU. The order is:
- Move syscall entry/exit and per-CPU access to
KernelGsBase/swapgsso APs do not use BSP-symbol-relative syscall stack fields. This includes non-sysretqpaths that block or exit through scheduleriretqrestore. Done for syscall stack fields and syscall-originated restore paths. - Add LAPIC timer and IPI support so each CPU can take local scheduler ticks and receive cross-CPU requests. Done for PIT-calibrated BSP LAPIC ticks, parked-AP LAPIC initialization, spurious-vector handling, vector 49, a bounded vector-49-only fixed IPI send primitive, live TLB shootdown users, and bounded idle-to-runnable reschedule requests.
- Add TLB shootdown before any user address space can run on more than one CPU over its lifetime. Done for user page-table map/unmap/protect through resident CPU masks, vector-49 shootdown, pending full-TLB flush generations, completion waits, and syscall-entry/flush-before-user-return hooks. Remote AP targets become active when AP scheduler ownership records AP residency.
- Split scheduler current/run-queue ownership into per-CPU state, with a reviewed AP idle-to-runnable handoff. Done for per-CPU current-thread slots, the first AP cpu=1 scheduler owner handoff, temporary scheduler ownership on CPUs 0-3, per-CPU WFQ runnable queues, bounded stealing, and bounded idle-to-runnable wake targeting; shared scheduler lock reduction, temporary pinning replacement, broader workload evidence, and higher-thread-count evidence remain deferred.
- Prove the existing manifest/ring/thread/park smokes under
-smp 2.
With multiple CPUs running scheduler-owned work, shared mutable state needs careful handling.
TLB Shootdown
When the kernel modifies page tables that other CPUs may have cached in their TLBs, it must send an IPI to those CPUs to invalidate the affected entries.
Scenarios requiring shootdown:
- Process exit – unmapping user pages. Only the CPU running the process has the mapping cached, but if the process migrated recently, stale TLB entries may exist on the old CPU.
- Shared kernel mappings – changes to the kernel half of page tables (e.g., heap growth, MMIO mappings) require all-CPU shootdown.
- Capability-granted shared memory – if future stages allow shared memory regions between processes, modifications require targeted shootdown.
Current code uses local mapper flushes in AddressSpace::map,
AddressSpace::unmap, and AddressSpace::protect, then calls the serialized
shootdown helper with the address space’s resident CPU mask. Those methods are
reached from VirtualMemoryCap’s parse_map, parse_unmap, and
parse_protect anonymous mapping paths and
MemoryObjectCap::{map,unmap,protect} borrowed mapping paths. Scheduler CR3
handoff marks the selected address space resident on the current CPU, including
AP cpu=1 during the first AP scheduler-owner proof.
Implementation state consists of vector 49, a resident CPU target mask, and
per-CPU pending full-TLB flush generations. The first implementation records
pending flush generations for online resident CPUs other than the caller, after
the local page-table edit and local flush complete, then sends vector-49 IPIs to
prompt immediate drain and returns a completion token. VM capability handlers
enqueue completion work after dropping the address-space guard, and cap_enter
or timer polling drains the queue after ring dispatch releases cap-table and
scratch locks. Handlers reserve fixed-size queue slots before page-table
mutation, so overload is reported before rollback, unmap, or protect can mutate
state. Drains flush the current CPU before waiting, so a CPU that is itself in
the target mask cannot wait on its own pending generation. A target CPU that is
already in a syscall and contending on those
same locks can eventually reach the IPI or return-path drain. If a target CPU
has maskable interrupts delayed while it runs a kernel path, it still drains its
pending generation at syscall entry or before returning to userspace from
syscall, timer, or scheduler restore paths.
#![allow(unused)]
fn main() {
fn shootdown_page(resident_cpu_mask: u64) {
let targets = resident_cpu_mask & online_cpu_mask() & !current_cpu_bit();
let generation = next_shootdown_generation();
for cpu_id in targets {
PENDING_FLUSH_GENERATION[cpu_id].store(generation, Ordering::Release);
lapic::send_fixed_ipi(lapic_id_for_cpu(cpu_id));
}
ShootdownCompletion { targets, generation }
}
fn flush_pending_for_current_cpu() {
while pending_generation(current_cpu_id()) != flushed_generation(current_cpu_id()) {
let generation = pending_generation(current_cpu_id());
x86_64::instructions::tlb::flush_all();
FLUSHED_GENERATION[current_cpu_id()].store(generation, Ordering::Release);
}
}
}
The first implementation targets the address space’s resident CPU mask rather than every online CPU so parked APs with interrupts disabled are not disturbed. It relies on kernel user-buffer access continuing through address-space-locked HHDM copy/read helpers rather than raw user virtual addresses while a delayed flush generation exists. Broader range and page-level coalescing can be added after AP scheduling exists.
LAPIC/IPI Boundary
The normal timer path is now local-APIC-backed: vector 48 handles scheduler ticks with LAPIC EOI after PIT-channel-2 calibration, vector 49 handles TLB shootdown and bounded idle-to-runnable reschedule requests, vector 255 handles LAPIC spurious interrupts without EOI, and vector 32 remains only for the PIT/PIC fallback. AP scheduler owners program their LAPIC timers from the BSP calibration before entering the scheduler-owner loop; if AP timer setup is unavailable, the BSP keeps scheduler ownership. The remaining LAPIC/IPI work is broader scheduler-driven AP idle policy, future preemptive reschedule policy, and a later x2APIC MSR backend after the architectural xAPIC MMIO path is correct, not the bounded idle-to-runnable wake request path.
The TLB shootdown IPI handler must not allocate and must not take locks that can be held while sending a shootdown. Completion waits must happen after dropping the mutated address space’s lock and ring dispatch’s cap-table/scratch locks. The deferred completion queue must remain bounded, non-allocating at enqueue, and reserved before page-table mutation. Syscall-entry and user-return paths must drain pending flush generations so delayed maskable IPI delivery cannot leave a target CPU unable to observe completion or resume a thread with stale TLB state.
KVM paravirtual features such as kvm-pv-eoi, kvm-pv-ipi, and
kvm-pv-tlb-flush are future performance work. They must not be required for
the first LAPIC timer, IPI, or TLB-shootdown correctness proofs.
Lock Audit
Existing spinlocks need review for SMP safety:
| Lock | Current Use | SMP Concern |
|---|---|---|
SERIAL | COM1 output | Safe but high contention if many CPUs print. Acceptable for debug output. |
ALLOCATOR | Frame bitmap | Hot path. Holding lock during full bitmap scan is O(n). Consider per-CPU free lists. |
KERNEL_CAPS | Kernel cap table | Low contention (init only). Safe. |
SCHEDULER.current | Single global running-thread slot | Split into PerCpu.current_thread in Phase A. |
Before APs can run userspace, the scheduler also needs an explicit CPU residency record for each live thread or address space. That record drives TLB shootdown targeting and prevents migration from racing page-table changes. Process exit and thread exit must clear residency before freeing stacks, address spaces, or ring state that another CPU might still observe.
Interrupt + spinlock deadlock: if CPU A holds a spinlock and takes an
interrupt whose handler tries to acquire the same lock, deadlock. This is
already noted in REVIEW.md. Fix: disable interrupts while holding locks
that interrupt handlers may need (frame allocator, serial). The spin crate
supports MutexIrq for this pattern, or use manual cli/sti wrappers.
Allocator Scaling
The frame allocator is behind a single spinlock with O(n) bitmap scan. Under SMP, this becomes a contention bottleneck.
Options (in order of complexity):
- Per-CPU free list cache – each CPU maintains a small cache of free frames (e.g., 64 frames). Refill from the global allocator when empty, return batch when full. Reduces lock acquisitions by ~64x.
- Region partitioning – divide physical memory into per-CPU regions. Each CPU owns a bitmap partition. Cross-CPU allocation falls back to a global lock. More complex, better NUMA behavior (future).
Option 1 is recommended for initial SMP. ~50-100 lines.
The heap allocator (linked_list_allocator) is also behind a single lock.
For a research OS this is acceptable initially – heap allocations in the
kernel should be infrequent compared to frame allocations.
Cap’n Proto Schema Additions
SMP introduces a kernel-internal CpuManager capability for inspecting and
controlling CPU state. This is not exposed to userspace initially but follows
the “everything is a capability” principle.
interface CpuManager {
# Number of online CPUs.
cpuCount @0 () -> (count :UInt32);
# Per-CPU info.
cpuInfo @1 (cpuId :UInt32) -> (lapicId :UInt32, online :Bool);
}
This capability would be held by init (or a system monitor process) for diagnostics. It’s additive and can be deferred until the mechanism is useful.
Estimated Scope
| Phase | New/Changed Code | Depends On |
|---|---|---|
| Phase A: BSP per-CPU foundation | Done (BSP PerCpu, syscall-stack storage, scheduler mirror, stack-update hook) | Stage 5 |
| Phase B: AP startup | Done (MpRequest, AP records/stacks, AP CR3/RSP handoff, parked idle) | Phase A |
| Phase C: Multi-CPU scheduling | In progress (GS/swapgs migration, LAPIC timer/IPI with EOI, shootdown-aware VM mutation wrappers, pending TLB generation completion, per-CPU current slots, temporary scheduler ownership on CPUs 0-3, per-CPU WFQ runnable queues, bounded stealing, and bounded idle-to-runnable wake targeting are implemented; shared scheduler lock reduction, temporary pinning replacement, scheduler-driven AP idle policy, broader workload evidence, and higher-thread-count evidence remain open) | Phase B |
| Ring v2 for full SMP | TBD (per-thread rings, completion routing, SQPOLL ownership) | Phase C plus threading/park |
| Total | TBD after Phase C hardware/scheduler audit |
Milestones
- M1: Per-CPU data on BSP – BSP
PerCpusyscall-stack/current-thread state, BSP per-CPU kernel-entry stack hook, and single-CPU QEMU proofs. Done. - M2: APs running – secondary CPUs reach
idle_loop(). BSP prints “N CPUs online”.make runstill runs init on BSP. Done. - M3: TLB shootdown – page table modifications are safe across CPUs. Process exit on one CPU doesn’t leave stale mappings on others. Done for address-space resident masks and AP cpu=1 residency marking.
- M4: Multi-CPU scheduling – processes can run on any CPU. The existing
boot-manifest service set still works, but the scheduler distributes work
across CPUs once runnable processes are available (runtime spawning still
depends on
ProcessSpawner). Temporary scheduler ownership on CPUs 0-3, per-CPU WFQ runnable queues, bounded stealing, and bounded idle-to-runnable wake targeting are implemented; shared scheduler lock reduction, temporary pinning replacement, scheduler-driven AP idle policy, broader workload evidence, and higher-thread-count evidence remain open. - M5: Ring v2 completion ownership – every live thread can own a ring
endpoint; endpoint, timer, park, process-wait, and thread-join completions
route by
ThreadRef. This is the target for full SMP where sibling threads in one process wait independently on different CPUs.
Open Questions
-
x2APIC backend. Phase C currently has an xAPIC MMIO LAPIC foundation. A later x2APIC MSR backend is still needed for newer/high-core systems and firmware states where xAPIC is unavailable or locked out; it should not block TLB shootdown on the current implementation path.
-
Idle strategy.
hltis the simplest idle.mwaitis more power-efficient and can be used to wake on memory writes. Overkill for QEMU, but worth noting for future hardware targets. -
CPU hotplug. Limine starts all CPUs at boot. Runtime CPU online/offline is a future concern, not needed initially.
-
NUMA awareness. Multi-socket systems have non-uniform memory access. Per-CPU frame allocator regions could be NUMA-aware. Deferred – QEMU emulates flat memory by default.
-
Scheduler policy. The current multi-CPU scheduler uses per-CPU WFQ runnable queues ordered by
virtual_finish_nsunder the shared scheduler lock, with bounded stealing from sibling queues when a CPU has no local runnable entry. Scheduler Evolution Phase D (per-CPU WFQ and bounded stealing, closed 2026-05-10) and Phase E (SchedulingContextbind/revoke, budget, donation/return, depletion notification) are closed against this substrate; Phase F has landed the one-SQ-consumer prerequisite, nohz telemetry, housekeeping/deferred-work placement, the bounded SQPOLL ring mode, the clockevent/deadline substrate, and bounded non-periodic SQPOLL producer-wake progress; the first automatic nohz activation increment and SQPOLL-driven auto-nohz activation are both closed (seedocs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.mdanddocs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md). The older round-robin/global-overflow starting point is historical, not the current baseline. Future refinements are shared-lock reduction, temporary pinning replacement, stronger CPU-affinity/admission policy, broader workload-class evidence, higher-thread-count evidence, and the Phase F.5 full-SMP 16/32-core scalability proof.
References
Specifications
- Intel SDM Vol. 3, Chapter 8 – Multiple-Processor Management (AP startup, APIC, IPI)
- Intel SDM Vol. 3, Chapter 10 – APIC (Local APIC, I/O APIC, x2APIC)
- xAPIC Deprecation Plan – Intel guidance on x2APIC defaults, legacy xAPIC deprecation, and guest virtualization
- CPUID Enumeration and Architectural MSRs – x2APIC MSR range and xAPIC disable/lock behavior
- OSDev Wiki: SMP
- OSDev Wiki: APIC
Limine
- Limine SMP Feature – MP request/response API, AP startup mechanism
Virtualization
- QEMU / KVM CPU model configuration – CPU feature exposure, host passthrough, and named-model configuration
- QEMU Paravirtualized KVM features – optional KVM PV EOI, IPI, TLB-flush, and extended destination-id features
- Linux KVM API – VMM-side LAPIC/x2APIC state handling
Prior Art
- Redox SMP – per-CPU contexts, LAPIC timer, IPI-based TLB shootdown
- xv6-riscv SMP – minimal multi-core OS, clean per-CPU implementation
- Hermit SMP – Rust unikernel with SMP support via per-core data and APIC
- BlogOS – educational x86_64 Rust OS (single-CPU, but good APIC coverage)
Proposal: Ring v2 For Full SMP
How capOS should evolve the capability ring once multiple threads from one process can run concurrently on multiple CPUs.
The current ring design is intentionally process-wide: one ring page per
process, one SQ, one CQ, and one blocked cap_enter waiter admitted per
process. That was the right first threading milestone because it preserved the
existing transport while moving scheduler identity from process ids to
generation-checked ThreadRef values.
That design can support an initial multi-CPU scheduler proof if the runtime continues to serialize process-ring consumption. It should not be the endpoint for full SMP where sibling threads from one process run and wait on different CPUs. A single process CQ forces those sibling threads to coordinate completion consumption in userspace and keeps the kernel from knowing which thread should block for which CQ stream. The full-SMP target is per-thread ring ownership.
Design Grounding
The local research files checked before this design were:
docs/research/completion-ring-threading.md;docs/research/out-of-kernel-scheduling.md;docs/research/llvm-target.md;docs/research/sel4.md;docs/research/zircon.md.
The relevant result is that efficient shared rings want clear producer/consumer
ownership. Linux io_uring uses user_data to identify requests, but its
aggregate wait model does not by itself solve multiple user consumers waiting
on one raw CQ. Futexes provide the right user-runtime parking primitive for
compatibility demux. Windows IOCP is a shared completion packet queue model,
which is useful as a runtime abstraction but should not be confused with
letting several kernel-blocked threads wait on the same circular CQ storage.
Target Model
Each live process thread owns one capability ring endpoint. A ring endpoint is a complete SQ/CQ pair with one userspace-visible identity; it may be mapped as one page per thread or as a lane in a larger ring bundle, but a lane is not just a CQ attached to a shared process SQ.
Each endpoint has:
- one userspace SQ/CQ pair;
- one kernel
RingScratchor equivalent dispatch scratch owned by that thread or by the ring endpoint; - one blocked
cap_enterwaiter for that thread’s CQ; - one ring address passed to the thread at startup.
The process remains the authority boundary. Address space, cap table, CapSet, and resource accounting stay process-owned. Result-cap transfers still install capabilities into the process cap table. Per-thread rings only split transport progress and completion ownership.
cap_enter(min_complete, timeout_ns) keeps its current syscall shape, but the
meaning becomes:
Process pending SQEs for the current thread’s ring, then block the current thread until at least
min_completeCQEs are available on that same thread’s CQ, or until timeout.
Userspace still matches individual requests by user_data within the current
thread’s CQ. The kernel does not add slot-specific waits; CQ slots are storage,
not durable request identities.
Thread Creation And Bootstrap
The initial thread may keep the legacy fixed RING_VADDR mapping during the
transition. Additional threads need unique ring mappings because all threads
share one address space.
The initial accepted contract is kernel-chosen ring mapping. ThreadSpawner
does not accept a caller-supplied ring address for the first Ring v2 slice.
The kernel allocates a ring record, maps that ring at a collision-free user
virtual address in the caller’s address space, charges it to the process
ledger, stores the address on the child ThreadRef, and passes the address in
the child start registers. If no ring mapping or record can be allocated, thread
creation fails before the child thread becomes runnable and rolls back all
thread and ring reservations.
Runtime-supplied ring address ranges remain a later extension. They need
reviewed VirtualMemory reservation semantics so the runtime can reserve a
ring arena without racing normal user mappings. Until that extension lands,
Ring v2 implementation branches must not add a ThreadSpawner.create parameter
for a caller-selected ring address.
The child thread entry contract should continue to pass bootstrap register values equivalent to:
RDI = arg;RSI = tid;RDX = pid;RCX = thread_ring_addr;R8 = CAPSET_VADDR, or zero if absent.
For the initial process thread, _start keeps receiving the ring address from
the loader ABI. Once every userspace binary uses the runtime-provided ring
address instead of assuming RING_VADDR, the fixed mapping can become a
bootstrap-only compatibility detail.
When Ring v2 introduces versioned SQE/CQE layouts, the register-level ring address handoff becomes one field of the negotiated runtime boot record:
#![allow(unused)]
fn main() {
struct RuntimeBootInfo {
ring_addr: u64,
ring_abi_version: u32,
sqe_size: u16,
cqe_size: u16,
}
}
RuntimeBootInfo, ring ABI version constants, and fixed SQE/CQE layouts live in
capos-config/src/ring.rs. Kernel code and capos-rt must import the shared
definition instead of maintaining parallel boot-ABI structs.
The first implementation may continue using the existing fixed SQE/CQE layout
and RING_VADDR for the initial thread. It still needs a shared ring-endpoint
descriptor in the kernel so initial-thread and child-thread rings use the same
lifetime, waiter, and completion-routing rules. The fixed initial mapping is a
compatibility special case, not a separate process-wide ring once Ring v2 is
enabled for a process.
The Tickless/Realtime proposal owns the first CapSqeV2 use case
(deadline_ns, qos_flags, and sched_ctx_id), but Ring v2 owns the transport
rule: every thread ring handoff must carry or imply the same ABI version and
entry sizes that cap_enter validates. A runtime must not infer CapSqeV2
from the address alone.
Completion Routing
Any kernel record that can later post a CQE must store a target ThreadRef and
post to that thread’s ring after generation validation:
- ordinary CALL completions target the submitting thread;
- endpoint RECV completions target the receiver thread;
- endpoint RETURN completions target the original caller thread;
Timer.sleepcompletions target the sleeping thread;ProcessHandle.waitcompletions target the waiting thread;ThreadHandle.joincompletions target the joining thread;- ParkSpace wait wake/timeout completions target the waiting thread;
- deferred endpoint cancellation completions target the thread that posted the cancelable operation.
Process exit cancels every ring owned by the process. Thread exit cancels that
thread’s own ring operations and wakes/drops waiters that name its ThreadRef.
If a thread exits with outstanding operations that can still complete, the
kernel must either cancel them before releasing the ring or hold the ring
record until all generation-checked completion paths drain.
Normative lifetime invariant: a ring record cannot be freed while any CPU, waiter, endpoint call, timer waiter, park waiter, cancellation path, deferred completion path, or SQPOLL worker can still post to it. Thread exit either cancels every such record first or keeps the ring record alive until all generation-checked completion paths have drained.
The implementation contract for completion routing is:
- scheduler state resolves
ThreadRef -> RingEndpointimmediately before posting a CQE; - a missing process, stale process generation, missing thread, stale thread generation, or closed ring endpoint turns the completion into a stale completion and must not write userspace memory;
- a ring endpoint stays pinned while a completion writer owns its reference;
- result-cap installation still targets the shared process cap table, but the CQE that names the installed result-cap slot is written only to the target thread’s CQ;
cap_enterdrains and waits on the current thread’s ring only; it never drains a sibling thread’s SQ and never waits on a process-wide CQ;- same-process thread scaling remains unclaimable until endpoint, timer, park,
process-wait, thread-join, deferred-cancel, and direct IPC completion paths
all follow this
ThreadRef -> RingEndpointrule.
SQPOLL And Kernel Consumers
Each thread ring must have exactly one kernel SQ consumer at a time:
- syscall mode: the owner thread’s
cap_enterdrains its own SQ; - SQPOLL mode: a kernel worker drains that ring’s SQ, and
cap_enterwaits for CQ availability and returns counts. Userspace remains the CQ consumer.
The Phase F prerequisite now makes this an explicit kernel-side lease for the
current per-thread ring endpoints. Syscall-mode dispatch has a
generation-checked owner covering both caller-driven cap_enter and bounded
timer-side current-thread ring service; a stale owner cannot advance SQ head,
and a duplicate future SQPOLL owner is rejected while the syscall owner is
live. This does not enable SQPOLL mode, nohz, or CPU isolation.
Mode changes require quiescing the ring so cap_enter and SQPOLL do not both
consume the same SQ. SQPOLL workers should be bound through scheduler policy or
future CPU grants after APs run kernel idle loops and per-CPU scheduling exists.
Timer interrupt polling may continue to process bounded interrupt-safe work for the current thread’s ring in syscall mode, but it must not become a second SQ consumer for an SQPOLL-owned ring.
Full-nohz for SQPOLL is a later CPU-isolation contract, not part of initial Ring v2. A poller CPU may suppress the periodic scheduler tick only when a housekeeping CPU remains online, the SQPOLL worker is the only runnable entity on that CPU, no timer-side SQ polling or transitional network scheduler polling is pinned there, and CPU accounting is boundary/counter driven rather than tick-driven. Phase F now reports explicit housekeeping/deferred-work placement or rejection for those prerequisites while keeping syscall-mode SQ ownership, periodic ticks, and SQPOLL disabled. The broader staging is in Tickless and Realtime Scheduling.
Scheduler And SMP Requirements
Per-thread rings are not sufficient for full SMP by themselves. Multi-CPU userspace scheduling also requires:
- per-CPU current-thread state as the scheduler authority, not only a BSP mirror;
- per-CPU run queues plus a migration/work-stealing protocol;
- a current-CPU field for runnable/running threads plus an address-space active-CPU mask, or equivalent target set, for TLB shootdown;
- TLB shootdown before a thread can migrate or two threads in one address space can run on different CPUs while mappings change;
- cap-table locking or finer object locks that tolerate concurrent calls from sibling threads;
- address-space locking rules for concurrent
VirtualMemoryoperations, process exit, and user-buffer copy paths; - process and thread ring cleanup that cannot free a ring while another CPU is posting a completion to it.
The first Phase C multi-CPU scheduler smoke may keep the current process ring if the runtime still serializes process-ring consumption. A later full-SMP smoke that runs sibling threads from one process concurrently on different CPUs should wait for per-thread ring completion routing and TLB shootdown review.
Compatibility Bridge
Before Ring v2, capos-rt can support multithreaded programs on the current
process ring with a runtime reactor:
- one runtime-owned waiter drains the process CQ;
- ordinary client threads block on runtime wait records using ParkSpace;
- the reactor matches CQEs by
user_dataand unparks the waiting thread.
This is a bridge, not the final SMP ABI. It is useful for validating runtime logic and higher-level language support before kernel per-thread rings land.
Rejected Direction: Slot-Specific cap_enter
Do not extend cap_enter to wait for raw CQ slots. Slots are circular-buffer
storage and can be reused after cq_head advances. A correct specific-wait
design would need stable request ids or completion tokens, at which point
per-thread ring endpoints solve the same ownership problem with less
special-case kernel state.
Roadmap
- Runtime reactor bridge on the current process ring.
- Add the shared
RingEndpointkernel record and make the initial fixed bootstrap ring use it without changing userspace behavior. - Move ring allocation/accounting from process-only state to thread-owned ring records.
ThreadSpawner.createallocates/maps a kernel-chosen per-thread ring and passes its user address to the child.- Scheduler waiters and endpoint/timer/park/process/thread completion paths
post by target
ThreadRefto that thread’s ring. cap_enteroperates on the current thread’s ring; remove the one-process-ring waiter rule.- Add SQPOLL mode only after per-CPU scheduler state exists.
- Add SQPOLL nohz only after CPU isolation leases, housekeeping placement, non-tick CPU accounting, and network polling placement are reviewed.
- Run full-SMP sibling-thread workloads that wait independently on different CPUs only after per-thread ring routing, TLB shootdown, and cross-CPU cleanup rules are reviewed.
Proposal: Scheduler Evolution
capOS should evolve its scheduler in layers. The goal is not one clever algorithm; it is a capability-shaped CPU subsystem that scales ordinary work, admits realtime islands, allows service/runtime-specific policy, and preserves a small auditable kernel dispatch path.
This proposal complements, rather than replaces, Tickless and Realtime Scheduling. That proposal owns timer/tickless/SQPOLL-nohz details. This proposal owns the broader scheduler architecture and roadmap.
Design Grounding
Local grounding:
- Scheduling
- In-Process Threading Contract
- Design Risks Register, Q9 – CPU accounting and scheduling contexts
- SMP Phase C
- SMP
- Ring v2 For Full SMP
- Tickless and Realtime Scheduling
- Stateful Task and Job Graphs
- Future Scheduler Architecture
- NO_HZ, SQPOLL, and Realtime Scheduling
- Out-of-kernel scheduling
- Completion rings and threaded runtimes
- Multimedia pipeline latency
- Robotics realtime control
Goals
- Keep protected dispatch, budget enforcement, interrupt handling, and idle in the kernel.
- Replace the single global runnable queue with per-CPU runnable ownership and bounded cross-CPU wake/migration.
- Add CPU accounting before adopting policy that depends on runtime charge.
- Make ordinary best-effort scheduling fair by virtual time, with EEVDF-like virtual-deadline scheduling as the target after accounting exists.
- Represent admitted CPU time as
SchedulingContextcapability authority. - Represent isolated CPU ownership as
CpuIsolationLeaseauthority. - Support user-space scheduler policy services for admission and tuning without putting user-space calls on every dispatch path.
- Provide enough telemetry to distinguish scheduler cost, serial/MMIO logging, TLB/CR3 effects, QEMU/KVM artifacts, and workload contention.
Full-SMP Scalability Focus
The scheduler work after the current Phase F chain should be judged by whether capOS can keep useful throughput and bounded scheduling overhead on 16/32-core machines, not by another small QEMU-only speedup row. The SMP proposal owns CPU bring-up and APIC/TLB substrate; this proposal owns the scheduler changes needed to make that substrate useful at higher core counts.
The scheduler side of the milestone should include:
- dynamic scheduler CPU sets derived from discovered topology instead of the temporary four-owner mask;
- per-CPU run queues and current-thread state that do not require one shared lock for ordinary local pick/requeue paths;
- narrower shared metadata locks for process/thread lookup, blocking waiters, exit cleanup, direct IPC handoff, and timer/deadline waiters;
- bounded cross-CPU wakeup and migration that records target, source, steal, reschedule-IPI, and failed-placement counters;
- topology-aware placement that separates physical cores, SMT siblings, and later NUMA/cache groups;
- total-time accounting for spawn/join/exit and service-bound workloads, not only syscall-free work windows;
- hardware-run artifacts that include native Linux baselines on the same machine and QEMU rows only as regression or virtualization context.
The benchmark shape should include static map/reduce, uneven dynamic tasks, barrier-heavy phase loops, independent processes, same-process threads, and a capability-call/service-bound workload. That matrix is intentionally broader than the old thread-scale checksum row because high core counts often expose lock convoying, wakeup storms, timer/IPI cost, TLB-shootdown scaling, and runtime lifecycle overhead before pure compute saturates.
Non-Goals
- Do not import Linux CFS/EEVDF, FreeBSD ULE, or sched_ext as code.
- Do not expose arbitrary user-supplied scheduler programs in the kernel in the near term.
- Do not make a user-space process the mandatory next-thread dispatcher.
- Do not claim hard realtime until admission, budget enforcement, IRQ/device behavior, kernel-path latency, and WCET evidence exist.
- Do not make nohz/full-nohz a thread flag. It is a CPU lease plus scheduler proof.
Architecture
The target scheduler has four layers:
- Kernel mechanism: per-CPU run queues, current-thread state, idle, context switch, cross-CPU wake/migration, timer/IPI handling, CPU accounting, budget enforcement, and timeout/depletion faults.
- Kernel policy primitives: best-effort weights, virtual deadlines, scheduling contexts, CPU masks, isolation leases, direct IPC donation, and realtime-island hooks.
- Privileged scheduler policy service: admission, budget/profile selection, CPU partitioning, isolation grants, service/runtime hints, policy reload, and operator diagnostics.
- Application/runtime schedulers: work stealing, actors, async reactors, language M:N schedulers, request queues, and service-local priority and batching.
The hot path remains local and bounded: timer interrupt or wakeup, charge runtime, update runnable state, pick from a per-CPU queue or a bounded steal path, switch context. User-space policy participates at slower boundaries: profile changes, thread/process creation, budget depletion, realtime admission, lease grant/revoke, or explicit operator policy updates.
Stateful task/job graph coordinators sit above these layers. They may own
graph node queues, leases, retry state, cancellation, and assignment metadata,
but they do not own CPU dispatch. A graph node’s priority, deadline,
budget, or queue field is workload policy until a capability-authorized
scheduler policy service maps it to a weight, scheduling context, CPU lease,
or request deadline.
Stage 0: Evidence Before Policy
Before changing the default policy, the active thread-scale attribution work must keep policy conclusions separated from benchmark artifacts. Current mainline evidence now includes:
- scheduler candidate/outcome, reschedule-IPI, serial-byte, scheduler-lock,
timer interrupt, and CR3/TLB counters behind
CAPOS_THREAD_SCALE_GUEST_MEASURE=1; - raw guest-PC samples for user-mode timer preemption points;
- logging-suppression A/B evidence through
CAPOS_THREAD_SCALE_SUPPRESS_SWITCH_LOGS=1; - exact native Linux pthread baseline evidence, including compact-versus-padded result-slot diagnostics;
- larger-workload/Amdahl evidence through
CAPOS_THREAD_SCALE_TOTAL_BLOCKSandLINUX_THREAD_SCALE_TOTAL_BLOCKS.
This evidence does not prove the primary remaining cause of non-scaling. Per-CPU runnable ownership, accepted work/total speedup thresholds, and optional symbolic guest attribution remain follow-on work before a scheduler policy claim.
This protects the design from treating QEMU/KVM, serial MMIO, or benchmark cache contention as a scheduler algorithm problem.
Stage 1: Per-CPU Runnable Ownership
Split the scheduler’s runnable state first. The accepted initial shape has
per-CPU run queues with a runnable ThreadRef deque or priority buckets,
current-thread state, a local reschedule flag, and local counters. Shared
scheduler state keeps process/thread metadata, sleeping/deadline waiters,
blocked waiters, migration records, and the global policy epoch.
Rules:
- A runnable
ThreadRefis owned by exactly one CPU queue at a time. - Cross-CPU wake enqueues to the target CPU or a policy-selected CPU and sends a bounded reschedule IPI when needed.
- Migration removes from one owner before publishing to another.
- Idle CPUs steal only through bounded policy, not by scanning every process.
- Process exit and thread exit keep cleanup bounded and must not allocate in interrupt, cancellation, or emergency paths.
This stage may still use round-robin within each CPU queue. The objective is SMP structure and evidence, not perfect fairness.
First implementation evidence exists as commit 1a8bf909: capOS introduced
four bounded per-scheduler-CPU FIFO runnable queues under the existing
global scheduler lock. That slice proved the basic ownership structure and
bounded steal path. Follow-up review fixes reserved per-CPU queue capacity
before a thread became runnable, using a live reservation count released on
process/thread exit or pre-publication rollback, so timer and unblock
requeues did not allocate after work moved between CPUs. Update 2026-05-02:
the per-CPU queues were collapsed back into a single global runnable queue
under the same scheduler lock with the per-CPU run-queue-collapse cleanup
slice (see docs/backlog/scheduler-evolution.md and
docs/architecture/scheduling.md). Update 2026-05-07 23:45 UTC: Phase D
Task 3 reintroduced the per-CPU runnable queues, this time ordered
ascending by virtual_finish_ns (Weighted Fair Queueing) and balanced by
a bounded steal path that picks the most-overdue sibling Runnable
candidate (each sibling queue’s first entry the destination CPU
considers Runnable; ties broken by lower CPU id). The queue ownership
and migration contract is documented in the scheduling architecture
page. This does not close the stage: the scheduler still
needs stronger cross-CPU wake counters, further separation from shared
process/thread metadata, replacement of temporary pinning policy, and
accepted benchmark evidence before policy conclusions should change.
Stage 2: CPU Accounting
Add a monotonic runtime charge model. ThreadCpuAccount records runtime,
last-start time, virtual runtime, context switches, preemptions, and voluntary
blocks. SchedEntity records weight, latency class, eligible time, and virtual
deadline.
Accounting must be stable enough to support fair scheduling, quotas, and future scheduling contexts. It must account context switches, blocking syscalls, endpoint direct handoff, timer preemption, thread exit, and idle.
Where exact cycle attribution is not yet credible, the implementation should label the metric as diagnostic rather than enforcing policy from it.
Stage 3: Best-Effort Fair Policy
Stage 3’s first implementation slice has landed. Phase D passed its Task 6
evidence gate at commit 77caafc0 (2026-05-10 19:39 UTC,
docs(scheduler): record phase d thread-scale gate) and closed in docs commit
1a08ec23 (2026-05-10 21:47 UTC, docs(scheduler): close phase d) with
weighted fair queueing (WFQ) as the accepted best-effort policy. The
controlled Task 6 benchmark pair recorded capOS 1-to-4 work/total
speedups 3.088x / 2.700x at 4 workers, materially closing the
prior single-global-queue 1.566x / 1.538x diagnostic gap while
the matching Linux pthread baseline on the same host and physical-core
logical CPUs 0,1,2,3 recorded 3.974x / 3.850x. The completed
execution plan is archived at
docs/backlog/scheduler-evolution.md.
After Phase D, capOS should continue ordinary best-effort scheduling from WFQ toward virtual-time fairness with stronger eligibility semantics only when that follow-on is explicitly selected.
The long-term target policy is EEVDF-like:
- runnable entities accrue lag against their fair share;
- eligible entities are ordered by virtual deadline;
- weights affect virtual runtime/deadline progression;
- latency-sensitive best-effort entities can request smaller slices within policy limits;
- migration preserves accounting so moving CPUs does not reset fairness.
The first implementation slice was intentionally narrower than EEVDF: weighted fair queueing on top of the existing per-thread runtime/vruntime accounting. That decision and its accepted evidence are recorded in the next subsection.
Phase D first-policy decision (2026-05-05 19:00 UTC)
Decision: weighted fair queueing (WFQ) for the first Phase D slice; EEVDF
remains the deferred follow-on. Recorded against main commit
60e421ab and the 2026-05-02 21:38 UTC thread-scale evidence pair
against main commit 374f8556 (capOS work 1.566x versus Linux
3.963x at 1-to-4 on the same physical-core pin set).
Rationale (concise):
- The 1-to-4 gap is dominated by single-global-queue scheduler-lock contention plus exit/join/block/schedule overhead, not by ordering. Any fair-share policy that successfully consumes a per-CPU split should close most of the gap. The simpler policy reaches that signal sooner with less risk.
- The existing
ThreadCpuAccountingrecord separates the load-bearing ledger from benchmark diagnostics:runtime_ns,virtual_runtime_ns, andlast_started_nsare unconditional, whilecontext_switches,preemptions,voluntary_blocks,migrations, placement history, and blocked/exited stability probes stay behindcfg(feature = "measure"). WFQ needs only a per-thread weight and a virtual finish time derived from the unconditional vruntime; that mapping is direct. EEVDF additionally needs a per-thread request size, lag, eligibility deadline, and an ordered eligible-set structure (BTreeMapby virtual deadline). The runtime/vruntime accounting fields exist, but the eligibility/lag fields do not. - The target environment is
no_stdplusspin::Mutexplus a single global scheduler lock. WFQ keeps the eligibility structure as a bucketed per-CPU FIFO ordered approximately by virtual finish time; that is a familiarVecDeque-shaped data structure that mirrors the currentrun_queue: VecDeque<ThreadRef>ownership. EEVDF requires an ordered set inside the scheduler-lock-protected dispatch state, which is a larger structural change than the slice the gap evidence motivates. - Latency-class differentiation (interactive / batch / IPC server) is expressible in WFQ; Phase D pins the mapping below in the capability-surface section so the implementation slice and the short-sleeper smoke have one rule. The Phase H policy service can layer richer policy on top without requiring a tree representation underneath.
- Linux moved from CFS to EEVDF in mainline 6.6 (released 2023-10); WFQ has decades of stable OS lineage. Either choice is defensible. The weighted-fair slice does not lock capOS into WFQ permanently — the same accounting fields, capability surface, and migration contract carry directly into EEVDF when the eligibility structure is added.
Rejected alternative: EEVDF-first. It is the stronger long-term
policy and Linux’s current default. We are not picking it for the first
slice because (1) the eligibility-set data structure is a larger
diff that mixes structural change with the per-CPU enqueue
reintroduction the 1-to-4 gap evidence already motivates; (2) the lag
accounting and request-size ABI are not load-bearing for closing the
single-global-queue contention bottleneck the recorded benchmark
exposes; (3) moving from WFQ to EEVDF is a localized policy-module
change once the capability surface, migration contract, and per-CPU
queue split are accepted. The deferred EEVDF follow-on is tracked as
a later policy-evaluation slice; it is not a Phase D blocker and does
not displace Phase E SchedulingContext, which is the next scheduler
authority phase after the accepted WFQ gate.
First-slice scope (smallest implementable surface that closes the 1-to-4 gap):
- per-thread
weight: u16andlatency_class: LatencyClassfields, default values matching the current single-class FIFO behavior; the cap-boundary path rejectsweight = 0and any nonzero value outside[MIN_WEIGHT, MAX_WEIGHT](Phase D constants) withCapException::InvalidArgumentrather than silently clamping, so no later divide-by-zero or overflow path can be reached throughsetWeightand so callers see policy denial instead of a hidden mutation. TheinvalidArgumentvariant landed inExceptionTypealongsideSchedulingPolicyCapandLatencyClasswith Phase D Task 1 (commit cb8c58b1, 2026-05-07); seedocs/proposals/error-handling-proposal.mdfor the updated client-response taxonomy. The full validation rule lives in the cap-surface authority section below; this bullet records only that the validation runs at the cap boundary, not the dispatch path; - per-thread weighted vruntime charging at runtime-charge points: the
existing
ThreadCpuAccounting.virtual_runtime_nsadvances byelapsed_ns * REFERENCE_WEIGHT / weight(instead of the current 1:1 elapsed) on every charge_runtime call.runtime_nscontinues to advance 1:1 with elapsed time so monotonic CPU accounting, measure-mode reporting, and snapshot APIs are unchanged. The weighted-vruntime change is the actual fairness mechanism; without it, weights affect only enqueue-order ties rather than cumulative share. This matches the CFS-lineage approach and keeps the WFQ derivationvirtual_finish = vruntime + slice * REFERENCE_WEIGHT / weightpurely as an ordering aid for the local bucket; - per-thread
virtual_finish_ns: u64recomputed at each enqueue fromvirtual_runtime_ns + slice_ns * REFERENCE_WEIGHT / weight. It is not stored across blocking and is never carried as committed state; it is the per-enqueue ordering tag only; - per-CPU bounded
run_queues: [VecDeque<ThreadRef>; SCHEDULER_CPUS](reintroduced) each ordered ascending byvirtual_finish_ns; local selection scans the queue by index for the first destination-Runnable entry (RetryLater entries left in place; the first Runnable hit is also the lowestvirtual_finish_nscandidate the destination can accept because the queue is ordered), then falls back to a bounded steal scan of sibling per-CPU queues; - scheduler-lock-contained migration that keeps
virtual_runtime_nswith the thread (per-thread state, not per-CPU) and re-inserts on the destination CPU at the post-migration virtual finish time; - a capability-authorized policy path (see §“Phase D capability surface” below) that gates weight/latency-class mutation and reads;
- one-bisect-cycle single-global-queue fallback under
CAPOS_SCHED_DISABLE_WFQ=1, now retired by Phase E preflight beforeSchedulingContextschema work.
The first slice is accepted: the 2026-05-10 19:46 UTC
make run-thread-scale evidence pair recorded in docs/changelog.md
and docs/benchmarks.md passed the harness-enforced 1-to-2 work/total gates,
and Phase D manually accepted the recorded 1-to-4 work/total diagnostics for
closeout. The historical success threshold lives in
docs/backlog/scheduler-evolution.md.
Phase D capability surface (kernel-side authority, no ambient process fields)
Per docs/capability-model.md “the interface IS the permission”, weight
and latency-class authority is granted by giving a process a
SchedulingPolicyCap with the appropriately scoped target. The kernel
rejects any state mutation that does not arrive through such a cap.
Schema (landed with Phase D Task 1, commit cb8c58b1, 2026-05-07; the
original sketch took a target :ThreadHandle per method, but the
methods carry no target argument because Phase D associates the
target through cap state, not a per-method handle parameter.
Phase D Task 2 (closeout 2026-05-07 22:51 UTC) selected the
context-derived caller-thread fallback binding from the three
sketched options. Every method routes to the calling thread,
looked up through CapCallContext::caller_thread. The kernel
cap object remains zero-sized (SchedulingPolicyCap); routing
moved from call to call_with_context so the dispatch path
sees the caller’s ThreadRef. There is no per-cap-object
ThreadHandle, no badge-encoded thread id, and no cross-thread
or cross-process mutation in this slice; per-cap-object target
references and badge-encoded thread ids are reserved for the
Phase H privileged scheduler policy service that will need
cross-thread authority. Today the manifest grant path therefore
authorizes the holder’s own threads in the strict sense – a
holder cannot reach another thread’s weight or latency_class
through this cap):
enum LatencyClass {
interactive @0;
normal @1;
batch @2;
ipcServer @3;
}
interface SchedulingPolicyCap {
setWeight @0 (weight :UInt16) -> ();
setLatencyClass @1 (class :LatencyClass) -> ();
snapshot @2 ()
-> (weight :UInt16, class :LatencyClass,
runtimeNs :UInt64, virtualRuntimeNs :UInt64);
}
The snapshot return is intentionally narrow: the four fields it
exposes (weight, class, runtimeNs, virtualRuntimeNs) are
the ones the WFQ slice promotes out of cfg(feature = "measure")
unconditionally. The benchmark-only counters
(context_switches, preemptions, voluntary_blocks,
migrations) stay behind the measure feature because they are
not load-bearing for ordering and remain useful only for
benchmark instrumentation; a future operator-observability slice
can add them to a separate snapshot cap once a non-emergency-path
storage and reporting surface exists.
Authority rules:
setWeightandsetLatencyClassare kernel-checked: an SQE invocation must carry a liveSchedulingPolicyCap. The methods carry no per-callThreadHandle; the target binding (selected in Phase D Task 2) is the context-derived caller-thread fallback: the kernel routes throughCapCallContext::caller_thread, so a holder can only mutate its own running thread by construction. If a future cross- process grant lets a holder invoke the cap without authority over its bound target, the call fails closed through the standard cap-revocation transport-error path (thedisconnected-classCapExceptionproduced by the ring dispatcher when the cap is revoked or stale); theExceptionTypetaxonomy has noDeniedvariant by design.setWeightvalidates the input at the cap boundary, not at the dispatch path. The validation rule is:weight = 0(which would make the WFQ derivationslice_ns * REFERENCE_WEIGHT / weightdivide by zero) is rejected withCapException::InvalidArgument; any nonzero value outside[MIN_WEIGHT, MAX_WEIGHT](Phase D constants) is also rejected withCapException::InvalidArgument. The kernel does not silently clamp out-of-range values, because a silent clamp masks caller bugs and hides cap-boundary policy from the audit surface. TheinvalidArgumentvariant landed inExceptionTypewith Phase D Task 1 (commit cb8c58b1, 2026-05-07); the updated client-response taxonomy is indocs/proposals/error-handling-proposal.md.- The bootstrap
SchedulingPolicyCapis granted by manifest only. Its initial domain isSelf(the holder’s own threads). Wider authority (cross-process weight/class mutation) belongs to the Phase H privileged scheduler policy service; Phase D does not promise that grant in the default boot manifest. Phase D manifests grant only the focused-proof scope needed for the test-matrix smokes. - Default policy: a thread without any explicit cap-driven mutation
carries
weight = DEFAULT_WEIGHTandlatency_class = LatencyClass::Normal. Behavior with all defaults must preserve the pre-Phase-D default workload behavior at the limit (no fairness regressions for unmodified workloads). - Stale-cap revoke:
SchedulingPolicyCapmutations carry the generation/epoch model used elsewhere. A weight change submitted after the cap is revoked fails closed; partially applied changes on a thread that exits between SQE arrival and dispatch fail with the standardStaleoutcome and do not leak weight state. - The cap surface is a single typed interface; restriction is by
granting a narrower wrapper (e.g.,
SchedulingPolicyCapwhose authority domain is exactly oneThreadHandle). The kernel does not carry a parallel rights bitmask.
Latency-class semantics for Phase D (pinned mapping):
LatencyClass::Normalis the baseline;weightalone determines the WFQ share. The selectedslice_nsis the Phase D default quantum.LatencyClass::Interactivereduces the per-enqueue slice contribution by a Phase D constant (INTERACTIVE_SLICE_DIVISOR; Phase D Task 2 ships2): the WFQ derivation becomesvruntime + (slice_ns / INTERACTIVE_SLICE_DIVISOR) * REFERENCE_WEIGHT / weight. This places the entity earlier in the per-CPU queue on each enqueue, so a short-sleeper that wakes on a Timer completion runs ahead of a same-weight CPU hog within the same scheduling window. The cumulative share is unchanged because vruntime accounting still advances atelapsed_ns * REFERENCE_WEIGHT / weight; the class only affects the per-enqueue tag, not the runtime-charge step.LatencyClass::Batchincreases the per-enqueue slice contribution by a Phase D constant (BATCH_SLICE_MULTIPLIER; Phase D Task 2 ships4): the derivation becomesvruntime + (slice_ns * BATCH_SLICE_MULTIPLIER) * REFERENCE_WEIGHT / weight. This places the entity later in the per-CPU queue on each enqueue, so a CPU hog atLatencyClass::Batchyields wake-to- run latency toLatencyClass::NormalandLatencyClass::Interactivesiblings without losing its weighted share over a long window.LatencyClass::IpcServeris treated identically toLatencyClass::Normalfor the WFQ ordering tag in this slice. The class exists in the ABI so a Phase H policy service can later re-bind direct-IPC preference, server affinity, or scheduling-context donation rules without an ABI break; Phase D does not change the existing direct-IPC preference slot semantics for this class.- The class is stored on
Threadand read at every enqueue. A class change throughsetLatencyClassis observed on the next enqueue (next dequeue + re-enqueue, or next wake from blocked). No retroactive recomputation of an in-queue tag.
Phase D does not build the userspace policy service (Phase H). It
adds the kernel-side primitive that Phase H will consume.
SchedulingContext (Phase E) is a separate authority for
budget/period/CPU mask; weight/latency-class is the WFQ ordering knob,
not CPU-time authority. The two cap surfaces stay disjoint.
Phase D migration fairness sketch
A thread migrating from CPU A to CPU B mid-quantum must preserve its share. Rules:
virtual_runtime_nsis per-thread, not per-CPU. It travels with the thread on every migration. The accounting record already encodes that (ThreadCpuAccounting.virtual_runtime_nslives onThread, not on a CPU slot). Phase D promotes that field out ofcfg(feature = "measure")and changes thecharge_runtimestep so the field advances byelapsed_ns * REFERENCE_WEIGHT / weightrather than 1:1 with elapsed time; the migration contract is otherwise unchanged.- Per-CPU local clocks are not used as a vruntime reference. The
scheduler reads the global monotonic clocksource through
crate::arch::context::monotonic_ns(), the same source the unconditional runtime/vruntime ledger uses. There is no per-CPU clock offset because there is no per-CPU vruntime reference. virtual_finish_nsis recomputed at enqueue on the destination CPU from the destination weight, not carried as committed state. The migration step is remove-from-source, recompute, insert-at-destination; the scheduler lock is held for the whole window.- Cross-CPU steal: a CPU whose local queue has no runnable entry
walks sibling per-CPU queues. For each sibling queue the scan
walks indices ascending and stops at that queue’s first entry
the destination CPU considers
Runnable; because each queue is ordered ascending byvirtual_finish_ns, the first Runnable hit per queue is the lowestvirtual_finish_nscandidate the destination can accept on that source. The steal target is then the source queue whose first-Runnable candidate has the lowestvirtual_finish_nsglobally — the same fair-share rule the local pick uses (most overdue first) — with ties broken by lower CPU id. The chosen entry is removed from its actual position on the source queue (not necessarily the head: a RetryLater or single-CPU-owner thread may sit at the front and stay there); the destination recomputesvirtual_finish_nsand inserts at the destination ordered position. The steal is allocation-free because both queues are pre-reserved against the live runnable count. - The
ThreadCpuAccounting.migrationscounter is incremented on each cross-CPU enqueue, both for placement-time spread and for steal. The behavior mirrors the prior pre-collapse counter; the Phase D slice keeps it undercfg(feature = "measure")until a permanent operator snapshot path lands.
The one-bisect-cycle single-global-queue fallback has been retired before Phase E. The accepted Phase D behavior is now always the per-CPU WFQ queue shape described above.
Phase D test matrix
Workload shapes the implementation slice verified before close:
- CPU hogs (existing
make run-thread-scale). Equal-weight same-process threads must split CPU share within bench tolerance. Different-weight threads must split CPU share approximately in proportion to weights (e.g., weights2:1→ roughly2:1runtime ratio). Phase D manually accepted the recorded 1-to-4 diagnostic at3.088xwork speedup versus the recorded1.566xbaseline. - Short sleepers. Threads that block on
Timer.sleepfor short intervals must preempt CPU hogs within one quantum’s worth of bound after wake. Latency-classInteractiveshould have lower observed wake-to-run latency than latency-classBatch. Phase D closed this with focusedmake run-thread-fairnessandmake run-thread-fairness-interactiveQEMU smokes. - Direct IPC server/client pairs (existing
make run-spawn). An IPC server thread woken by an endpoint CALL must keep paired-call timing comparable to the current direct-IPC handoff. The direct-IPC preference slot must keep its existing generation-checked semantics under WFQ; a server should not starve when the global vruntime advances on other CPUs. - Multi-process load (existing
make run-smp-process-scale). Independent worker processes with default weights must preserve the recorded2026-04-301.6x1-to-2 gate. WFQ across processes (no shared address space) must not regress that proof. - Same-process sibling load. This is the same workload shape
as
make run-thread-scale; it doubles as the per-CPU-queue reintroduction proof.
The exact historical per-workload acceptance numbers live in
docs/backlog/scheduler-evolution.md.
Phase D overload behavior
Soft overload (runnable entities × weight exceeds the selected CPU set’s capacity):
- Each entity gets less than its weighted share. No entity is starved; vruntime ordering guarantees that the most-behind thread runs next.
- The scheduler does not refuse to enqueue. Phase D’s WFQ does
not implement strict admission; that belongs to Phase E
(
SchedulingContextbudget/period) and Phase G (RealtimeIslandadmission).
Hard overload (e.g., a RealtimeIsland admission attempt that
collides with an active CpuIsolationLease):
- Use the existing isolation/admission path; Phase D defers to
Phase F’s
CpuIsolationLeaseand Phase G’sRealtimeIslandfor that behavior. WFQ continues to schedule best-effort work on the housekeeping CPU set. - If an isolation lease holds CPU N and N has runnable best-effort work that cannot migrate (e.g., bound by manifest pinning), the lease attempt fails closed; existing CPU-mask validation remains the gate. Phase D does not introduce new pinning policy.
Strict admission, deadline overrun, and budget depletion are explicitly out of scope for Phase D and stay in Phase E/G.
Stage 4: Scheduling Contexts
CPU-time authority becomes a capability. SchedulingContext records budget,
period, relative deadline, priority or criticality, CPU mask, remaining
budget, replenishment state, timeout endpoint, and overrun policy.
The landed Phase E slices remain narrower than the full target above. The ABI
now has SchedulingContextSpec authority inputs for budgetNs, periodNs,
relativeDeadlineNs, byte-oriented cpuMask, and overrunPolicy, plus a
read-only SchedulingContextInfo snapshot with context identity, lifecycle
state, binding state, remaining budget, and an explicit dispatch-effect label.
SchedulingContext.info() remains method id 0. SchedulingContext.create()
creates a same-interface result cap for a validated spec,
bindCallerThread() records one caller-thread binding for the current
generation, and revoke() advances the generation and clears the matching
thread metadata binding. Bootstrap-granted contexts and contexts returned by
create() draw from the same non-wrapping context-id allocator, so the
(contextId, generation) binding key does not alias distinct cap objects.
Bound active contexts now install a fixed per-thread dispatcher budget ledger:
runtime charge decrements remainingBudgetNs, runnable selection replenishes
elapsed periods, and exhausted contexts remain queued but ineligible until the
next replenishment period. The effect label is budgetEnforced for active
contexts and stays infoOnlyNoDispatchChange for stale/revoked fail-closed
paths. Deadline-driven accounting now arms a sub-tick budget-exhaustion
one-shot when the selected thread’s remaining budget would deplete before the
next periodic scheduler tick, and nohz re-arm folds the leased thread’s budget
deadline into its existing nearest-deadline timer. Kernel-mode budget one-shot
fires restore a live periodic timer before returning to kernel code, so the
ordinary and tick-masked paths no longer rely on a full tick quantum to observe
budget depletion.
Synchronous endpoint donation/return now covers passive receiver threads:
endpoint in-flight state carries an internal donation token, receiver runtime
charges to the caller-donated context, RETURN, application-exception RETURN,
or invalid-result RETURN restores the reduced budget to the caller before
caller wake, a donor with an in-flight token is blocked from returning to
userspace until RETURN/cancel using an atomic marker-to-block transition that
treats already-returned fast paths as normal completion, and nested donation of
an already donated context is rejected until stacked return tokens have a
dedicated design.
Timeout/depletion notifications now use fixed per-context cells allocated at
context creation/bootstrap. The cells coalesce budget-depleted and
deadline-or-timeout events with typed sequence/count metadata, holder identity,
remaining budget, next timestamp, donated-holder marking, explicit-revoke
lifecycle state, and ok/revoked/staleGeneration observer results through
SchedulingContext.drainNotifications(). Notification publishing does not
allocate in scheduler hard paths, publish result caps, append unbounded queues,
donate budget, reorder runnable entities, bypass throttling, or imply nohz
behavior. A pre-armed observer waiter/wakeup path, realtime admission, SQPOLL,
nohz, and CPU placement enforcement remain future work. Stale caps report
staleGeneration and cannot mutate the new generation’s scheduler metadata or
budget ledger; revoked contexts report revoked. Ordinary non-donated
session logout now uses the same stale-generation rule: after
UserSession.logout() flips the liveness cell, the scheduler removes matching
non-donated bound thread contexts and marks the old cap generation stale. The
focused session-context proof covers stale info, bindCallerThread,
create, revoke, and notification-drain behavior without result-cap
publication or metadata mutation. Donated receiver logout keeps the
conservative skip policy: if logout observes a receiver thread holding an
endpoint-donated context, the hook counts the skipped donated binding and
leaves the donor blocked until endpoint RETURN/cancel commits cleanup. The
focused session-context proof covers the RETURN case by showing the receiver
logs out while holding the donation, the donor stays blocked, the hook reports
donation_inflight_skipped=1, and the caller observes a bound context with
reduced remaining budget after RETURN rather than fresh budget. Clean local
owner-shell exit now calls the held UserSession.logout() before process exit,
and the shell smoke observes the same scheduler hook with no bound local shell
SchedulingContext.
cpuMask is a canonical little-endian bitset. CPU n maps to bit n % 8 of
byte n / 8, with bit 0 as the least-significant bit of each byte. Empty data
means no CPUs are selected, not “all CPUs”; future admission/bind validation
rejects empty masks for runnable contexts. Producers omit trailing zero bytes:
the all-zero set is encoded as empty, and any non-empty canonical mask has a
nonzero final byte. This slice only snapshots that shape and does not enforce
placement from it.
Remaining kernel responsibilities:
- prevent a thread without eligible CPU authority from running;
- charge runtime to exactly one authority target;
- add any pre-armed timeout/depletion observer wake path without allocating in emergency paths.
Policy-service responsibilities:
- admit or reject scheduling contexts;
- choose budget/period/priority;
- bind contexts to threads/services;
- revoke or adjust contexts safely;
- record operator-visible decisions.
SQE.deadline_ns remains request metadata. It may influence drop, freshness,
propagation, and telemetry, but it does not grant CPU budget.
Stage 5: CPU Isolation Leases and SQPOLL
CpuIsolationLease grants placement and exclusivity, not CPU time. It records
the owner process/session/service, CPU set, mode, housekeeping exclusions,
accounting target, maximum revocation latency, and revoke endpoint.
The current Phase F implementation keeps ticks periodic but makes
housekeeping/deferred-work placement explicit: at least one online scheduler
housekeeping CPU must remain outside active lease candidates, and preflight
telemetry routes or rejects deferred cleanup, timer/deadline, network polling,
IRQ affinity, scheduler accounting, and cleanup latency before later SQPOLL or
nohz behavior can use the lease.
The Phase F substrate landed so far is:
- the one-SQ-consumer ring-ownership prerequisite that lets nohz/SQPOLL reason about a single submission consumer per ring;
- nohz activation telemetry that labels admit/reject decisions, rollback reasons, and current periodic-tick fallback state without changing dispatch behavior;
- housekeeping/deferred-work placement preflight, which fail-closes when unrelated timers, deferred cleanup, network polling, debug/watchdog work, or IRQ delivery would otherwise be pinned to a candidate isolated CPU;
- a bounded SQPOLL ring-mode worker (
MAX_SQPOLL_WORKERS = 16) that recordstick_suppression=disabled/full_nohz=disabledstrings while the activation proof is still open, with generation-checked stale-owner rollback; - a clockevent/deadline substrate independent of the periodic tick, so the scheduler can express “wake at deadline T” without depending on periodic ticks to enforce budget;
- a bounded non-periodic SQPOLL producer-wake progress path that lets a parked SQPOLL worker make forward progress on producer activity without reverting to a periodic tick.
Automatic nohz activation – actually suppressing the periodic scheduler
tick on an admitted CPU and restoring it on rollback/revoke/stale
generation – was closed for the first increment via
docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md:
the CpuIsolationLease preflight now performs
real per-CPU periodic-tick suppression for the narrow single-runnable-entity
window, satisfying proof obligations for single runnable entity on the
target CPU, ready housekeeping CPU outside the lease, non-local
deferred-cleanup/timer/network/IRQ dependencies, valid accounting target,
bounded revocation latency, and generation-checked ring ownership, with
fail-closed rollback. SQPOLL-driven auto-nohz activation is also closed via
docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md:
a ring-coupled kernelSqpoll lease whose bound ring is in SQPOLL
running/sleeping mode with a live owner is admitted for tick suppression,
with the SQPOLL ring-state re-check as the decisive rollback gate. The
tick_suppression, auto_nohz, and sqpoll telemetry counters reflect
real suppression. Generic full-nohz for ordinary budgeted compute threads is
now admitted by explicit SchedulingContext-targeted CpuIsolationLease
preflight; production realtime island admission remains deferred independently
of these closed tasks.
Activation requires scheduler proof:
- at least one housekeeping CPU remains online;
- unrelated timers, deferred cleanup, network polling, and debug/watchdog work are not pinned to the isolated CPU;
- the active ring has exactly one SQ consumer;
- the accounting target is valid and chargeable;
- revocation latency fits the lease policy.
The scheduler idle path is now a per-CPU CPL0 (kernel-mode) idle thread;
the user-mode idle process was removed in commit e3c0df01 (2026-05-14 UTC).
There are two CPL0 idle paths: the cooperative boot/AP path that hlts at
CPL0 on the per-CPU kernel stack, and the steady-state idle-thread path
reached from the four dispatch sites (schedule, capos_block_current_syscall,
exit_current, exit_current_thread). Both are described in detail in
Scheduling.
SQPOLL uses the ring-mode contract in Tickless and Realtime Scheduling. The scheduler proposal adds the CPU-ownership and policy-service side of that contract.
Stage 6: Realtime Islands
A RealtimeIsland is an admitted graph, not a single priority. It records
scheduling contexts, memory reservations, device and IRQ reservations,
rings/endpoints/notifications, any CPU isolation leases, admission evidence,
and overrun/shutdown policy.
Use cases include local audio, realtime voice, robotics control, and selected provider/runtime loops. Admission must fail closed if the graph cannot fit the declared period/quantum and reservations.
Stage 7: User-Space Scheduler Policy
After kernel primitives are in place, a privileged scheduler policy service can own:
- default resource profiles;
- session/account/service CPU policy;
- scheduling-context admission;
- CPU lease grant/revoke;
- runtime hints such as latency-sensitive, batch, driver, poller, or agent;
- AutoNoHz placement for ordinary threads that appear capable of utilizing a full CPU core (see Policy-Service Userstories in tickless-realtime-scheduling-proposal);
- operator-facing diagnostics and policy reload.
AutoNoHz placement is the policy-service surface that turns the “thread
appears capable of utilizing a full CPU core” observation into a bounded
CpuIsolationLease against a pre-authorized account or session CPU pool. The
lease adds isolation; it does not mint CPU-time authority. The thread still
consumes time through its existing SchedulingContext (or coarse
ResourceLedger); the lease just removes tick and scheduler noise while that
budget is being consumed. Bounds the policy service must enforce on every
auto-issued lease – lifetime, revocation latency, accounting target,
auto-claim pool capacity, and fairness preemption – are detailed in the
tickless proposal.
The kernel still owns emergency fallback. If the policy service is dead, blocked, stale, or malicious, the kernel must continue to enforce safety, revoke leases as policy permits, and schedule a minimal recovery path.
Validation Gates
- Per-CPU queue work must preserve
run-smoke,run-spawn,run-thread-scale, park/ring/process-exit smokes, and SMP smokes. - A thread-scale milestone closeout must include repeated controlled
capos-benchevidence and raw logs. - CPU accounting must include sanity tests that measured runtime increases monotonically while a thread runs and stops while it is blocked.
- Fair policy changes must include adversarial tests: CPU hogs, short sleepers, direct IPC handoff, multi-process load, and same-process sibling load.
- Scheduling-context work must include admission rejection, budget depletion, replenishment, endpoint donation/return, timeout notification, stale cap revocation tests, and any future pre-armed notification waiter coverage.
- CPU leases must include revocation, process exit, session close, and housekeeping fallback tests.
- Realtime island proofs must show preallocation, no allocation/blocking on admitted paths, deadline miss telemetry, and fail-closed overrun behavior.
Open Decisions
Whether the first best-effort fair policy should be weighted fair queueing or direct EEVDF.Resolved 2026-05-05 19:00 UTC: WFQ first; EEVDF deferred follow-on. See “Phase D first-policy decision” above.- Whether scheduling-context priority is a scalar, a criticality band, or both.
- Whether
SchedulingContextshould be bindable to a process default, individual thread, endpoint call path, or all three in the first ABI. - Which scheduler telemetry is permanent ABI and which is benchmark-only.
- How much policy-service state belongs in the boot manifest versus mutable operator configuration.
- Whether the WFQ slice’s bucketed
VecDequeper-CPU queue is the long-term representation or a stepping stone to an EEVDFBTreeMap-based eligibility set. EEVDF is an evaluated follow-on policy, not a committed migration; re-evaluate only when the explicit Phase D follow-on EEVDF migration backlog item is selected. Phase F’s one-SQ-consumer prerequisite, nohz telemetry, housekeeping/deferred-work placement, bounded SQPOLL ring mode, clockevent/deadline substrate, and bounded non-periodic SQPOLL producer-wake progress have landed on top of the closed Phase ESchedulingContextgate; the first automatic nohz activation increment is also closed viadocs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.mdand SQPOLL-driven auto-nohz activation is closed viadocs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md; timeout-based auto-revoke and ordinary-thread generic full-nohz admission are also landed. The policy-service AutoNoHz capstone and generic SQPOLL nohz for arbitrary rings remain open. Phase F.5 (full-SMP 16/32-core scalability) is still planning.
Proposal: Tickless and Realtime Scheduling
This proposal captures the scheduling design from the 2026-04-29 discussion and the subsequent implementation status: tickless idle is useful, full-nohz belongs behind explicit CPU isolation authority, and realtime requires scheduling contexts rather than only per-request deadlines.
Design Grounding
The directly relevant grounding is:
- NO_HZ, SQPOLL, and Realtime Scheduling
- Out-of-kernel scheduling
- Completion rings and threaded runtimes
- Multimedia pipeline latency
- Robotics realtime control
- x2APIC and APIC virtualization
- Scheduling
- Ring v2 For Full SMP
- SMP
- Realtime Voice Agent Shell
External grounding is recorded in the research note so reviewers can audit the prior-art claims without treating this proposal as the source of truth.
Goals
- Add tickless idle: when a CPU has no runnable work, stop the periodic scheduler tick and program the local timer for the earliest known deadline.
- Split monotonic timekeeping from timer interrupt delivery.
- Convert scheduler timeout waiters to absolute monotonic deadlines.
- Stage full-nohz as an explicit CPU isolation/lease mode for SQPOLL and realtime executors, not as a generic scheduler default.
- Define
SQE.deadline_nsas request freshness metadata. - Define
SchedulingContextas CPU-time authority. - Define
RealtimeIslandas the admission object for media, robotics, provider, and other bounded realtime graphs.
Non-Goals
- No ambient Linux-style
NO_HZ_FULLfor arbitrary unbudgeted user threads. Ordinary-thread full-nohz requires an explicit budgetedSchedulingContexttarget and aCpuIsolationLease. - No SQPOLL on the current process-wide ring.
- No second SQ consumer through timer-side polling for SQPOLL rings.
- No TSC-deadline or x2APIC requirement for the first tickless-idle milestone.
- No hard realtime claim before kernel-path, IRQ, device, locking, and WCET evidence exists.
- No full realtime policy blob inside every SQE.
CPU Authority Taxonomy
These terms must not drift into overlapping authority systems:
ResourceProfile:
policy template selected by identity, session, account, or service profile;
it is not spendable authority by itself.
ResourceLedger:
coarse accounting and quota owner for a resource class. It records and
enforces limits, including non-realtime CPU share/runtime budgets where the
scheduler has not minted finer scheduling contexts.
SchedulingContext:
spendable CPU-time authority with budget, period, relative deadline,
priority/criticality, CPU mask, and overrun policy.
CpuIsolationLease:
placement, exclusivity, and nohz/noise-isolation authority for a CPU or CPU
set. It does not grant CPU-time credit and must charge consumed time through
a SchedulingContext or coarse scheduler ResourceLedger.
NoHzEligibility:
a reviewed claim or hint that a thread, ring, poller, or island may use nohz
isolation if the scheduler can prove the current CPU state allows it.
NoHzActivation:
the scheduler-proven current CPU state that actually suppresses ticks.
RealtimeIsland:
admitted bundle of SchedulingContexts, memory reservations, device
reservations, rings, endpoint/service constraints, and optional
CpuIsolationLeases.
Scheduling-context donation is not generic resource donation. It donates only execution budget/deadline along a synchronous capability path; it does not donate capability authority, invocation subject identity, disclosure scope, memory budget, network budget, storage budget, or service-management authority.
Layer 1: Tickless Idle
Tickless idle should be the first behavioral milestone. It applies only when the CPU has no runnable thread and no local work that still depends on a periodic scheduler tick.
Clocksource
Add a monotonic clock layer:
#![allow(unused)]
fn main() {
pub fn monotonic_ns() -> u64;
}
The first backend can use the current periodic tick as a compatibility source while the system is still periodic. The selected QEMU/x86_64 backend should eventually use a calibrated stable counter, with SMP consistency handled when multiple scheduler owners exist.
Required invariant:
monotonic_ns() never moves backwards on one CPU.
Clockevent
Add a small scheduler timer backend boundary:
#![allow(unused)]
fn main() {
trait ClockEvent {
fn program_periodic(period_ns: u64);
fn program_oneshot(delta_ns: u64);
fn stop();
fn min_delta_ns() -> u64;
fn max_delta_ns() -> u64;
}
}
The first backend is the current PIT-calibrated xAPIC LAPIC timer on vector 48. PIT/PIC and periodic LAPIC remain fallback paths.
Deadline Waiters
Convert timeout state from tick counts to absolute deadlines:
#![allow(unused)]
fn main() {
struct DeadlineWaiter {
deadline_ns: u64,
target: ThreadRef,
kind: WaiterKind,
user_data: u64,
}
}
Affected paths:
Timer.sleep;cap_enter(timeout_ns);- ParkSpace timeout;
- future process/thread wait timeouts;
- network poll deadline through
NetworkPollClock.
Waiter storage remains bounded. No interrupt path may allocate.
Network Poll Clock
The kernel-resident networking path is scheduler-polled. Rather than keep every
network-coupled lease in ForcedPeriodic, the in-kernel virtio-net poll is now
routed off a lease-isolated CPU (landed 2026-06-04,
scheduler-nohz-network-poll-housekeeping-routing): virtio::poll_scheduler
consults sched::current_cpu_lease_nohz_active() and skips driving the poll
from a CPU inside a lease-backed tick-suppression window, so that CPU no longer
needs the periodic tick to make network progress. The always-ticking
housekeeping CPU the lease admission already requires keeps servicing virtqueue
completions and pending network-waiter scans. The CpuIsolationLease activation
preflight reflects this with a network_polling=routed-periodic-network-polling- to-housekeeping-cpu admit label when a housekeeping CPU is available, failing
closed (rejected-network-polling-no-housekeeping-cpu-to-relocate, and the lease
is refused at create when no housekeeping CPU exists) otherwise. The longer-term
explicit poll-deadline interface below remains the target for fully removing the
dependency on a housekeeping CPU continuing to tick:
#![allow(unused)]
fn main() {
trait NetworkPollClock {
fn next_poll_deadline_ns(now_ns: u64) -> Option<u64>;
fn poll_until_budget(now_ns: u64, budget_ns: u64) -> PollResult;
}
}
next_poll_deadline_ns lets the scheduler include TCP/runtime timers in
earliest_global_deadline(). poll_until_budget prevents network progress
from becoming an unbounded idle-exit or interrupt path. A CPU with active
networking may enter tickless idle only when the network runtime is inactive or
has exposed a bounded deadline through this interface.
Kernel Idle
Tickless idle depends on replacing the user-mode idle process with a kernel/per-CPU idle context. Timer IRQ handling must distinguish:
IRQ from CPL3 user thread -> save/restore user context
IRQ from CPL0 idle -> wake/check scheduler without fake user context
Idle entry shape:
if no runnable work:
deadline = earliest_global_deadline()
clockevent.program_oneshot(deadline - now)
enter_kernel_idle()
The idle loop enables interrupts, halts, wakes on timer/IPI/device interrupt, then rechecks runnable work and deadline expiry.
Tickless State
Per CPU:
Periodic:
normal scheduler tick active
TicklessIdle:
no runnable thread
one-shot local timer programmed for earliest deadline
CPU in kernel idle
ForcedPeriodic:
fallback when a subsystem still needs regular polling
Enter TicklessIdle only when:
run queue empty
no direct IPC target
no deferred completion work
no timer-side ring work required
clockevent supports one-shot
kernel idle context available
network runtime inactive or deadline-driven
Keep periodic preemption whenever there is runnable contention. Even one runnable user thread remains periodic until Ring v2, CPU accounting, and timer-side polling dependencies are resolved.
Layer 2: SQPOLL NoHz
SQPOLL full-nohz is a later CPU ownership mode:
full-nohz is not a timer feature here;
it is part of the SQPOLL CPU ownership contract.
Required prerequisites:
- Ring v2 or equivalent per-thread rings;
- one SQ consumer per ring, including implemented syscall-mode leases and bounded SQPOLL mode transitions;
- per-CPU scheduler ownership;
- reschedule IPI and idle-to-runnable handoff;
- at least one housekeeping CPU;
- explicit placement of network polling away from isolated CPUs.
Current Phase F status: CpuIsolationLease and nohz telemetry exist, the
housekeeping/deferred-work placement child records selected online
housekeeping CPU masks plus deferred cleanup, timer/deadline, network polling,
IRQ-affinity, accounting-target, and cleanup-latency placement or rejection
labels, bounded SQPOLL ring mode can progress from periodic service or one
current-thread syscall/producer-wake batch, and the clockevent/deadline
substrate has split monotonic clocksource reads from LAPIC clockevent
programming. The clockevent one-shot’s firing precision is proven, not just its
programming: a runtime-reprogrammed TICK_NS/2 one-shot armed over the live
periodic timer is measured to fire at its requested sub-tick instant (~5 ms
for a 5 ms request, far under the 10 ms tick, with the current-count correctly
reset to the sub-tick value), and the kernel-mode-fire path restores a live
periodic timer so a one-shot consumed without running schedule() cannot
strand the CPU with no timer source (make run-scheduling-context).
The monotonic clocksource discipline is now sub-tick-accurate as well. The
periodic discipline step previously floored every fire to epoch + TICK_NS
(max(tsc_interpolated, epoch + TICK_NS)), which inflated a real sub-tick
interval to a full tick and hid sub-tick deadlines from the accounting clock.
discipline_clocksource_tick now trusts the TSC interpolation at sub-tick
granularity and falls back to the TICK_NS floor only when the interpolated
advance is implausibly small (below MIN_DISCIPLINED_ADVANCE_NS), preserving a
minimum forward rate against a degenerate TSC (publish_monotonic_ns enforces
only non-decreasing time, not a minimum rate). A boot proof advances a real
TICK_NS/2 interval through one discipline step and asserts monotonic_ns()
tracked the sub-tick delta rather than the full-tick floor
(make run-scheduling-context).
The first activation increment is now real: the CpuIsolationLease
activation preflight performs real per-CPU periodic-tick suppression for
the narrow single-runnable-entity window. When the preflight finds every
proof obligation satisfied – exactly one runnable caller on the target CPU,
ready housekeeping CPU, no local deferred-cleanup/timer dependency, valid
accounting target, live monotonic clocksource, non-stale one-SQ-consumer, and
bounded revocation latency – and the target CPU is the CPU running the
preflight, it masks the periodic LAPIC tick and arms a bounded one-shot
deadline at min(nearest pending timer wakeup, now + max revocation latency).
Network polling is now routed to a housekeeping CPU rather than kept read-only
fail-closed (landed 2026-06-04): the in-kernel virtio-net poll skips driving
from a lease-isolated CPU (virtio::poll_scheduler consulting
sched::current_cpu_lease_nohz_active()), so the admission network_polling
gate flips to a routed-periodic-network-polling-to-housekeeping-cpu admit when
a housekeeping CPU is available and fails closed otherwise. IRQ affinity is now
routable in a bounded form (landed 2026-06-04): when a lease opts in, the
activation path reprograms the leased CPU’s legacy IO-APIC redirection-entry
destinations onto the selected housekeeping CPU (mask-before-reprogram +
read-back, restored on rollback/revoke) before admitting tick suppression, and
keeps the conservative rejected-irq-affinity-not-routed-to-housekeeping refusal
for any ring-coupled lease whose IRQ dependency cannot be safely rerouted. The
live reroute is presently scoped to a quiescent housekeeping destination: under
the in-kernel KVM irqchip, reprogramming an IO-APIC redirection-entry destination
onto a CPU that is actively scheduling stalls forward progress on that
destination CPU, so a general “reroute onto any housekeeping CPU regardless of
occupancy” admission remains future work behind a real destination-quiescence
gate or a delivery backend without that re-evaluation cost. Every disqualifying
change (stale lease generation, a
second runnable entity, stealable sibling work, a local deferred-cleanup
dependency, a target-CPU mismatch, or a one-shot backend that can no longer
arm a deadline) rolls the CPU back to the periodic LAPIC tick first, before
ordinary work continues. Generic full-nohz for ordinary budgeted compute threads
is now admitted through explicit SchedulingContext-targeted compute leases. A
generic SQPOLL nohz state machine now admits explicitly leased caller-thread
rings when the ring is in SQPOLL running/sleeping mode with a live owner, one
SQ consumer, and bounded producer-wake/deadline rollback. Broader
userspace-poller/device-queue admission and production realtime island
admission remain future work; the periodic tick stays the fail-closed fallback
everywhere else. Timeout-based auto-revoke has since landed:
a lease created
with leaseLifetimeNs > 0 auto-revokes on first observation past its deadline
(reason=lease-expired) and a tickless CPU under it rolls back at the next
recheck (lease-lifetime-expired)
(docs/tasks/done/2026-05-30/scheduler-cpu-isolation-lease-timeout-auto-revoke.md).
SQPOLL-driven activation is now proven by
make run-scheduler-generic-sqpoll-nohz: a ring-coupled kernelSqpoll lease
whose bound ring is in SQPOLL running/sleeping mode with a live owner is
admitted for tick suppression, producer wake drives bounded non-periodic
service, and revoke/stale-owner rollback fails closed. The per-CPU
idle thread has also landed – the scheduler idle path is now a CPL0 per-CPU
kernel idle thread and the user-mode idle process is gone (docs/tasks/README.md).
The non-atomic createLease-vs-revokeGrant SMP window
(kernel/src/cap/cpu_isolation_pool_grant.rs:472-483) – a createLease that
passes the grant live-check on one CPU can register its lease just after a
concurrent revokeGrant on another CPU snapshotted the registry, so that lease
is not cascade-terminated and lingers until its own leaseLifetimeNs or an
explicit revoke – is now a modeled, bounded residual rather than a prose-only
caveat. The Alloy lease/grant authority model represents it explicitly as the
WindowLingering set and checks that no live lease reaches a revoked grant
outside it. That the lingering lease was nonetheless legitimately authorized
(no lease is ever minted through an already-revoked grant) is a temporal
mint-time-vs-revoke property the static relational model does not itself check;
it rests on the code’s create-time minted_grant_live gate
(cpu_isolation_pool_grant.rs:484), which fails closed before admission. Taken
together this is a bounded capacity-hold window, not an authority escalation. The
companion TLA+ model checks the two-lock teardown the cascade and prune share
(generation advances exactly once, no capacity double-free, no stranded
generation). Both run under make model-scheduler-lease-alloy /
make model-scheduler-lease-tla; see models/scheduler/README.md.
The nohz/tickless activation-rollback path – the lock-free NOHZ_ACTIVE_CPUS
bit read from ISR context against the locked dispatch.nohz_activation[slot]
record, with IPI-delivered cross-CPU activation/rollback – is likewise now a
checked model rather than a prose-only invariant. The TLA+ lifecycle model
(models/scheduler/nohz_activation.tla) checks that no scheduler CPU is ever
left timer-less (a fired one-shot always has the contention fallback re-arm
enabled, and is always eventually re-armed), that the lock-free bit and the
locked record always reconcile (the bit-set/record-cleared and
record-present/bit-cleared divergences the rollback and contention paths produce
are transient), and that a staled remote activation is dropped rather than
applied to a newer lease (a staled generation is never committed, and a
recorded generation staled by the cap-side maybe_expire path is always rolled
back by the stale-lease-generation disqualifier). A focused Loom test pins the
lock-free-bit ↔ locked-record reconciliation under the C11 memory model. Both
run under make model-scheduler-nohz-tla / make model-scheduler-nohz-loom;
see models/scheduler/README.md.
Ring mode:
#![allow(unused)]
fn main() {
enum RingMode {
Syscall,
SqpollStarting,
Sqpoll,
SqpollStopping,
}
}
In syscall mode, the owner thread’s cap_enter drains SQ. In SQPOLL mode, a
kernel worker owns SQ head; userspace owns SQ tail and CQ head; cap_enter
waits for completions and may wake a sleeping poller, but it does not drain
SQ.
SQPOLL state:
Disabled -> Starting -> Running -> IdleSpinning -> Sleeping -> Stopping
The wake protocol uses a NEED_WAKEUP flag. Userspace release-stores the SQ
tail, acquire-loads flags, and invokes a wake path only if the poller has gone
to sleep.
The race-free sequence is normative.
Poller before sleeping:
#![allow(unused)]
fn main() {
flags.fetch_or(NEED_WAKEUP, SeqCst);
let tail = sq_tail.load(Acquire);
if sq_head != tail {
flags.fetch_and(!NEED_WAKEUP, Release);
continue;
}
park();
}
Producer:
#![allow(unused)]
fn main() {
write_sqe();
sq_tail.store(new_tail, Release);
fence(SeqCst);
let flags = flags.load(Acquire);
if flags & NEED_WAKEUP != 0 {
wake_poller();
}
}
The poller must set NEED_WAKEUP before the final tail recheck. Otherwise a
producer can publish a new SQE after the poller checks the tail but before it
parks, losing the wake.
The NEED_WAKEUP publication must also be ordered before the final tail
recheck by a full store-to-load barrier. A SeqCst RMW is the simplest
portable rule for the ABI text; an implementation may substitute an explicitly
reviewed architecture-specific fence or park primitive that provides the same
ordering. A plain release store or release-only RMW is not sufficient for this
protocol.
The producer must likewise order the SQ tail publication before checking
NEED_WAKEUP. The normative sequence uses a full fence between
sq_tail.store(..., Release) and flags.load(Acquire); an implementation may
substitute an explicitly reviewed equivalent that prevents the producer from
missing NEED_WAKEUP while the poller misses the new tail before parking.
An SQPOLL CPU may suppress the periodic tick only if:
cpu role is SqpollIsolated
exactly one runnable entity is the poller
no ordinary user thread is runnable there
no timer-side SQ polling is enabled
no network scheduler polling is pinned there
no deferred cleanup is pinned there
stable clocksource/accounting exists
housekeeping CPU is online
If any condition fails, restore periodic tick or migrate the unrelated work.
NoHz Activation Proof Obligations
To enter SqpollNoHz or future AutoNoHz, the scheduler must prove:
exactly one runnable entity is assigned to the CPU
at least one housekeeping CPU is online
no local network polling dependency remains
no timer-side SQ polling can run for the active ring
no local deferred cleanup or unbound kernel worker is pinned there
no unmigratable IRQ targets that CPU unless explicitly allowed
clocksource and CPU accounting are boundary/counter driven, not tick driven
revocation latency is within the lease policy
The proof is dynamic. If any condition stops holding, the scheduler must restore periodic tick, migrate unrelated work, revoke the lease, or leave nohz mode before continuing.
Layer 3: AutoNoHz CPU Lease
The long-term design should split eligibility from activation.
Eligibility says a thread, process, ring, or realtime island may use nohz isolation:
#![allow(unused)]
fn main() {
enum NoHzKind {
Idle,
KernelSqpoll,
AutoCompute,
AutoUserspacePoller,
RealtimeIsland,
}
struct NoHzEligibility {
kind: NoHzKind,
max_revocation_latency_ns: u64,
preferred_cpus: CpuSet,
allow_busy_spin: bool,
accounting_target: CpuAccountingTarget,
}
enum CpuAccountingTarget {
CurrentSchedulingContext,
SchedulerResourceLedger,
}
}
Activation is a scheduler proof that a CPU currently satisfies isolation conditions. Without a lease, a latency-sensitive hint may influence placement but must not grant exclusive CPU access.
Future lease shape:
CpuIsolationLease:
owner process/session
allowed CPU set
allowed mode: poller/compute/kernel-worker
accounting target, not CPU-time credit
revocation policy
Housekeeping must be explicit:
Housekeeping CPU set:
global timers
deferred frees
cleanup
statistics
non-critical kernel workers
debug/watchdog
load balancing and migration control
Layer 4: Deadline Metadata
Deadline metadata lives in fixed ring ABI fields, not in a Cap’n Proto SQE
envelope and not in variable side metadata. The current fixed SQE layout should
not be silently reinterpreted; add these fields through a versioned
CapSqeV2/ring ABI gate when the transport is ready.
#![allow(unused)]
fn main() {
#[repr(C)]
struct CapSqeV2 {
// existing fixed CapSqe fields, unchanged in order and meaning
deadline_ns: u64, // absolute monotonic deadline, 0 = none
qos_flags: u32, // drop/allow/reorder/propagate semantics
sched_ctx_id: u32, // 0 = current/default scheduling context
}
}
deadline_ns is an absolute monotonic timestamp. It is request freshness
metadata, not a promise of nanosecond wakeup precision. The kernel may round
timer programming to clockevent granularity, coalesce timers where policy
allows, or report a miss when dispatch observes the timestamp has already
expired. The field remains u64 nanoseconds because absolute u64 ns values
are simple, tracing-friendly, and shared with existing timeout surfaces; a
u64 microsecond field saves no ABI space.
Only consider a compact profile if SQE space becomes critical:
#![allow(unused)]
fn main() {
deadline_delta_us: u32
}
That profile would be a soft-deadline compact transport shape only. It is not
the primary realtime or SchedulingContext ABI and must not replace
deadline_ns for admitted realtime work.
ABI negotiation uses both bootstrap metadata and a runtime query surface:
#![allow(unused)]
fn main() {
struct RuntimeBootInfo {
ring_addr: u64,
ring_abi_version: u32,
sqe_size: u16,
cqe_size: u16,
}
}
- Process bootstrap passes the ring ABI version and fixed entry sizes alongside the ring address.
RuntimeBootInfo, ring ABI version constants, and fixed SQE/CQE layouts live incapos-config/src/ring.rs; the kernel andcapos-rtimport the same definition rather than carrying local copies.- A future
RuntimeInfo/SystemInfoquery returns the kernel-supported ring ABI range so language runtimes can fail before mapping incompatible rings. cap_enterrejects unsupported SQE versions or entry sizes with stable transport errors such asCAP_ERR_UNSUPPORTED_RING_ABIandCAP_ERR_UNSUPPORTED_SQE_VERSION.- Runtimes in Rust, C, Go, and other languages must generate or mirror the exact fixed layout for the negotiated version.
Suggested flags:
DROP_IF_LATE:
if now > deadline_ns before dispatch, post DEADLINE_EXPIRED
ALLOW_LATE:
dispatch anyway, but CQE/telemetry marks late
PROPAGATE_DEADLINE:
endpoint CALL/RETURN carries deadline metadata to server-side request
DEADLINE_ORDERED:
SQPOLL may reorder within a bounded window only when all reorder-safety
checks below pass
NO_BLOCKING_PATH:
reject if target method/op is not declared realtime-safe
Do not put budget, period, priority, criticality, or CPU affinity into each SQE. Deadline is per request. Budget is execution authority.
DEADLINE_ORDERED is valid only when all of the following are true:
the ring mode permits reordering
the SQE marks this request reorderable
the target capability interface and method declare reorder-safe semantics
the reordering window is bounded
the operation does not depend on earlier same-ring requests for correctness
Ordered side effects such as write A; write B; flush or lock; mutate; unlock must not be deadline-reordered unless the target method contract
explicitly defines that sequence as reorder-safe.
Layer 5: SchedulingContext
CPU time should become a capability-controlled object:
#![allow(unused)]
fn main() {
struct SchedulingContext {
budget_ns: u64,
period_ns: u64,
relative_deadline_ns: u64,
priority: u16,
criticality: u8,
cpu_mask: CpuSet,
overrun_policy: OverrunPolicy,
timeout_endpoint: Option<EndpointRef>,
}
}
Kernel responsibilities:
- decrement remaining budget by actual runtime;
- replenish budget by period;
- throttle or fault a thread on depletion;
- enforce CPU mask and scheduling eligibility;
- dispatch among eligible contexts by the selected realtime policy;
- prevent untrusted SQE bytes from minting budget.
Policy-service responsibilities:
- admission control;
- budget/period/priority selection;
- CPU-isolation lease policy;
- overload response;
- telemetry and retuning.
Layer 6: Donation
Synchronous capability calls need scheduling-context donation:
client SchedulingContext -> passive server endpoint
server runs on donated budget/deadline
context returns on reply
timeout/overrun reports to caller or island policy
Without donation or inheritance, a realtime caller can be defeated by a normal-priority server that holds the capability implementation path.
Donation semantics must be fixed before implementation:
max donation call depth:
bounded per SchedulingContext or RealtimeIsland; overflow fails closed.
nested donation:
nested synchronous calls carry the current donated context until the depth
bound, unless a callee uses its own admitted context by explicit policy.
cycle handling:
a donated context may not re-enter a thread already on its donation stack;
cycles fail with a typed realtime/donation error.
partial failure:
budget already consumed stays charged to the context that ran the work.
rollback of authority or memory is separate from CPU charge rollback.
timeout propagation:
the earliest of request deadline, scheduling-context deadline, and explicit
call timeout bounds downstream execution.
server-side blocking:
a passive server running on donated context may block only on approved
realtime-safe waits or synchronous calls that continue donation.
return on exception:
application exceptions, transport errors, and cancellation return the
context to its previous owner before CQE/error delivery.
async endpoint queues:
donation does not cross ordinary async endpoint enqueue by default. Async
donation requires an explicit future token/lease design.
Hot admitted paths should avoid blocking locks. If a shared resource cannot be modeled as a passive service, it needs a reviewed priority/deadline-inheritance primitive or a bounded try-lock/fail/drop policy.
Layer 7: RealtimeIsland
RealtimeIsland admits a whole loop or graph:
#![allow(unused)]
fn main() {
struct RealtimeIslandSpec {
period_ns: u64,
deadline_ns: u64,
cpu_set: CpuSet,
nodes: Vec<NodeBudget>,
rings: Vec<RingSpec>,
memory: Vec<PreallocSpec>,
devices: Vec<DeviceReservation>,
overrun_policy: OverrunPolicy,
}
}
Admission requires:
- total budget fits period/deadline constraints;
- all hot-path buffers are preallocated;
- hot-path memory is committed and resident before start;
- guaranteed hot-path memory uses the OOM proposal’s
MemoryResidencypolicy aspinnedorsecret;normalmemory is not admitted for guaranteed hot paths. A future lock-resident operation may transition ordinary memory into a pinned reservation before admission, but the admitted island sees the result aspinned, not asnormal; - all caps and policy decisions are resolved before start;
- no expected page faults on the hot path;
- no unbounded lock acquisition;
- no blocking endpoint calls inside callback loops;
- no allocation, logging, service discovery, or provider credential work on the realtime path;
- IRQ and deferred work are bounded or moved outside the island.
Failure semantics must be typed:
CAP_ERR_DEADLINE_EXPIRED
CAP_ERR_BUDGET_EXHAUSTED
CAP_ERR_REALTIME_UNSAFE_PATH
CAP_ERR_REALTIME_ADMISSION_DENIED
CAP_ERR_OVERRUN
CAP_ERR_STALE_INPUT
CQE/status should distinguish not-started-late, completed-late, dropped by policy, throttled, and dependency-cancelled.
Policy-Service Userstories: AutoNoHz Placement for Compute-Capable Threads
The Layer 1-7 primitives above are mechanism: NoHzEligibility is a reviewed
claim, CpuIsolationLease is the placement authority, SchedulingContext and
the coarse ResourceLedger own CPU-time budget, and NoHzActivation is the
scheduler proof that current CPU state allows tick suppression. They do not
answer who decides to issue an eligibility hint for an ordinary user thread
that was not pre-declared as a realtime island or kernel SQPOLL worker, or
what observation justifies the issuance. That decision is policy, and it
belongs in the user-space scheduler policy service described in
Stage 7 of scheduler-evolution-proposal.
This section records the userstories that motivate the responsibility and the
bounds the policy service must enforce so auto-promotion never becomes an
implicit “unlimited CPU-hold” grant.
Core property: promotion is placement, not budget
Auto-promotion adds isolation; it never mints CPU-time authority. A
policy-issued CpuIsolationLease only removes tick and scheduler noise while
its bound thread consumes time that was already authorized through its
SchedulingContext or coarse ResourceLedger. SchedulingContext budget
exhaustion is now folded into the same nearest-deadline timer as nohz
revocation/timer work, so a tick-masked CPU is re-observed at the budget
deadline rather than at a later periodic tick. When budget exhausts, or when any
existing Layer 3 activation obligation stops holding, the existing fail-closed
rollback path restores the periodic tick. Priority-aware revocation of the lease
itself when an equal-or-higher-priority runnable arrives is new Phase H surface
(see “Bounds the policy service must enforce” below); today’s Phase F rollback
only restores ticks on the leased CPU and does not terminate the lease.
This separation answers the obvious objection. A busy-spinning thread cannot escalate itself into permanent CPU exclusivity, because the spin drains its allotted budget at the same rate periodic scheduling would have drained it. If the operator has granted enough budget to saturate a core, auto-promotion removes tick interference while that budget is consumed; if not, the same authority that would have throttled the thread under periodic scheduling still throttles it under nohz.
Trigger: “thread appears capable of utilizing a full CPU core”
The trigger is not a fixed percentage threshold inside the kernel. The kernel exports per-thread observation; the policy service synthesizes a saturation-capability signal from those observations and decides what “capable of utilizing a full CPU core” means for a given account, session, or service profile. Plausible inputs the policy service may combine:
- runtime accumulated over a rolling window approaches the wall-clock window the thread had on its assigned CPU;
- voluntary-block count over the same window stays low (the thread is not IPC- or IO-bound at a rate that would lose the benefit);
- runnable-but-not-running time stays low when the thread is the only runnable entity on its CPU, or correlates with placement contention rather than IO when it is not.
Concrete window length, smoothing, and the synthesis rule are policy-service
choices, replaceable without ABI churn. As of 2026-05-30 the kernel exports
the observation inputs the heuristic consumes as ordinary (non-measure)
per-thread state: runtime_ns/virtual_runtime_ns, voluntary_blocks,
preemptions, and a cumulative runnable_accumulated_ns
(runnable-but-not-running time) are all returned by
SchedulingPolicyCap.snapshot @2. voluntary_blocks and preemptions were
promoted out of cfg(feature = "measure") and runnable_accumulated_ns was
added at the run-queue enqueue/select boundary; only migrations remains
measure-gated. This closes the Phase H “monitoring/status surface that
exports per-thread saturation observation” prerequisite. The surface exports
raw cumulative counters only: no fixed threshold and no windowing live in the
kernel – the policy service synthesizes the saturation signal.
Userstories
-
Long-running compute tenant with declared budget. A model-training, video-encoding, or HPC build job is admitted with a
SchedulingContext(or coarseResourceLedgerallocation) sized for sustained near-core utilization on a declared CPU pool. The policy service observes the thread saturating the pool’s CPU share, issues a boundedCpuIsolationLeaseagainst the pool, the scheduler proves the activation obligations from Layer 2/3, and ticks are suppressed for as long as the thread keeps consuming the granted budget. The lease ends when the budget exhausts, the job completes, the operator revokes the pool, or the saturation signal subsides. -
Userspace poller that earned isolation. A service polls a ring or device queue (a candidate
AutoUserspacePollerin theNoHzKindtaxonomy). The policy service sees consistent saturation with low voluntary blocking, recognizes theAutoUserspacePollereligibility kind, and issues a lease. The bounds are the same as for the kernel SQPOLL path; only the consumer differs. -
Account-scoped auto-claim pool. An operator pre-declares “account X may auto-claim up to N isolated CPUs from pool P, maximum auto-lease lifetime L, with revocation latency R, charging to ledger E.” The policy service monitors threads owned by X, issues leases against P when saturation capability is observed, and refuses promotion when X already holds N leases or when no CPU in P currently satisfies the activation proof. Without the operator declaration the policy service does not auto-promote.
-
Background agent that bursts to full-core compute. A general-purpose agent process does not normally saturate a core. When it briefly does (a planning phase, a build step, a local inference call), the policy service may issue a short-lifetime lease if the agent’s account has authorized auto-promotion. When the burst ends the signal subsides; the lease is not renewed.
Bounds the policy service must enforce
For every auto-issued lease the policy service records:
lifetime_ns: bounded; shorter than admin-issued leases by
default; renewal requires re-observing the
saturation signal.
max_revocation_latency_ns: bounded by NoHzEligibility.max_revocation_latency_ns;
cannot exceed the operator/account policy.
accounting_target: a live SchedulingContext or coarse ResourceLedger;
the lease does not mint CPU-time authority.
auto_claim_pool: the pre-authorized CPU set; no implicit fallback to
system-wide isolation.
fairness_preemption: another runnable entity at equal-or-higher policy
priority terminates the lease if no other CPU
authorized by both the pool and lease mask is
eligible.
Two of these bounds map to existing kernel-enforced surfaces:
max_revocation_latency_ns is already a field on NoHzEligibility and the
closed Phase F activation preflight; accounting_target is already a field
on NoHzEligibility and the live SchedulingContext/ResourceLedger
authority.
The other three bounds need new kernel-enforced surfaces before the heuristic can ship and are named as Phase H prerequisites:
lifetime_ns: LANDED 2026-05-30.CpuIsolationLeaseSpecnow carriesleaseLifetimeNs @6(0= no expiry, the default). A lease records an absolute monotonicexpires_at_nsat creation; the first observation past the deadline auto-revokes through the existing generation-advancing cleanup (reason=lease-expired), and the nohz activation record carries the lifetime deadline so a tickless CPU rolls back at the next timer/IPI recheck (lease-lifetime-expired), bounded bymaxRevocationLatencyNs. This is the bounded-lifetime guarantee the auto-issued placement lease needs, so a compromised, blocked, or malfunctioning policy service cannot leave an auto-issued lease holding the CPU indefinitely. The bounded renewal primitive LANDED on top of this:CpuIsolationLease.renew @4pushesexpires_at_nsforward tonow + leaseLifetimeNs(clamped to the same one-hour ceilingread_specenforces), keeping the same(leaseId, generation), accounting binding, and nohz activation state – distinct from re-minting a fresh lease. It is callable only before expiry (a revoked, auto-revoked, or past-deadline lease staysstaleGenerationand is not resurrected; an unboundedleaseLifetimeNs = 0lease reportsnotRenewable), and the renewed deadline is propagated to a tickless CPU’s nohz activation record so thelease-lifetime-expireddisqualifier no longer rolls it back at the old deadline;CpuIsolationLeaseInfo.expiresAtNsechoes the deadline read-only. Only the Phase H renewal heuristic – re-observing the saturation signal to decide whether to callrenewon a near-expiry lease – remains future policy-service work on top of this primitive.auto_claim_pooland per-account capacity (Nin userstory 3): the operator-declared CPU-pool descriptor LANDED 2026-05-30, making a non-defaultpoolIdmeaningful for the first time.CpuIsolationLeaseSpeccarriespoolId @7(0= the implicit default pool over every scheduler CPU), and the kernel seeds a fixed declared-pool registry (CpuIsolationPoolDescriptor: the default pool0plus exactly one declared non-default pool1over a single CPU). The create-time admission gate now looks the pool up: an undeclaredpoolIdis rejectedinvalidSpec; a declared pool whose CPU mask the lease’sallowedCpuMaskexceeds is rejectedinvalidSpec; a declared pool with a subset mask is admitted and its id/mask are echoed read-only throughCpuIsolationLeaseInfo(admittedPoolId/admittedPoolCpuMask) (proofmake run-scheduler-cpu-isolation-lease:nondefault_pool=invalidSpecfor the undeclared id,declared_pool=ok admitted_pool_id=1 admitted_pool_cpu_mask_subset=true,declared_pool_mask_violation=invalidSpec,default_pool_id=0). The declared-pool table is now operator-sourced (LANDED 2026-05-30): the kernel installs it from the boot manifestSystemConfig.cpuIsolationPools @14(aList(CpuIsolationPoolDescriptor)), with the in-kernel constant as the fail-closed default when the manifest omits the list, and validates each entry fail-closed at boot (canonical CPU mask subset of the scheduler mask, default pool0synthesized if omitted, duplicate ids rejected). The boot linecpu-isolation: declared-pools source=manifest count=3 default_pool_id=0 nondefault_pool_id=1 nondefault_pool_cpu_mask=0x2proves the source (proofmake run-scheduler-cpu-isolation-lease; the kernel-default fallback is proven bycargo test-configdecode/empty assertions). The descriptor now also carries a per-pool live-lease capacity bound (poolMaxLeases @2, LANDED 2026-05-31): a non-zero value caps the number of simultaneously live (non-revoked, current-generation) leases the kernel admits against that pool at create-time, counted from the existingLEASE_REGISTRYafterprune_dead, rejecting an over-capacity create fail-closedresourceExhausted(0= unbounded, preserving the default pool0and every existing producer). The manifest bounds pool2atpoolMaxLeases: 2; the proof admits two live leases, refuses a third non-overlapping create (cpu-isolation: pool-capacity-rejected admitted_pool_id=2 live_leases=2 pool_max_leases=2 result=resourceExhausted,pool_capacity_exceeded=resourceExhausted), then reclaims after a revoke (pool_capacity_reclaimed=ok), proving the bound is live-count not cumulative. The account identity and per-accountNthen landed on top of this counter (LANDED 2026-05-31):CpuIsolationLeaseSpeccarriesaccountId @8 :UInt64(0= unattributed, caller-asserted and inert until counted, echoed read-only throughCpuIsolationLeaseInfo.accountId @6) andCpuIsolationPoolDescriptorcarriespoolMaxLeasesPerAccount @3 :UInt32(0= unbounded per account). After the pool-wide check,registercounts the requesting account’s live entries (matching bothadmitted_pool_idandaccount_id) against the per-account bound and rejects an over-bound create fail-closedresourceExhausted(0account or0bound skips the gate). The manifest bounds pool2atpoolMaxLeasesPerAccount: 1; the proof admits one account-7 lease, refuses a second account-7 create (cpu-isolation: account-capacity-rejected admitted_pool_id=2 account_id=7 account_live_leases=1 pool_max_leases_per_account=1 result=resourceExhausted,account_capacity_exceeded=resourceExhausted), admits a different account-9 lease on that CPU (account_capacity_other_account=ok– per-account, not pool-wide), and reclaims after revoking account-7 (account_capacity_reclaimed=ok). The account id is caller-asserted on the plain lease path. The authentication half LANDED 2026-05-31:CpuIsolationPoolGrant(schema/capos.capnp; sourcecpu_isolation_pool_grant; kernelkernel/src/cap/cpu_isolation_pool_grant.rs) introduced a bootstrap-staged grant that binds one authenticated account to one declared pool. ItscreateLeasestamps the bound account/pool onto the minted lease, overriding any caller-assertedaccountId/poolId, and reuses the same lease-create admission path (cpu_isolation::create_lease_for_caller) – so the per-account bound is unforgeable by cap-possession: a holder cannot assert another account to evadepoolMaxLeasesPerAccount. The initial single-grant proof used account7bound to pool2; the currentmake run-scheduler-cpu-isolation-pool-grantproof boots manifest-declared grants. The grant binding is now operator-declared (LANDED 2026-06-01): the manifestSystemConfig.cpuIsolationPoolGrantstable seeds the bound(account, pool)pairs (mirroring thecpuIsolationPoolstable), and thecpu_isolation_pool_grant/cpu_isolation_pool_grant_secondarysources stage seeded binding index0/1, so an operator can pre-authorize multiple distinct accounts/pools, each staged as its own bootstrap grant cap. An absent/empty list falls back to one in-kernel binding at index0: account7bound to preferred pool1when active, otherwise account7bound to synthesized default pool0, preserving a usable single default grant when a manifest-sourced pool table omits pool1.make run-scheduler-cpu-isolation-pool-grantnow boots a two-entry table (account5/pool1, account8/pool2) and proves each grant stamps its OWN bound account with the per-account bound still enforced.make run-scheduler-cpu-isolation-pool-grant-defaultboots the empty-list fallback with pool1omitted and proves the synthesized(account 7, pool 0)grant is usable. Runtime grant minting landed 2026-06-02 22:24 UTC (CpuIsolationGrantMinter): one cap mints a freshCpuIsolationPoolGrantfor an operator-chosen(account, pool)at call time, bounded by the declaredSystemConfig.cpuIsolationGrantMinterAllowlist(an out-of-allowlist pair is refusedunauthorized, so the minter is never an ambient grant-any authority; the minted grant reuses the same unforgeablecreateLeaseadmission path). The samemake run-scheduler-cpu-isolation-pool-grantsmoke mints a grant for the allowed(account 6, pool 2), proves itscreateLeasestamps account6and stays bounded by the per-account gate, and proves an out-of-allowlist mint is refused. Grant-revocation lifecycle landed 2026-06-03 17:11 UTC (CpuIsolationGrantMinter.revokeGrant), closing (c): a runtime-minted grant carries a revocable(grantId, generation);revokeGrant(grantId)advances the grant generation so a stale grant handle’screateLeasefailsstaleGenerationand mints nothing, and revocation cascades to every live lease minted through that grant – reusing the landed fairness-termination cleanup (reason=grant-revoked, periodic-tick rollback, registry unregister) once per tagged lease, so per-pool/per-account live-lease capacity frees immediately and a fresh grant is admitted into the reclaimed slot. Double-revoke isalreadyRevokedand an unknowngrantIdisunknownGrant, both fail-closed; seeded bootstrap grants are not minter-owned and stay un-revocable. The samemake run-scheduler-cpu-isolation-pool-grantsmoke proves the full lifecycle. No pool authority is minted from holding a lease cap; the kernel stays the fail-closed admission gate.fairness_preemption: LANDED 2026-06-02 21:17 UTC. The Phase F rollback path now compares policy priority at the existing nohz recheck site: when a second runnable entity appears on the leased CPU at equal-or-higher WFQ policy priority (latency_class,weight) than the captured leased thread, and no sibling CPU authorized by both the admitted pool and the leaseallowedCpuMaskis eligible to host the lease, the kernel terminates theCpuIsolationLeaseitself (fairness-preempted ... result=lease-terminated) rather than only restoring the periodic tick, bounded bymaxRevocationLatencyNs. The termination runs the same generation-advancing cleanupleaseLifetimeNsexpiry uses (reason=fairness-preempted) immediately after the scheduler restores the periodic tick, so a subsequentinfo/revokereportsstaleGenerationand placement/account capacity is freed without waiting for the holder’s next cap call; a strictly-lower arrival or an eligible sibling CPU inside both masks keeps the existing tick-restore-only behavior. The kernel supplies the comparison and fail-closed termination; the policy service remains the issuer and bookkeeper of the saturation signal. Re-placement of the leased thread onto an eligible sibling CPU (instead of terminating) remains generic-full-nohz work; the “no sibling eligible” condition is recorded.
The policy service is the issuer and the bookkeeper of the synthesized saturation signal; the kernel remains the authority gate, the activation prover, and the fail-closed rollback path – including for the three not-yet-existing surfaces above.
Explicit non-goals
- The kernel does not contain a saturation-detection rule of its own. It exports observation; it does not synthesize the signal.
- Auto-promotion does not grant unlimited CPU-hold. The lease is bounded by lifetime, budget, revocation, and pool capacity; absent a pre-authorized pool, no auto-promotion occurs.
- Auto-promotion does not grant realtime authority.
RealtimeIslandadmission remains a separate, stricter path with preallocation, deadline, and no-blocking proofs. - Auto-promotion does not bypass donation, fairness, or session-lifecycle invariants. Process exit, session logout, and explicit revoke still tear the lease down through the existing Layer 3 rollback.
Telemetry Requirements
Tickless, nohz, SQPOLL, and realtime behavior must be observable through future monitoring/status capability surfaces, not only through ad hoc debug logs. The first counters should include:
scheduler_tick_count{cpu}
ticks_suppressed{cpu,mode}
nohz_enter_count{cpu,kind}
nohz_exit_count{cpu,reason}
oneshot_deadline_miss_count
sqpoll_busy_ns
sqpoll_sleep_count
deadline_expired_count
budget_exhausted_count
realtime_overrun_count
donation_depth_max
housekeeping_offload_count
These counters are correctness evidence. Missing or surprising values should fail focused nohz/realtime proofs rather than being treated as performance-only diagnostics.
The ticks_suppressed{cpu,mode} / scheduler_tick_count{cpu} evidence is
realized as an asserted proof line on the lease path:
make run-scheduler-cpu-isolation-lease now counts genuine periodic LAPIC
fires per CPU (a fire is counted only when neither the lease-backed nor the
idle tick-suppression bit is set, so the one-shot replacement is never
miscounted) and, on lease nohz rollback, emits
cpu-isolation: nohz suppressed-ticks cpu=<n> window_ns=<w> expected_periodic=<e> actual_periodic=<a> suppressed=<e-a>. The harness
asserts that over a bounded masked window the leased CPU recorded actual near
zero while expected was substantial – the periodic tick demonstrably stopped,
not merely that the mask write was issued – and that a bounded post-rollback
cpu-isolation: nohz restored-rate window shows the periodic rate returning.
This is bounded proof-line evidence, not yet a durable
SchedulingPolicyCap/monitoring telemetry field; the persistent
ticks_suppressed surface and the generic-full-nohz path’s inheritance of the
same measured assertion remain future telemetry work.
Implementation Sequence
- Add timer/scheduler instrumentation around the existing periodic tick.
- Add
monotonic_ns()backed by a clocksource that is not derived from the scheduler tick, and switchTimer.nowplus scheduler accounting to that clocksource while keeping periodic scheduling. Completed for normal QEMU/x86_64 by the Phase F clockevent/deadline substrate. - Convert timeout waiters to
deadline_ns. Completed forTimer.sleep, finitecap_enter, and park timeouts by the Phase F clockevent/deadline substrate. - Add LAPIC one-shot programming, periodic restore state, and a focused one-shot smoke. Completed as a disabled-nohz substrate proof by the Phase F clockevent/deadline substrate.
- Replace user-mode idle with kernel/per-CPU idle while keeping periodic
ticks. Completed: the scheduler idle path is now a CPL0 per-CPU kernel idle
thread and the user-mode idle process is gone (
docs/tasks/README.md). - Enable tickless idle only when there is no runnable work. Completed by
docs/tasks/done/2026/scheduler-tickless-idle-step6.md: true-idle CPUs with no runnable non-idle work, no active nohz lease, no local deferred cleanup, no cap-enter polling dependency, and a one-shot LAPIC clockevent mask the periodic tick and arm a bounded one-shot at the nextTimer/ParkSpacedeadline or the 100 ms idle housekeeping floor. The scheduler restores the periodic tick before ordinary non-idle dispatch, on reschedule IPIs, and on backend/refusal rollback. Cap-enter polling waiters and ready-but-budget-throttledSchedulingContextretry windows remain periodic until the legacy terminal/network/IRQ polling and scheduling-context retry surfaces move behind explicit deadlines or housekeeping placement. - Route the in-kernel virtio-net poll off a lease-isolated CPU to the
housekeeping CPU (landed 2026-06-04); an explicit
NetworkPollClockpoll deadline remains the longer-term target. - Add ring mode state and refuse timer-side SQ processing for SQPOLL rings.
- Land Ring v2 per-thread ring ownership and completion routing.
- Add the SQPOLL wake/sleep protocol and a host or Loom-style lost-wakeup model.
- Add kernel SQPOLL without full-nohz, under normal scheduler ticks.
- Add CPU isolation leases and housekeeping CPU placement.
- Prove SQPOLL progress through a wake/deadline path that does not depend on periodic scheduler ticks. Completed for bounded current-thread syscall/producer-wake progress by the Phase F SQPOLL nohz-progress child.
- Enable SQPOLL nohz on isolated CPUs for explicitly leased caller-thread rings. Landed 2026-06-07 09:45 UTC; broader userspace-poller/device-queue policy issuance remains separate.
- Add request
deadline_nsmetadata and typed late/drop CQE outcomes. - Add
SchedulingContextand admission-controlled realtime islands. - Add generic full-nohz admission for ordinary budgeted compute threads
through explicit
SchedulingContext-targetedCpuIsolationLeasepreflight. Landed 2026-06-06 09:44 UTC; policy-service issuance remains separate. - Add the user-space policy-service AutoNoHz placement heuristic. The
kernel exports per-thread saturation observation through the
monitoring/status surface; the policy service synthesizes the “thread
appears capable of utilizing a full CPU core” decision and issues
bounded
CpuIsolationLeasegrants against pre-authorized account or session CPU pools. The auto-revoke timeout primitive (leaseLifetimeNs) landed 2026-05-30 15:22 UTC at84c1c5ba, priority-aware fairness lease termination landed 2026-06-02 21:28 UTC atcae825a4with immediate release remediation atca28ef63, runtime grant minting (CpuIsolationGrantMinter) landed 2026-06-02 22:25 UTC at5c5c63cc, and the grant-revocation lifecycle (CpuIsolationGrantMinter.revokeGrantwith cascade-to-leases) landed 2026-06-03 17:11 UTC, completing the pool-grant authority surface. The local userspace policy-service proof landed 2026-06-07: it reads the per-thread saturation counters, denies a voluntarily blocking worker, issues a finite grant-stamped full-nohz lease only after a saturated local window, renews only after re-observation, and lets stopped renewal expire fail-closed. A reusable production policy daemon with profile-driven smoothing, cross-process target discovery, and richer operator policy remains future work.
Verification
Tickless idle gates:
make fmt-check
cargo test-lib
cargo test-config
make run-smoke
make run-spawn
Additional tickless proof:
1 second idle interval does not produce 100 scheduler ticks
Timer.sleep still completes
cap_enter timeout still completes
ParkSpace timeout still completes
preemption fairness unchanged with runnable contention
SQPOLL gates:
thread-lifecycle
timer-smoke
timer-flood
park wake/timeout
endpoint CALL/RECV/RETURN
mandatory host or Loom-style lost-wakeup model before any real SQPOLL worker:
poller: set NEED_WAKEUP -> full barrier -> recheck tail -> park
producer: write SQE -> publish tail -> full barrier -> check NEED_WAKEUP -> wake
Realtime gates:
deadline ordering tests
budget depletion tests
donation/return tests through passive endpoint
admission denial tests
QEMU proof for late/drop/overrun behavior
telemetry counters prove ticks suppressed, deadlines expired, budgets
exhausted, and donation depth bounded as expected
Decision
Adopt this staged direction:
Tickless idle:
yes, after the kernel/per-CPU idle context and activation proof. The
clocksource/clockevent split is implemented.
Generic full-nohz:
implemented for explicit budgeted compute leases targeting a live
SchedulingContext. Automatic issuance and unbudgeted ordinary threads remain
out of scope.
SQPOLL nohz:
yes, for explicitly leased caller-thread rings whose SQPOLL poller is live,
single-consumer, and bounded by producer wake plus rollback deadlines.
AutoNoHz placement for ordinary threads:
yes, but only as a user-space policy-service decision that issues a
bounded CpuIsolationLease against a pre-authorized CPU pool. The lease
adds isolation; it never mints CPU-time authority. The "thread appears
capable of utilizing a full CPU core" signal is synthesized in the
policy service from observations the future monitoring/status surface
must export, not as a fixed kernel threshold.
Realtime:
`SQE.deadline_ns` is useful metadata, but `SchedulingContext` is the
authority that provides CPU time.
Proposal: mdBook Documentation Site
Turn the existing Markdown documentation into a navigable mdBook site that explains capOS as a working system, while keeping proposals and research as deep reference material.
The current docs are useful for agents and maintainers who already know what
they are looking for. They are weaker as a reader path: a new contributor has
to jump between README.md, docs/roadmap.md, docs/tasks/README.md, proposal files,
research reports, and source code before they can form an accurate model of
the system. The mdBook site should fix that by adding a concise, current
system manual above the existing archive.
Goals
- Make the first reading path obvious: what capOS is, how to build it, what works today, and where the important subsystems live.
- Separate implemented behavior from future design, rejected ideas, and research background.
- Preserve existing long-form proposal and research documents instead of rewriting them prematurely.
- Give architecture pages a repeatable structure so future edits do not turn into ad hoc status notes.
- Make validation visible: each architecture page should name the host tests, QEMU smokes, fuzz targets, Kani proofs, Loom models, or manual checks that support its claims.
- Keep the docs useful from a local clone, without requiring hosted services, databases, or custom frontend code.
Non-Goals
- Replacing
docs/tasks/README.md. Task records remain operational planning documents;REVIEW_FINDINGS.mdis only a tombstone for older links, anddocs/roadmap.mdis now part of the book while still owning long-range planning. - Turning proposals into user manuals by bulk editing every existing document. Long proposal files stay as references until a subsystem needs a targeted refresh.
- Building a marketing site, blog, changelog, or public product page.
- Adding MDX, React, Vue, custom components, or a JavaScript application layer.
- Automatically generating API reference documentation from Rust or Cap’n Proto. That can be evaluated later as a separate documentation track.
Audience
The site should serve three readers:
- New contributor: wants to build the ISO, boot QEMU, understand the current architecture, and find the right files to edit.
- Reviewer: wants to verify whether a change preserves the intended ownership, authority, lifecycle, and validation rules.
- Future agent: wants current project context without having to infer the system from stale proposals or source code alone.
The primary audience is maintainers and agents, not end users. This matters: accuracy, status labels, and code maps are more important than a polished external landing page.
Current State
The repository already has a substantial Markdown corpus:
README.mdexplains the project and core commands.docs/roadmap.mddescribes long-range stages and visible milestones.docs/tasks/state.tomltracks the selected milestone.docs/tasks/state.tomltracks the selected milestone; task records underdocs/tasks/track active implementation order.docs/tasks/**tracks open remediation, review-finding work, and verification history.docs/capability-model.mdis a real architecture reference.docs/proposals/contains accepted, future, exploratory, and rejected design material.docs/research/contains prior-art analysis (thecapability-systems-survey.mdsynthesis plus per-system deep-dive reports).docs/*-design.mdand inventory files capture targeted design/security decisions.
The weakness is not lack of content. The weakness is keeping the current manual visibly separate from archival planning, proposal, and research material.
Site Shape
The mdBook site should be structured as a book, not as a mirror of the file tree. The current hierarchy is:
- Start Here: reader orientation and commands.
- Runnable Demos: current user-visible proofs.
- System Architecture: current implementation, with code maps and invariants.
- Security and Verification: threat boundaries, validation workflow, and security inventories.
- Planning: roadmap, changelog, and backlog links.
- Design Archive: proposal index plus nested active, future, and rejected long-form design documents.
- Research Archive: research index plus nested prior-art reports.
All proposal and research files should remain reachable through the sidebar so mdBook builds them, but they should be nested under their indexes rather than listed as peer pages beside the current system manual. Sidebar folding should be enabled so the default reader path stays compact.
Page Standard
Every architecture page should use this shape:
---
status: "Partially implemented."
last_reviewed: "2026-04-27 10:00 UTC"
description: "Page description."
topics:
- { key: "capabilities-ipc-and-authority", reason: "Explains authority or invocation behavior." }
---
# Page Title
What problem this subsystem solves and why a reader should care.
The preprocessor strips front matter from rendered page content and uses the
metadata to regenerate docs/topics.md. A post-build agent asset pass patches
final rendered HTML so status, description, and last_reviewed appear as
page-head metadata without adding visible status blocks to each page. The same
pass adds HTML head discovery links for llms.txt and each page’s Markdown
mirror.
The docs build also emits agent-facing static assets in target/docs-site:
llms.txt, Markdown source mirrors for pages listed in docs/SUMMARY.md,
sitemap.xml, robots.txt, and a Cloudflare Pages _headers file with
discovery links. robots.txt includes a comment pointing agents to
llms.txt; crawler rules stay in standard User-agent, Allow, Disallow,
Sitemap, and Content-Signal fields.
Current Behavior
What exists in the repo today.
Design
How it works, with concrete data flow.
Invariants
Security, lifetime, ownership, ordering, or failure rules.
Code Map
Important files and entry points.
Validation
Relevant host tests, QEMU smokes, fuzz/Kani/Loom checks.
Open Work
Concrete known gaps, linked to task ledger records when relevant.
Architecture pages should normally stay between 100 and 300 lines. Longer
background belongs in proposals or research reports.
## Status Vocabulary
Use explicit status labels only where a reader could reasonably confuse
implemented behavior, accepted design, future design, or rejected material.
Status belongs on the page itself only when the page role is not already
obvious from the page type or nearby index. Put this information in YAML front
matter (`status`, `last_reviewed`, `topics`) as the first block in the file.
Canonical page-level form:
```md
---
status: "Partially implemented."
last_reviewed: "2026-04-25 11:36 UTC"
description: "Canonical page-level metadata layout."
topics:
- { key: "capabilities-ipc-and-authority", reason: "Describes authority and invocation behavior." }
---
last_reviewed is hand-maintained and uses the same minute-precision,
timezone-aware format as status updates in docs/tasks/README.md,
docs/roadmap.md, and task records. Get it from
date '+%Y-%m-%d %H:%M %Z'; do not infer or round from memory. Use this field
for substantial content edits that should reset a reader’s trust.
Use one of these labels:
- Implemented: behavior exists in the mainline code and has validation.
- Partially implemented: some behavior exists, but the page also describes missing work.
- Accepted design: intended direction, not fully implemented.
- Future design: plausible direction, not selected for near-term work.
- Rejected: explicitly not the chosen direction.
- Research note: background used to inform design, not a direct plan.
Add a page-level status label to:
- proposal pages whose content could be mistaken for current behavior
- architecture or design pages that mix implemented facts with future or partial behavior
- design-gate documents whose role is to define an accepted implementation contract before the implementation is complete
- research pages that would otherwise read like selected design rather than background
Do not add a page-level status label to:
- orientation, index, command-reference, and workflow pages where the page type already makes the role obvious
- reader-orientation overview pages whose role is to explain why the design
looks the way it does (design bets, project framing) rather than catalogue
what is implemented. These pages must point at
status.mdor the relevant architecture page for implementation state; a mixed “Partially implemented” label on them is misleading because each bullet it covers has its own, different status - status summary pages that already classify other documents
- pages whose content is purely operational and only describes current, validated behavior
When only one section differs from the rest of the page, keep the page-level
status for the dominant role of the document and add a local sentence in that
section such as Current implementation status: or Current status:. Do not
replace the page-level label with timestamped prose unless the timestamp itself
is the point.
Avoid ambiguous language like “planned” without a stage, dependency, or status label. When a page mixes current and future behavior heavily, split those sections instead of relying on status text alone.
Content Rules
The docs-scoped authoring contract lives in docs/AGENTS.md;
the rules below extend it with site-shape conventions specific to the mdBook
manual. Apply the AGENTS.md rules first when editing any file under docs/,
then layer the site-shape rules from this proposal.
- Start with operational facts, not motivation.
- Prefer concrete nouns: process, cap table, ring, endpoint, manifest, init, QEMU smoke.
- Name source files when a claim depends on implementation.
- State authority and ownership rules explicitly.
- State failure behavior explicitly.
- Link to proposals and research instead of duplicating long rationale.
- Keep
docs/roadmap.mdanddocs/tasks/README.mdas planning sources, not as content to paste into the book. - Do not describe behavior as implemented unless validation exists or the code map makes the claim directly checkable.
- Do not bury current limitations at the bottom of a long proposal.
Proposal Index
docs/proposals/index.md should classify proposal files instead of listing
them alphabetically. A useful classification:
- Active or near-term:
- service architecture
- service object capabilities
- storage and naming
- error handling
- security and verification
- SMP
- Ring v2 for full SMP
- Future architecture:
- networking
- userspace binaries
- shell
- SSH shell gateway
- boot to shell
- user identity and policy
- cryptography and key management
- certificates and TLS
- OIDC and OAuth2
- volume encryption
- cloud metadata
- cloud deployment
- live upgrade
- GPU capability
- formal MAC/MIC
- browser/WASM
- Rejected or superseded:
- rejected Cap’n Proto ring SQE envelope
Each proposal entry should have a one-sentence purpose and a status label.
Research Index
docs/research/index.md is the top-level research index, and the
capability/microkernel survey lives at
docs/research/capability-systems-survey.md with a “Design consequences for
capOS” section near the top. Readers should not need to read every long report
to learn which ideas were accepted.
Each long research report should eventually end with:
## Used By
- Architecture or proposal page that relies on this research.
- Concrete design decision influenced by this report.
Diagrams
Use Mermaid only where it clarifies flow or authority:
- boot flow: firmware, Limine, kernel, manifest, init
- capability ring: SQE submission,
cap_enter, CQE completion - endpoint IPC: client CALL, server RECV, server RETURN
- manifest startup: boot package, init, ProcessSpawner, child caps
Avoid diagrams that duplicate file layout or become stale when a function is renamed. Every diagram should have nearby text that states the same key invariant in prose.
Migration Plan
Phase 1: Skeleton and Reader Path
- Add
book.tomlwithdocsas the source directory and output undertarget/docs-site. - Add
docs/SUMMARY.md. - Add
docs/index.md. - Add
docs/overview.md. - Add
docs/status.md. - Add
docs/build-run-test.md. - Add
docs/repo-map.md.
Acceptance criteria:
mdbook buildsucceeds.- The first section explains what capOS is, how to build it, how to boot it, and where to find the major code areas.
- Existing proposal and research files are reachable through the sidebar.
Phase 2: Current Architecture Pages
- Add the first architecture pages:
- boot flow
- process model
- capability ring
- IPC and endpoints
- userspace runtime
- manifest and service startup
- memory management
- scheduling
- Keep
docs/capability-model.mdas a first-class architecture page.
Acceptance criteria:
- Each architecture page has status, current behavior, invariants, code map, validation, and open work.
- Each page distinguishes implemented behavior from future design.
- At least boot flow, capability ring, IPC, and manifest startup include a concise Mermaid diagram.
Phase 3: Security and Verification Pages
- Add
docs/security/trust-boundaries.md. - Add
docs/security/verification-workflow.md. - Link existing inventories and designs from the security section.
- Make each security page name the relevant validation commands and review documents.
Acceptance criteria:
- A reviewer can find the hostile-input boundaries, trusted inputs, and verification workflow without reading all proposals.
- The security section links to
REVIEW.md,docs/tasks/README.md,docs/trusted-build-inputs.md, anddocs/panic-surface-inventory.md.
Phase 4: Proposal and Research Curation
- Add
docs/proposals/index.md. - Keep proposal and research documents reachable through
SUMMARY.md, but nest them under archive groups so they do not dominate the default sidebar. - Add status labels to proposal files as they are touched.
- Add “Used By” sections to research files incrementally.
Acceptance criteria:
- Proposal status is visible before a reader opens a long document.
- Rejected and future proposals are not confused with implemented behavior.
- Research pages point back to the architecture or proposal pages they influence.
- The default sidebar presents the current manual before backlog, proposal, and research archives.
Maintenance Rules
- When implementation changes a subsystem, update the corresponding architecture page in the same change when the page would otherwise become misleading.
- When a proposal is accepted, rejected, or partially implemented, update its status and the proposal index.
- When
docs/tasks/state.tomlchanges the selected milestone, updatedocs/status.mdonly if the public current-system summary changes. Do not mirror every operational task into the docs site. - When validation commands change, update
docs/build-run-test.mdand the affected architecture page.
Tooling Follow-Up
The content proposal continues to assume mdBook because it matches the repo’s Rust toolchain and plain Markdown corpus. The current tooling baseline is:
book.tomlmake docsmake docs-servemake cloudflare-pages-build- pinned
mdbookandmdbook-mermaiddownloads inMakefile, with version and SHA-256 inputs catalogued indocs/trusted-build-inputs.mdunder the mdBook documentation tools row.make docsandmake cloudflare-pages-buildverify those checksums and the executable versions before rendering the book, andmdbook-mermaidsupplies the pinnedmermaid.min.jsbrowser bundle used by both mdBook HTML rendering and docs-PDF Mermaid rasterization - a small local stylesheet for readability and sidebar spacing
Do not add a frontend package manager, theme framework, or generated site assets unless the content structure proves insufficient. If mdBook becomes too limited after the sidebar, index, metadata, and styling cleanup, the preferred replacement candidate is Astro Starlight because it supports Markdown/MDX, content collections, structured sidebars, built-in docs components, and static Cloudflare Pages output. Docusaurus is better only if versioned public docs, blogging, and a larger external project site become requirements. VitePress is reasonable only if the project wants Vue-oriented customization.
Open Questions
- Should
docs/tasks/README.mdremain outside the book and linked fromstatus.md, or should redacted public summaries be generated later? - Should long proposal files keep their current filenames, or should accepted
designs eventually move from
docs/proposals/intodocs/architecture/? - Should
docs/status.mdbe manually maintained, or generated from a smaller checked-in status data file later? - Should Cap’n Proto schema documentation be generated into the book once the interface surface stabilizes?
- Should proposal and research indexes eventually be generated from structured frontmatter instead of hand-maintained Markdown tables?
Recommended First Commit
The first implementation commit should be deliberately small:
- Add mdBook config.
- Add
SUMMARY.md. - Add the Start Here pages.
- Link existing proposal and research files without rewriting them.
- Verify
mdbook build.
That gives the project a usable docs site quickly, without blocking on a full architecture rewrite.
Proposal: Userspace TCP/IP Networking
How capOS gets from “kernel boots” to “userspace process opens a TCP connection.”
The host-local Telnet flow on 127.0.0.1:2323 described in Part 2 was a
plaintext, loopback-only research demo, not a shippable Telnet service. It
exercised the
TerminalSession/SessionManager/AuthorityBroker/RestrictedShellLauncher
boundary over a real TCP socket on the path toward the SSH Shell Gateway
(see SSH Shell Gateway). That target is now
retired because it depended on the removed qemu-only kernel TCP listener.
Non-loopback exposure, production credential handling, and any treatment of
Telnet as a long-lived service remain out of scope.
Historical trust-boundary debt: Phase A/B kept the smoltcp stack, per-port
TCP listener and accepted-socket capability state, UDP socket cap state, line
discipline byte handler, and Telnet IAC filter inside the kernel. Phase C has
now retired that kernel owner: kernel no longer depends on smoltcp, the
qemu-only TCP/UDP socket entry points fail closed, and the
run-network-client, run-tcp-listen-authority, run-telnet, and
run-posix-dns-smoke fixtures exit with retirement diagnostics. The forward
path is the userspace network stack over DeviceMmio/DMAPool/Interrupt
authority and typed NIC/socket capabilities. New protocol logic belongs in
that Phase C userspace stack.
The Device Driver Foundation now has a bounded provider-consumer proof for one
selected virtio-net TX route: a manifest-granted service can compose
DMAPool, DeviceMmio, and Interrupt authority, validate the selected
bounce-buffer descriptor path, publish a bounded provider-owned queue entry,
ring the selected notify doorbell after policy gates, and consume the matching
used-ring completion through a route-scoped tx_interrupt.wait event. That is
proof coverage for a selected manager-owned route, not Phase C completion. It
does not grant full NIC ownership, arbitrary MMIO doorbells, hardware
ack/mask/unmask ownership, direct DMA, IOMMU programming, broader completion
queue ownership, provider storage/NIC drivers, cloud NIC support, or
production networking readiness.
This document has four parts:
- a historical kernel-internal smoke test that proved virtio-net and smoltcp,
- historical in-kernel capability interfaces for TCP sockets and the Telnet Shell Demo,
- userspace decomposition after driver authority capabilities exist, and
- cross-cutting TLS and open design questions.
Part 1: Kernel-Internal Networking (Phase A)
Prove that capOS can send and receive TCP/IP traffic. Everything runs in-kernel — no IPC, no capability syscalls, no multiple processes needed.
What’s Needed
- PCI enumeration — scan config space, find virtio-net device. Uses the standalone PCI/PCIe subsystem described in Cloud Deployment Phase 4 (~200 lines of glue code on top of the shared PCI infrastructure)
- virtio-net driver — init virtqueues, send/receive raw Ethernet frames.
Use
virtio-driverscrate or implement manually (~600-800 lines) - Timer — PIT or LAPIC timer for
smoltcp’s poll loop (retransmit timeouts,Instant::now()support). Not a full scheduler — just a monotonic clock (~50-100 lines) - smoltcp integration — implement
phy::Devicetrait over the in-kernel driver, create anInterfacewith static IP, ICMP ping, then TCP - QEMU flags — add
-netdev user,id=n0 -device virtio-net-pci,netdev=n0to the Makefile
Current implementation status: PCI enumeration, make run-net, modern virtio
PCI transport capability discovery, feature negotiation, RX/TX split-virtqueue
initialization, descriptor-accounting guard evidence, ARP resolution, and ICMP
echo validation are implemented as lower-layer QEMU fixture evidence. The QEMU
default device currently appears as transitional 1af4:1000 but exposes
standard modern vendor capabilities; capOS accepts it only after finding
bounded MMIO common, notify, ISR, and device-specific config regions. The
kernel negotiates VIRTIO_F_VERSION_1, VIRTIO_NET_F_MRG_RXBUF, and MAC when
safe, allocates kernel-owned DMA pages for the RX/TX queue metadata plus packet
buffers, sets DRIVER_OK, submits device-valid TX descriptors, posts RX
descriptors, resolves the QEMU user-mode gateway 10.0.2.2 with ARP from
static guest address 10.0.2.15, then validates an IPv4 ICMP echo reply from
the gateway, including the reply checksums. The former kernel smoltcp adapter,
TCP HTTP smoke, and scheduler-polled socket runtime are retired; the
make qemu-net-harness path now asserts the lower-layer QEMU fixture evidence
instead of a host-backed kernel TCP proof. Current TCP/UDP socket proof lives in
the Phase C userspace network-stack gates, including
make run-cloud-prod-userspace-network-stack-smoltcp.
Milestones
- Ping: ICMP echo to QEMU gateway (10.0.2.2 with default user-mode
net). Achieved by commit
b56a5c1at2026-04-24 15:37 UTC. - HTTP: TCP connection to a host-side server, send GET, receive
response. Achieved by commit
a4f1722at2026-04-24 16:47 UTC.
Estimated Scope
~1000-1500 lines of new kernel code. ~200 more for TCP on top of ping.
Crate Dependencies
| Crate | Purpose | no_std |
|---|---|---|
smoltcp | TCP/IP stack | yes (features: medium-ethernet, proto-ipv4, socket-tcp) |
virtio-drivers | virtio device abstraction | yes (optional — can implement manually) |
Timer Source Decision
Historical Phase B resolution: the scheduler timer advanced the monotonic
TICK_COUNT (AtomicU64 in kernel/src/arch/x86_64/context.rs), and the
retained kernel smoltcp runtime used that clock instead of a bounded synthetic
10 ms-per-poll clock. Phase C cleanup removed that retained runtime; scheduler
ticks no longer poll kernel smoltcp.
Intermediate Tickless Bridge
The retained smoltcp runtime described below is retired. The bridge rules are archival context for why scheduler-polled kernel networking was not acceptable as a long-term tickless/nohz design. Future socket progress belongs in the userspace stack or an IRQ/deadline-driven device path, not in scheduler polling.
#![allow(unused)]
fn main() {
trait NetworkPollClock {
fn next_poll_deadline_ns(now_ns: u64) -> Option<u64>;
fn poll_until_budget(now_ns: u64, budget_ns: u64) -> PollResult;
}
}
Historical bridge rules:
- a retained smoltcp runtime would have needed to expose
NetworkPollClockbefore active networking could coexist with tickless idle; - the scheduler would have included
next_poll_deadline_nsinearliest_global_deadline(); poll_until_budgetwould have been the only scheduler/idle-exit network progress path;- the budget would have bounded work done outside ordinary process execution;
- absent this bridge, active networking would have forced periodic tick;
- SQPOLL/nohz isolated CPUs would not have run retained network scheduler polling.
QEMU Network Config
| Config | Use case |
|---|---|
-netdev user,id=n0 -device virtio-net-pci,netdev=n0 | Default: NAT, guest reaches host |
-netdev user,id=n0,hostfwd=tcp:127.0.0.1:2323-:23 -device virtio-net-pci,netdev=n0 | Historical host-local TCP forwarding for the retired Telnet Shell Demo |
Part 2: Capability Interfaces — In-Kernel (Phase B)
Phase B turns the Phase A smoke path into first-class TCP capabilities without
moving any code out of the kernel. The NetworkManager, TcpListener, and
TcpSocket objects become kernel-side CapObjects that user processes invoke
through the existing capability ring. The in-kernel smoltcp stack stays where
it is; what changes is that it is reached over capability dispatch instead of
a hard-coded boot-time call. UDP and raw Nic exposure are not part of this
milestone.
Phase B is the first point where a userspace process — the native shell, a boot-package demo, a language runtime — can open a TCP socket. It is also the first point where a visible networking milestone exists at the capability level.
Visible Phase B milestone — Telnet Shell Demo (historical; delivered and later retired with the kernel socket owner). Boot capOS in QEMU with
-netdev user,id=n0,hostfwd=tcp:127.0.0.1:2323-:23 -device virtio-net-pci,netdev=n0.
Init starts a dedicated telnet-gateway service with scoped port-23 listen
authority and restricted shell-launch authority, then gives the child shell
only the exact grants described below.
On accept, the gateway refuses a bounded initial Telnet option negotiation
burst and acts as the terminal host for that connection. It exposes a
socket-backed TerminalSession to capos-shell, not a raw TcpSocket,
ByteStream, or StdIO replacement for the shell’s existing terminal
boundary.
From the host:
$ telnet 127.0.0.1 2323
capos login: <anon>
capos$ help
capos$ exit
Connection closed by foreign host.
The same boot proves the shell does not know or care whether its interactive
terminal is UART, framebuffer, or TCP-backed Telnet — the TerminalSession
provider is interchangeable while the shell-facing authority stays the same.
It also exercises the full TCP listener/accept path, not just the outbound
connect path used by the Phase A HTTP smoke.
telnet (RFC 854) is deliberate demo wiring: plaintext, no crypto, no
authentication of its own. The QEMU target binds the host forward to
127.0.0.1:2323 only and forwards to guest port 23, so the proof is a
host-local development demo rather than a remote-access feature. It is not a
production access path and will be replaced by the SSH gateway described in
SSH Shell Gateway once host-key, user-key,
account, audit, and persistence prerequisites are implementable. The value is
that Telnet is the cheapest forcing function for a server-side TCP capability
and for a socket-backed terminal host. The shell still requires credential
verification through the existing login flow
(Boot to Shell); the Telnet transport
only replaces the physical UART, not the login policy.
Phase B prerequisites
| Prerequisite | State | Why |
|---|---|---|
| Capability syscalls | Stage 4 done (sync) | All Nic/socket access goes through the ring |
| Scheduling + preemption | Stage 5 core done | Socket ops block/wake via the scheduler |
| IPC + capability transfer | Stage 6 3.6 done | Listener hands socket caps to the accepting process |
Timer capability | 7.0.0 done | Historical smoltcp poll clock and socket timeouts; the kernel smoltcp runtime is now retired |
| Scheduler-driven smoltcp poll | retired | The retained smoltcp runtime was polled from scheduler ticks on real TICK_COUNT; Phase C cleanup removed it |
TCP kernel CapObjects | retired | NetworkManager, TcpListener, and TcpSocket previously wrapped the retained smoltcp runtime; qemu-only kernel socket entry points now fail closed |
Socket-backed TerminalSession handoff | retired | TcpSocket.intoTerminalSession previously consumed a connected socket and returned a move-only TerminalSession cap; rebuild this proof on the userspace network stack before using it as validation |
| Shell launch bundle handoff | retired | telnet-gateway previously consumed an accepted TcpSocket into a move-only TerminalSession; the gateway demos are removed and remote-shell coverage lives in the in-guest login smokes (run-login, run-default-web-ui) |
Phase B does not depend on DeviceMmio, Interrupt, or DMAPool — the NIC
driver stays in the kernel. Security Verification Track S.11.2 is a Phase C
prerequisite, not a Phase B one.
Phase B schema (kernel CapObjects)
These interfaces are now defined in the canonical shared schema
(schema/capos.capnp). The current build pipeline watches and generates
bindings for schema/capos.capnp; additional networking schema files remain
unnecessary for Phase B.
interface NetworkManager {
getConfig @0 () -> (addr :Data, netmask :Data, gateway :Data);
createTcpListener @1 (port :UInt16) -> (listenerIndex :UInt16);
connectTcp @2 (addr :Data, port :UInt16) -> (socketIndex :UInt16);
# POSIX adapter Phase P1.2 Phase A: bind a UDP socket; the created
# cap is delivered as a transferred result cap.
createUdpSocket @3 (localAddr :Data, localPort :UInt16) -> (socketIndex :UInt16);
}
interface TcpListener {
accept @0 () -> (socketIndex :UInt16, peerAddr :Data, peerPort :UInt16);
close @1 () -> ();
}
interface TcpSocket {
send @0 (data :Data) -> (bytesSent :UInt32);
recv @1 (maxLen :UInt32) -> (data :Data);
close @2 () -> ();
intoTerminalSession @3 () -> (terminalIndex :UInt16); # retired; fails closed
}
interface UdpSocket {
sendTo @0 (addr :Data, port :UInt16, data :Data) -> (bytesSent :UInt32);
recvFrom @1 (maxLen :UInt32) -> (addr :Data, port :UInt16, data :Data);
close @2 () -> ();
}
Nic stays a separate lower-layer cap (schema shown below) and remains
kernel-internal in Phase B. UdpSocket landed for the POSIX adapter Phase
P1.2 Phase A DNS path: the kernel implements it on top of the same retained
smoltcp runtime, and userspace acquires it through NetworkManager.createUdpSocket.
It is not part of the Telnet Shell Demo contract.
The ring transport cannot return direct Cap’n Proto capability fields, so
capability-producing methods return result-cap indices in the serialized result
and append CapTransferResult records after the message bytes. Runtime clients
adopt those result caps by index.
accept and recv are blocking capability calls for the Phase B demo: they
complete when a connection or received bytes are available, when the socket is
closed, or when the caller’s cap_enter timeout/cancellation path fires.
recv(maxLen) clamps to the kernel/ring result-buffer limits, and send may
return a partial byte count. A readiness/poll interface can be added later
without being required for the first remote shell proof.
Telnet gateway launch contract
This contract is historical: the telnet-gateway demo is removed with the
kernel socket owner and the kernel SocketTerminalSession. It is retained as
the authority-model reference for any future userspace terminal host.
telnet-gateway was the terminal host for the remote connection. Its minimum
authority was:
- Manifest-forwarded
TcpListenAuthoritybadge 23, held by init and forwarded to the gateway as the only listener-creation authority for the demo path. - Manifest-forwarded
RestrictedShellLauncher, held by init and forwarded to the gateway as the only shell process launch authority. - Pass-through grants for the caps the current shell requires at startup:
creds,sessions,audit,broker, andsystem_info. - An anonymous
UserSessionminted throughSessionManagerand checked throughAuthorityBroker.shellBundle("anonymous")before launch. The shell still performs password login insidecapos-shelland upgrades the session after credential verification. - A way to provide the child shell a cap named
terminalwhose interface id isTerminalSession, backed by the accepted TCP socket.
The gateway must not grant the child raw NetworkManager, TcpListener,
TcpListenAuthority, TcpSocket, broad ProcessSpawner, or
RestrictedShellLauncher authority. The retired implementation used the
kernel socket wrapper (TcpSocket.intoTerminalSession, now failing closed) to
produce an actual TerminalSession CapObject; the shell-facing contract
stays TerminalSession for any future userspace terminal host.
Phase B exit criteria
schema/capos.capnpdefined the TCP types above; kernel implemented them asCapObjects on top of the existing smoltcp interface. Initial implementation landed in commit7446e04at2026-04-25 14:48 UTC; review follow-up added timer-safe deferred completion cleanup andmake qemu-network-client-harnessuserspace coverage for outbound sockets and listener accept. This is historical Phase B evidence; qemu-only kernel socket entry points now fail closed.- smoltcp polling was driven from the scheduler, not a synthetic clock, so sockets could survive longer than a single early-boot burst. That runtime is retired.
- A trusted
telnet-gatewayboot service usedTcpListener/TcpSocket, refused the bounded initial Telnet negotiation needed by normal host clients, and launchedcapos-shellfor the accepted connection with a socket-backedTerminalSessionplus the shell’s existing login/session caps. The child shell did not receive raw network, TCP listener/socket, broad spawn, scoped-listener, or restricted-shell-launcher authority. This target is retired. - A dedicated CUE manifest (
system-telnet.cue) and amake run-telnettarget historically booted the above and ran a scripted host-side smoke that completed a login + one command + clean exit overtelnet 127.0.0.1 2323.make run-telnetnow exits with a retirement diagnostic.
Part 3: Userspace Decomposition (Phase C)
Phase C moves the NIC driver and the TCP/IP stack out of the kernel into
separate userspace processes, so the kernel is left with only
DeviceMmio / Interrupt / DMAPool dispatch and the cap-ring transport.
Phase B must be complete first — Phase C is about relocating the code that
Phase B already wrapped in capabilities, not about adding new interfaces at
the socket layer.
Sequencing relative to the cloud usable-instance milestone. The Network-Reachable Datapath Scope Decision (2026-06-02) records that the real-GCE-boot milestone’s “reachable network stack” requirement means raw-frame TX/RX over the live NIC (the polled production provider), which the billable cloudboot gate already checks. The L4 socket reachability that Phase C delivers is therefore a separate future track sequenced after that milestone, not a milestone blocker.
IPv6 Support Status And Task Lane
Current capOS L4 socket behavior has one production forward path: the Phase C
userspace service-object stack. The old qemu-only retained smoltcp runtime that
configured 10.0.2.15/24, installed a default IPv4 route through 10.0.2.2,
resolved the gateway with ARP, and proved outbound ICMPv4 plus TCP HTTP is
retired. Non-qemu production manifests no longer grant the legacy
kernel-owned socket caps; requests for kernel network_manager or
tcp_listen_authority fail at bootstrap instead of falling through to
virtio_stub.rs, and qemu-only kernel TCP/UDP socket entry points fail closed.
The userspace IPv6 lane now has local link-local / Neighbor Discovery, Router
Advertisement / SLAAC, GCE-style DHCPv6 address configuration, ICMPv6 Echo
Reply, and IPv6 TCP listener/connect proofs.
The socket-address ABI is now explicit about address family rather than
overloading a raw four-byte assumption. schema/capos.capnp defines
IpAddressFamily (unspecified / ipv4 / ipv6) and documents a length
contract on every address Data field: empty is unspecified (only where the
method allows it), 4 bytes is ipv4, and 16 bytes is ipv6. getConfig
reports the configured addressFamily and an ipv6Supported flag, so an
all-zero IPv4 config is never misread as an IPv6 state.
kernel/src/cap/network.rs decodes addresses through a family-typed
read_ip_address, accepts IPv4 on the legacy stack, and fails closed on IPv6
there with a distinct ipv6Unsupported-class error and on any other length
with a malformedAddress class – so legacy IPv4-only callers reject IPv6
explicitly instead of treating every non-four-byte value as a generic error.
capos-rt surfaces the family and IPv6-support flag on NetworkConfig. The
wire format stays source-compatible for existing 4-byte IPv4 callers. The
behavior behind the userspace-service ABI now has bounded local IPv6 routing,
diagnostics, and TCP L4 proofs; private GCE reachability and public IPv6
ingress remain unproved.
The pinned userspace smoltcp dependency is version 0.13.0 in the networking
demo crates, not in kernel/Cargo.toml. capOS enables only the features each
userspace proof needs. The crate has IPv6, SLAAC, and ICMP socket features
available, and it does not provide a socket-dhcpv6 feature matching its
DHCPv4 socket. With the address-family ABI landed, remaining IPv6 work is
explicit userspace stack behavior and GCE reachability rather than kernel
feature enablement.
The protocol gap is larger than “turn on IPv6”: with the local link-local/Neighbor Discovery, Router Advertisement / SLAAC, GCE-style DHCPv6, ICMPv6 Echo Reply, and IPv6 TCP listener/connect proofs done, capOS still has no private GCE IPv6 reachability proof or GCE IPv6 firewall proof. The standards and cloud grounding are:
- RFC 4861: Neighbor Discovery, Router Solicitation/Advertisement, address resolution, and router defaults.
- RFC 4862: stateless address autoconfiguration, link-local address generation, and Duplicate Address Detection.
- RFC 4443: ICMPv6 including Echo Request / Echo Reply behavior.
- RFC 8415: DHCPv6 client and server exchanges on UDP 546/547.
- Compute Engine IPv6 configuration:
dual-stack or IPv6-only subnet requirement, one
/96per interface, first/128configured by DHCPv6 from the metadata server, default route via route advertisement, and link-local addresses used for Neighbor Discovery. - Google Cloud VPC firewall rules: IPv6 rules are supported, each firewall rule uses either IPv4 or IPv6 ranges, and IPv6 ingress needs an explicit allow rule before public access is reachable.
The resulting task lane is linked from
Hardware, Boot, and Storage.
The
cloud-prod-ipv6-architecture-status-grounding
scope decision is done (2026-06-03), and the address-family ABI entry point
cloud-prod-network-address-abi-ipv6
is done (2026-06-03) as historical qemu-only kernel socket evidence. That
target is now retired after kernel socket-owner removal; current
address-family/socket behavior is covered by the Phase C userspace IPv4 and
IPv6 gates below.
The local link-local/Neighbor Discovery proof
cloud-prod-ipv6-link-local-nd-local-proof
is done (2026-06-08), proved by make run-cloud-prod-ipv6-link-local-nd.
The local Router Advertisement / SLAAC proof
cloud-prod-ipv6-ra-slaac-local-proof
is done (2026-06-08), proved by make run-cloud-prod-ipv6-ra-slaac.
The local GCE-style DHCPv6 address configuration proof
cloud-prod-ipv6-dhcpv6-gce-config-local-proof
is done (2026-06-08), proved by
make run-cloud-prod-ipv6-dhcpv6-gce-config.
The local ICMPv6 Echo Reply proof
cloud-prod-icmpv6-echo-reply-local-proof
is done (2026-06-08), proved by make run-cloud-prod-icmpv6-echo-reply.
The local IPv6 TCP L4 proof
cloud-prod-ipv6-tcp-l4-local-proof
is done (2026-06-08), proved by make run-cloud-prod-ipv6-tcp-l4.
The lane then sequences private GCE IPv6 and public IPv6 ingress/TLS policy
tasks on top of that userspace-stack substrate.
IPv6 does not block the first public GCE Web UI proof while that proof remains scoped to IPv4 DHCP, ARP, Phase C L4, private GCE reachability, and reviewed public HTTPS ingress. It becomes relevant for a later dual-stack or IPv6-only cloud proof and for public IPv6 ingress policy.
Network Usability, Resolver, And Post-smoltcp Lane
The network usability backlog is
Network Usability and Post-smoltcp.
It records the user-facing work that starts after raw frames and the first
userspace L4 proof: operator status tooling, DHCPv4 lease lifecycle, a typed
system DnsResolver cap, POSIX getaddrinfo bridging, ping/ping6 diagnostics,
socket readiness/cancel/backpressure semantics, packet trace authority, and
transport policy/status.
Current boundaries are explicit there: the first local DHCP/IPv4 configuration
proof is now done by
cloud-prod-network-stack-dhcp-ipv4-config-local-proof
and is on the first GCE Web UI critical path, while DHCP renewal/rebind/expiry,
DNS option publication, and operator-visible lease status remain follow-up
work. The local bounded ICMPv4 Echo Reply proof is also done by
cloud-prod-icmp-echo-reply-local-proof,
proved by make run-cloud-prod-icmp-echo-reply; it answers a bounded local
same-subnet ping and rejects malformed or oversized requests, but it exercises
ICMP protocol logic over an in-process QueuePhyDevice, not the real bound
NIC. The real-NIC inbound path is now also done by
cloud-prod-icmp-echo-reply-real-nic-datapath-local-proof,
proved by make run-cloud-prod-icmp-echo-reply-real-nic-datapath: a kernel-owned
responder on the legacy virtio 0.9 datapath acquires a DHCP lease over the real
NIC, then receives an inbound Echo Request over the real RX vring and transmits
an RFC 792 Echo Reply over the same NIC’s TX vring (a host peer over a QEMU
socket netdev drives the inbound stimulus, since SLIRP drops inbound
host->guest ICMP Echo). Both remain diagnostics rather than Web UI readiness;
the real-NIC proof is the local pre-spend prerequisite for the billable private
GCE ICMP proof and the same responder serves that live run. The POSIX DNS smoke is a hand-rolled
A-query over UdpSocket, not a system resolver service or typed resolver
capability. DNS, operator ping tools, IPv6, packet tracing, and advanced
transport policy are usability/completeness lanes, not first public Web UI
blockers unless a later deployment policy explicitly promotes one.
The backlog keeps smoltcp relocation (Phase C slices 7a-7c: run the selected
smoltcp build in userspace, preserve the socket contract) distinct from
transport policy/status (the capOS control plane around it). The selected
userspace stack is smoltcp 0.13.0 and now has bounded local UDP socket-cap,
TCP listener/socket-cap, sustained receive, and serve-from-userspace production
socket-cap proofs. DHCPv4, DHCPv6, IPv6 L4, and ICMPv6 are explicit protocol
proof lanes rather than ambient production readiness claims; retained qemu-only
fixtures remain separate from the production cloudboot path. The done IPv6
protocol proofs (cloud-prod-ipv6-dhcpv6-gce-config, cloud-prod-ipv6-tcp-l4)
build their smoltcp interface on an in-process HarnessPhyDevice and self-declare
metadata_only=true; the IPv6 datapath over the real bound NIC is now done by
cloud-prod-ipv6-real-nic-datapath-local-proof,
proved by make run-cloud-prod-ipv6-real-nic-datapath: a userspace smoltcp service
on a real-Nic-backed phy (the IPv4 DHCP datapath NicPhyDevice pattern) learns
the default route from a Router Advertisement, configures the GCE-shaped /128
via DHCPv6 Solicit/Advertise/Request/Reply, and completes one ICMPv6 Echo probe –
every frame over Nic.transmit/Nic.receivePoll against a host peer on a QEMU
socket netdev (SLIRP has no stateful DHCPv6 server). That proof records the
real-NIC provenance with no metadata_only/in-process disclaimer and is the local
pre-spend prerequisite for the billable private GCE IPv6 reachability proof. No current capOS
build enables socket-tcp-reno/socket-tcp-cubic, so capOS runs with
CongestionControl::None by build configuration, not as a reviewed policy
choice. The
network-transport-policy-status-decomposition
task records that audit and decomposes read-only transport status, keepalive/
timeout policy inputs, and a deferred congestion-control evaluation gated on
workload evidence.
Architecture
+--------------------------------------------------+
| Application Process |
| holds: TcpSocket cap, UdpSocket cap, ... |
| calls: connect(), send(), recv() via capnp |
+---------------------------+----------------------+
| IPC (capnp messages)
+---------------------------v----------------------+
| Network Stack Process (userspace) |
| smoltcp TCP/IP stack |
| holds: NIC cap (from driver), Timer cap |
| implements: TcpSocket, UdpSocket, Dns caps |
+---------------------------+----------------------+
| IPC (capnp messages)
+---------------------------v----------------------+
| NIC Driver Process (userspace) |
| virtio-net driver |
| holds: DeviceMmio cap, Interrupt cap, DMAPool |
| implements: Nic cap |
+---------------------------+----------------------+
| capability syscalls
+---------------------------v----------------------+
| Kernel |
| DeviceMmio cap: maps BAR into driver process |
| Interrupt cap: routes virtio IRQ to driver |
| DMAPool cap: DMA-eligible frames w/o raw PAs |
| Timer cap: provides monotonic clock |
+--------------------------------------------------+
Three separate processes, each with minimal authority:
- NIC driver — only has access to the specific virtio-net device
registers, its interrupt line, and DMA-eligible frames. Implements the
Nicinterface. - Network stack — holds the
Niccapability from the driver. Runs smoltcp. Implements higher-level socket interfaces. - Application — holds socket capabilities from the network stack. Cannot touch the NIC or raw packets directly.
Phase C prerequisites (beyond Phase B)
| Prerequisite | Owning gate | Why |
|---|---|---|
Interrupt capability | DDF Task 5 + S.11.2 driver-transition gate | NIC driver receives IRQs without ambient authority |
DeviceMmio capability | DDF Task 5 + S.11.2 driver-transition gate | NIC driver accesses device registers under bounded ownership |
DMAPool capability | DDF Task 5 + S.11.1 invariants + S.11.2 gate | DMA-eligible frames without raw physical grants |
| Provider NIC smoke | DDF Task 6 | First end-to-end provider-driver path through reviewed userspace authority instead of the in-kernel ledger |
See DMA Isolation for the concrete invariants the three capabilities must satisfy and the Security Verification Track S.11.2 gate that unblocks moving the NIC driver out of the kernel. DDF Task 5 expands those invariants into a reviewable cap-table and ProcessSpawner manifest surface; DDF Task 6 is the first provider NIC smoke that consumes them end-to-end.
Current Phase C evidence includes the userspace virtio-net driver slices through
the clean independent Nic.transmit/Nic.receive split, the 7a local userspace
smoltcp substrate over that Nic cap, the 7b userspace UDP socket-cap layer,
the 7c-i inter-process UdpSocket proof, the 7c-ii(a) inter-process
TcpListener/TcpSocket proof, the sustained-receive TCP substrate, the
7c-ii(b) local serve-from-userspace production socket-cap proof, and retirement
of the non-qemu legacy kernel socket grant path. The 7c-ii(b) proof starts
the userspace network-stack process as the non-qemu cloudboot init process,
spawns an application client with only Console plus a userspace-served
TcpListenAuthority, and completes one local hostfwd TCP request/response
through served TcpListener/TcpSocket caps. It is still narrower than the
exit criteria below: the proof process keeps the existing
DeviceMmio/DMAPool/Interrupt bring-up caps in-process until the future
driver-service split, the long-lived service shape is still future work, and the
selected GCE Web UI milestone now consumes the done DHCP/IPv4 configuration
proof while still needing the local remote-session Web UI L4 proof, private GCE
reachability, and the tracked Web UI hardening gates. The legacy kernel
cap/network.rs / virtio_stub.rs socket
route is fixture/negative-path cleanup territory, not the architecture to
extend.
Phase C exit criteria
- NIC driver runs in its own userspace process, holding only
DeviceMmio,Interrupt, andDMAPoolcaps. - Network stack runs in a second userspace process, holding only the
Niccap from the driver and aTimercap. - A successor socket-backed terminal or Web UI proof is rebuilt on the userspace network stack; the Phase B Telnet fixture is retired after kernel socket-owner removal.
- The kernel contains no
smoltcpdependency and no virtio-net code on the hot path.
Lower-layer capability schema (drafts — used by Phase C)
Phase B does not expose these to userspace; Phase C does. Timer is already
implemented (see schema/capos.capnp).
Phase C track opened (2026-06-02). The Phase C Userspace NIC Driver Relocation design adopts this inline-
Dataframe ABI as-is (aDmaBuffer-handle zero-copy variant was considered and rejected to keep the change small; the frame stays in a kernel-owned bounce buffer the polled provider already proved). The methods carry the capOSresult/reason/sideEffectevidence triple, andreceivealso reports the observed EtherType. See that doc for the cap-surface gap (no pending security ruling – the writable common-config window extends the accepted notify-doorbell selected-write discipline) and the bounded slice chain.Slice 1 landed (2026-06-02). The unimplemented
Nicinterface below is now inschema/capos.capnpso the later coupled-TX/RX slices (3-4) extend it rather than introduce it; noCapObjectimplements it yet. Slice 1 (cloud-prod-nic-driver-userspace-features-ok-local-proof) also relocated the virtio device handshake to FEATURES_OK into a userspace driver shim over a writable selected-write common-configDeviceMmiowindow (the four handshake registers admitted onDeviceMmio.write32, queue-address writes fail closed); proofmake run-cloud-prod-nic-driver-userspace-features-ok.
The landed Nic schema (inline Data + the capOS evidence triple):
interface Nic {
transmit @0 (frame :Data)
-> (result :Text, reason :Text, sideEffect :Text);
receive @1 ()
-> (frame :Data, observedEthertype :UInt16,
result :Text, reason :Text, sideEffect :Text);
macAddress @2 () -> (addr :Data, result :Text, reason :Text, sideEffect :Text);
linkStatus @3 () -> (up :Bool, result :Text, reason :Text, sideEffect :Text);
}
The driver relocation reuses the production DeviceMmio cap (a read-only BAR
window with selected writes) and Interrupt cap (schema/capos.capnp) rather
than the simplified map/wait sketches earlier drafts of this section used.
Part 4: Cross-cutting
Userspace language runtimes that need sockets
Userspace language runtimes that map their stdlib socket APIs onto capOS
capabilities consume the same TcpSocket/UdpSocket surface this proposal
defines, so the Phase A-B kernel-resident state above is what their socket
imports currently fail closed against:
- The POSIX adapter (
libcapos-posix/) already mapssocket(AF_INET, SOCK_DGRAM, 0)/sendto/recvfrom/closeonto the Phase BUdpSocketcap for the Phase P1.2 Phase B DNS resolver smoke; see Userspace Binaries and POSIX Adapter. - WASI Preview 1
sock_send/sock_recvroute through the WASI host adapter on top of the same caps. Phase W.6 (sockets) remains blocked on socket authority surfacing through the wasm-host CapSet; the W.2ERRNO_NOSYSrefusal harness in Language Support Status and Plans (WASI / WebAssembly row) is the current evidence that no socket authority leaks before that gate.
Neither track changes the trust-boundary debt: socket-using userspace runtimes still depend on the kernel-resident smoltcp stack until Phase C relocates it.
TLS Layering
TLS does not live in this proposal: the TcpSocket here is the
bottom of the transport stack; a TlsSocket wraps it and is
configured from the certificate, trust-store, OCSP, and verifier caps
defined in
Certificates and TLS.
Keys consumed by TLS come from
Cryptography and Key Management.
Draft shape (tracked in the certificates proposal):
interface TlsSocket {
# Client handshake: wrap an outbound TCP socket with a client config.
connect @0 (tcp :TcpSocket, config :TlsClientConfig) -> ();
# Server handshake: accept on a TCP socket with a server config.
accept @1 (tcp :TcpSocket, config :TlsServerConfig) -> ();
send @2 (data :Data) -> (bytesSent :UInt32);
recv @3 (maxLen :UInt32) -> (data :Data);
close @4 () -> ();
peerCertificate @5 () -> (chain :CertificateChain);
alpnSelected @6 () -> (protocol :Text);
}
Open Questions
- DMA memory management. Dedicated
DmaAllocatorcapability vs extendingFrameAllocatorwithallocDma? - Socket readiness model. Phase B uses blocking
accept/recvcalls for the demo. The long-term interface still needs a readiness/poll or cancellation shape for multiplexed services. - Buffer ownership. Copy into IPC message vs shared memory vs capability lending?
References
Crates
- smoltcp —
no_stdTCP/IP stack - virtio-drivers —
no_stdvirtio drivers (rCore project)
Specs
- virtio 1.2 spec — Section 5.1 covers network device
- OSDev Wiki: PCI, Virtio
Prior Art
- rCore — virtio-drivers + smoltcp
- Redox smolnetd — microkernel userspace net stack
- Fuchsia Netstack3 — capability-oriented, userspace, Rust
- Hermit — unikernel with smoltcp + virtio-net
QEMU
Scope Decision: Real-GCE “Reachable Network Stack” – Raw-Frame TX/RX vs L4 Sockets
Decision
Option A. For the second cloud milestone (“usable cloud instance”,
docs/backlog/hardware-boot-storage.md), the network data-path reachability
bar – “a reachable network data path” / “reachable network stack” – means
raw-frame (ethernet) TX/RX reachability over the live GCE NIC: the
production polled userspace virtio-net provider exchanging frames over the real
function is the reachability proof. Slices 1-4 of the GCE polling-path track
plus the slice-6 billable boot close that data-path reachability bar.
L4 sockets (TCP/UDP reachable from a userspace application) are a separate future track – networking-proposal Phase C – and are explicitly not a real-GCE-boot data-path blocker. This decision does not start that track; it records that the track exists, is sequenced after the milestone, and is gated by its own Phase C prerequisites rather than by the cloud usable-instance data-path bar.
Scope boundary: data-path reachability vs L4 terminal access
The milestone bullet (docs/backlog/hardware-boot-storage.md, “Second cloud
milestone: usable cloud instance”) states two network requirements, not one: add
network drivers and “prove SSH/WebShell or other network terminal access
over the cloud NIC.” SSH and WebShell are inherently L4 (TCP) – a raw frame
cannot carry an SSH session. Option A therefore disambiguates only the first
requirement (the network data path / “reachable network stack”, which is also
what the billable gate checks). It does not claim that raw frames satisfy
the SSH/WebShell terminal-access requirement. L4 network terminal access
(SSH/WebShell) is deferred to Phase C and is tracked there; the operator
access path demonstrated today is the serial-console shell (cloudboot
access-path serial-console-shell marker), not a network terminal. Option A is
thus a deliberate re-scoping of the milestone’s network-reachability gate down
to the raw-frame data path, with L4 terminal access sequenced after the
milestone – not a claim that the milestone delivers SSH/WebShell.
Rationale
The decisive principle for the data-path bar: the milestone’s automatically
gated network proof is whatever the billable harness actually checks. The
billable gate is make cloudboot-test (tools/cloudboot/run-test.sh). Reading
that harness directly settles the ambiguity in the “reachable network stack”
phrasing in one observation – it never checks an L4 socket round-trip. (The
milestone’s separate SSH/WebShell terminal-access requirement is not
harness-gated today and is handled under “Scope boundary” above: deferred to
Phase C.)
What the cloudboot harness actually gates on
run-test.sh has exactly two success gates over kernel network behavior, and
both are below the L4 layer:
- Boot landmark.
run-test.sh:BOOT_LANDMARKis the literal stringcapos kernel starting;main’s step 5 polls the serial port until that landmark appears (run-test.sh:main, thegrep -q "${BOOT_LANDMARK}"poll loop). No TCP, no UDP, no handshake. - Provider-NIC proof (optional, raw-frame). Under
--require-provider-nic-proof(run-test.sh:REQUIRE_NIC_PROOF), the run fails unless the serial output contains therun-test.sh:NIC_PROOF_MARKERline (cloudboot-evidence: provider-nic-bound <token>). The gate is pure marker presence (serial_marker_tokens "${NIC_PROOF_MARKER}"non-empty); it parses no socket state and performs no connect/send/recv against the instance.
The provider-nic-bound marker is, by its own documented contract
(tools/cloudboot/README.md, “Serial evidence-marker contract”), a
raw-frame bind proof: the non-qemu kernel composes the DeviceMmio +
DMAPool/DMABuffer + MSI-X Interrupt grant proofs over one virtio function,
programs the MSI-X table entry, and tears down with stale-handle assertions. It
explicitly does NOT write any virtio common-config register, does NOT activate
the device, and emits a summary line recording
device_autonomous_raise=not-attempted. There is no IP address, no socket, and
no L4 protocol anywhere in the marker contract. The harness’s structured
provider.json schema (tools/cloudboot/README.md, “provider.json schema”)
likewise has no TCP/UDP/socket/L4 field – the network-facing fields are
provider_nic_proof, enumerated_device_classes,
enumerated_device_inventory, dma_pool_grant, interrupt_route_allocated,
interrupt_route_delivered, and storage_bind_proof, all device/frame-level.
Choosing Option B would mean adopting a milestone acceptance bar (an L4 socket round-trip) that the billable gate does not enforce, and blocking the milestone on a large Phase C chain that the milestone’s own proof substrate never exercises. That is not an honest reading of the gate.
What the production polled path can and cannot reach today
Can reach (raw frame): kernel/src/cap/virtio_net_polled_provider.rs is the
always-built (non-qemu) production provider. It exercises raw-frame DMABuffer
movement over the live virtio function: the provider submits the brokered RX
receive buffer and observes its completion by polling the used ring
(InterruptCapVirtioNetPolledProvider::invoke_wait reads the latched
PublishedRx used.idx/used[0] captured in attempt_rx_submit), with zero
interrupts – no device_interrupt::wait_kernel_injected_dispatch, no
inject_real_lapic_int_for_proof on the wait/ack path. The TX leg is a
kernel-half SLIRP stimulus (a manager-owned broadcast-ARP frame authored on
queue 1 to elicit the inbound reply, attempt_rx_submit “Stimulus” step), not a
provider-submitted frame. One real device->host RX DMA of used_len=76 (an
ethernet frame, ethertype 0x0806 ARP) has been observed this way. This is the
ethernet-frame level: frames traverse the live function in both directions, with
the provider owning the RX receive path.
Cannot reach (L4): there is no TCP/UDP socket layer in the production data
path. The entire L4 surface is cfg(feature = "qemu")-gated and replaced in
the cloud kernel by kernel/src/virtio_stub.rs, whose socket entry points all
fail closed:
virtio_stub.rs:create_tcp_listener->NetworkError::DeviceUnavailablevirtio_stub.rs:connect_tcp_ipv4->NetworkError::DeviceUnavailablevirtio_stub.rs:create_udp_socket->NetworkError::DeviceUnavailablevirtio_stub.rs:send_tcp/recv_tcp->NetworkError::InvalidSocketvirtio_stub.rs:accept_tcp->NetworkError::InvalidListenervirtio_stub.rs:network_config-> all-zeroaddr/netmask/gatewayvirtio_stub.rs:poll_scheduler-> no-op
The cap/network.rs TCP/UDP socket CapObject family
(TcpListener/TcpSocket/UdpSocket, deferred accept/recv waiters, the
socket-terminal handoff) is wired to crate::virtio::poll_scheduler – i.e. to
the stub in production – so in the cloud kernel a userspace caller holding a
socket cap gets DeviceUnavailable/InvalidSocket, not a connection. The
in-kernel smoltcp stack, TCP listeners, accepted-socket state, the cooked-mode
line discipline, and the Telnet IAC filter live only in the cfg(qemu)
kernel/src/virtio.rs build.
Why Option B is genuinely a separate, larger track
Option B is networking-proposal Part 3: Userspace Decomposition (Phase C):
relocating smoltcp and the cap/network.rs socket caps out of the cfg(qemu)
kernel/src/virtio.rs into a userspace NIC-driver process (holding
DeviceMmio/Interrupt/DMAPool) and a userspace network-stack process
(holding the Nic cap + Timer), with applications holding socket caps. Its
declared exit criterion is “the kernel contains no smoltcp dependency and no
virtio-net code on the hot path.” Its prerequisite table (networking-proposal
“Phase C prerequisites”) requires production grantable
DMAPool/DeviceMmio/Interrupt lifecycles, real provider-driver
interrupt wait/ack/mask/unmask consumption, durable audit consumption, an IOMMU
domain or explicit production bounce-buffer policy, and full driver ownership
handoff – and the proposal itself states current DDF evidence is “narrower than
these Phase C prerequisites.” This is a multi-slice chain, not a finishing touch
on the milestone.
Sequencing it after the milestone is also consistent with the GCE polling-path decision already recorded in the backlog (2026-06-01): the production data path is polled, device-autonomous MSI-X is a parallel efficiency follow-up, and the milestone is deliberately decoupled from interrupt delivery. Raw-frame reachability is the layer that decision already commits to; L4 sits above it.
Consequence
- Slices 1-4 (the real polled provider, its default-manifest graduation, the
real
provider-nic-boundsource, and the polled-provider stale-authority teardown) plus slice 6 (the billablemake cloudboot-test --require-provider-nic-proofboot) close the usable-cloud-instance milestone’s network data-path reachability bar – the requirement the billable gate actually checks. - The milestone’s separate SSH/WebShell / network terminal access requirement is not closed by these slices; it is L4 and is deferred to Phase C as future work. The access path demonstrated on the current cloud kernel is the serial-console shell, not a network terminal.
- L4 sockets remain future work under networking-proposal Phase C, gated by
the Phase C prerequisites, not by the data-path bar. No child task chain is
created by this decision; Phase C is tracked where it already lives (the
networking proposal and the DDF Task 5/6 prerequisites in
docs/backlog/hardware-boot-storage.md).
2026-06-08 Follow-Up: Phase C Web UI Chain
The later Phase C serve-from-userspace proof does not reopen the 2026-06-02 raw-frame-vs-L4 decision above. That decision remains the historical scope record for the closed usable-cloud-instance raw-frame data-path bar. The selected milestone has since moved to GCE Self-Hosted Web UI, whose proof chain owns L4 and Web UI reachability through separate task records.
The relevant Phase C design home is
Phase C Userspace NIC Driver Relocation.
Its local 7c proof is now landed in
cloud-prod-userspace-network-stack-smoltcp-local-proof:
the non-qemu cloudboot manifest starts the userspace smoltcp network-stack
process, serves a scoped TcpListenAuthority, and completes one local
host-forwarded TCP request/response through served TcpListener/TcpSocket
caps. That is local cloudboot L4 evidence, not private GCE reachability and not
public operator ingress.
The current Web UI ladder is task-owned:
cloud-prod-network-stack-dhcp-ipv4-config-local-proofis done and owns the local DHCP IPv4 configuration, default route, and ARP/neighbor proof for the Phase C userspace stack.cloud-prod-remote-session-web-ui-l4-local-proofowns the local cloudboot proof thatremote-session-web-uilistens through the Phase C L4 path after the done DHCP/IPv4 configuration proof.cloud-gce-private-self-hosted-webui-proofowns the private GCE Web UI proof over the live NIC and remains gated on the local Web UI L4 path plus Web UI hardening tasks: server-side session hardening is done (remote-session-web-ui-session-hardening), and connection bounds are done (remote-session-web-ui-connection-bounds: per-connection request-read/response-send deadlines in the Web UI client over the bounded network-stack listener).cloud-gce-public-self-hosted-webui-ingress-tlsis the separate public ingress/TLS step and remains on hold pending private GCE proof and explicit public-exposure authorization.
This follow-up changes documentation scope only. It does not change any remaining task status, selected milestone, cloud resource posture, public ingress authority, TLS custody, or production release authority.
Inputs weighed
tools/cloudboot/run-test.sh(BOOT_LANDMARK,NIC_PROOF_MARKER,REQUIRE_NIC_PROOF,main,PROVIDER_JSON_REQUIRED_KEYS) andtools/cloudboot/README.md(“Serial evidence-marker contract”, “provider.jsonschema”, “Gate semantics”) – the billable gate, and the single most decisive input.kernel/src/virtio_stub.rs– the production L4 surface (all socket entry points fail closed).kernel/src/cap/network.rs– the L4 socketCapObjectcontract, wired to the stubbedpoll_schedulerin production.kernel/src/cap/virtio_net_polled_provider.rs– the always-built raw-frame polled provider (real device->host RX DMAused_len=76, zero interrupts).docs/proposals/networking-proposal.md, Part 3 (Phase C architecture, prerequisites, exit criteria) – the scope of Option B.docs/backlog/hardware-boot-storage.md, “Cloud Device Tracks – Real GCE Polling Path (decoupled from MSI-X)” – the track this decision is slice 5 of.
Phase C: Userspace virtio-net Driver Relocation
This is the L4 track opened by the
Network-Reachable Datapath Scope Decision
(Option A): raw-frame TX/RX reachability is the cloud milestone bar, and the L4
socket path – relocating smoltcp and the cap/network.rs socket caps out of
the cfg(qemu) kernel/src/virtio.rs into userspace processes
(networking-proposal Part 3, Phase C) – is a separate future track. This doc
designs that track and sequences its slices.
The first Phase C Web UI path remains IPv4-scoped: userspace L4 plus DHCP/IPv4 configuration, ARP, and the private/public GCE Web UI proofs. IPv6 is tracked as a separate network-stack capability lane in Networking and the hardware/cloud backlog. Phase C must preserve enough address-family shape for that lane, but lack of IPv6 does not block the first IPv4 GCE Web UI proof.
Cap-Surface Delta: What The Userspace Driver Needs
The current DeviceMmio / DMAPool / Interrupt cap surface does not yet host
the virtio-net driver in userspace as built, but the missing pieces are bounded
extensions of accepted patterns and a reuse of the landed production
DMA-isolation track – not new isolation built from scratch. Per-primitive
evidence:
DeviceMmiogives a read-only BAR window with one selected write today.DeviceMmio.mapreturns a read-only BAR page; rawwrite32is refused (register_write = "blocked"), with exactly one selected write permitted – the notify doorbell at@5(notify_doorbell,kernel/src/cap/device_mmio.rs). A driver must additionally write the virtio common-config window (device status, feature-select/feature, queue-select, queue-size, queue-address/queue-enable). The relocation adds these as further selected writes under the same accepted range-check + read-back discipline the notify doorbell already enforces (see “The Common-Config Window” below) – not a new write primitive.DMAPooldoes not yet export a device-usable address to this driver, but the export discipline is landed.DMAPoolgives one bounce page; the host-physical / device address is not exported in the bounce posture (host_physical_user_visible = false,direct_dma = "blocked",iova_export = "disabled-future-only",kernel/src/cap/dma_buffer.rs). The vring is kernel-owned today (kernel/src/cap/virtio_net_polled_provider.rs), so userspace does not yet place its own descriptors. The mechanism to let it do so safely – a manager-owned bounce buffer or a domain-scoped IOMMU IOVA, never a raw host-physical address – is already landed (the production DMA-isolation track; see “The Userspace-Ownable vring Slice” below); the slice-2 work wires it to the driver’s vring.Interruptis wait-only over a kernel-latched used ring.Interrupt.waitreads a kernel-latched used-ring index,acknowledgeis a no-op, and mask/unmask are refused. A real driver owns its IRQ lifecycle (mask, unmask, EOI ordering).
Classifying the virtio-net bring-up steps against what userspace can do today makes the gap concrete – almost every step is kernel-only:
| Bring-up step | Userspace-doable today? |
|---|---|
| Device reset (write status = 0) | No – needs writable status register |
| ACKNOWLEDGE / DRIVER status bits | No – needs writable status register |
| Feature negotiate (select + read + write + FEATURES_OK) | No – needs writable feature-select/feature + status |
| Queue program (queue-select, queue-size, queue-address, queue-enable) | No – needs writable common-config + a device-usable vring address |
| vring allocation (avail/used/descriptor tables) | No – vring is kernel-owned; no device-usable buffer address export |
| DRIVER_OK (write status) | No – needs writable status register |
| MSI-X program / vector assignment | No – kernel-owned |
| Submit + notify (ring doorbell) | Partial – the one selected doorbell write @5 exists |
| Poll used ring | Partial – via the kernel-latched index Interrupt.wait reads |
| Teardown (reset, scrub, release) | No – kernel-owned reset path |
The Nic ABI (Inline Data, Per the Proposal Draft)
This track keeps the networking-proposal Part 3 frame ABI: frames cross the cap
boundary as inline Data (transmit @0 (frame :Data), receive @1 () -> (frame :Data), networking-proposal:443). The kernel copies the frame into and
out of the manager-owned bounce buffer the polled provider already established,
so no host-physical address or device-usable buffer handle is exported to
userspace and host_physical_user_visible=0 is preserved. capOS method
convention adds the result/reason/sideEffect evidence triple and the
observed EtherType to the result:
interface Nic {
transmit @0 (frame :Data)
-> (result :Text, reason :Text, sideEffect :Text);
receive @1 ()
-> (frame :Data, observedEthertype :UInt16,
result :Text, reason :Text, sideEffect :Text);
macAddress @2 () -> (addr :Data);
linkStatus @3 () -> (up :Bool);
}
Why inline Data, not a zero-copy buffer handle. A DmaBuffer-handle
(zero-copy) ABI was considered – it would avoid the per-frame copy – but
rejected to keep the change small: it introduces a new buffer-ownership protocol
across the cap boundary (who allocates, who frees, lifetime versus the call) on
top of the security work this track already requires, for a copy cost that does
not matter at research scale. Inline Data matches the accepted proposal draft,
keeps the frame staging kernel-owned exactly as the polled provider proved, and
defers any zero-copy optimization to a later, separately justified slice.
The Common-Config Window (Selected-Write, No New Ruling)
Relocation writes the virtio common-config window from userspace through a
bounded, range-checked, selected-write path modeled on the existing single
selected write (notify_doorbell @5, kernel/src/cap/device_mmio.rs;
device_manager::provider_notify_doorbell_write_for_cap). This is the next
register in the accepted selected-write pattern, not a new security relaxation
requiring a ruling. DeviceMmio already refuses raw write32
(register_write = "blocked") and admits exactly the claimed selected write,
range-checked against the decoded BAR and followed by a kernel-asserted
read-back; the handshake registers are added to that same admission list under
the same discipline.
The bounded design:
- A selected-write common-config window: only the named virtio common-config registers needed for handshake (device status, feature-select, device-feature, driver-feature) are writable in slice 1, each range-checked against the claimed BAR.
- Read-back-assertion discipline: every selected write is followed by a read-back the kernel asserts, so a userspace driver cannot leave the device in an unverified state.
- Queue-address registers stay fail-closed in slice 1 and are admitted to the same selected-write list only in slice 2, where each programmed value must resolve to a device-usable address the writing driver was granted (the DMA-isolation discipline below decides), so userspace can never point the device at arbitrary physical memory.
No new ruling is pending: the project already decided this posture through the accepted selected-write discipline and the IOVA-export discipline below. Slice 1 is ready.
The Userspace-Ownable vring Slice (Reuses Landed DMA Isolation)
The expensive-sounding piece – slice 2, a userspace-ownable vring plus a device-usable buffer-address export – is wiring already-landed isolation to the driver’s vring, not building isolation from scratch. The two backends and the no-host-physical export discipline are all landed:
- Bounce-buffer path (production default on no-IOMMU shapes). The runtime
DMA-backend probe (
kernel/src/dma_backend.rs,select_and_report/probe_verified_usable_iommu) selects the labeled bounce-buffer fallback fail-closed when no usable guest IOMMU is verified, and the manager-owned bounce-bufferDMAPool/DMABufferlifecycle (kernel/src/device_dma.rs, scrub-before-free, owner/slot generations, quiesce-before-release) is the landed authority a driver uses for device-visible buffer memory (cloud-prod-dmapool-bounce-buffer-grant-proof). - IOMMU-IOVA path (graduates when the probe verifies usable hardware). The
Intel VT-d remapping path (
kernel/src/iommu.rs,cfg(qemu)today) programs per-device domains, maps manager-ownedDMAPoolpages, and exports only a domain-scoped IOVA – never a host-physical address (ddf-iommu-remapping-production-closeout,ddf-iommu-production-dmapool-ledger-integration,ddf-iommu-per-device-domain-granularity,ddf-iommu-production-revoke-teardown-hostile-smokes,ddf-real-dma-iommu-direct-path). - No-host-physical export discipline. The IOVA-export discipline
(
ddf-iommu-iova-export-discipline) and thehost_physical_user_visible = false/iova_export = "disabled-future-only"posture (kernel/src/cap/dma_buffer.rs) guarantee a driver receives only a device-usable address (bounce handle or domain-scoped IOVA), never a raw host physical address.
The accepted contract is docs/dma-isolation-design.md (“Cloud DMA Backend”
runtime-selection rule and the IOVA-export-discipline clause), and the S.11.2
hostile-smoke matrix is already enforced for both backends. The remaining slice-2
work is therefore to let the userspace driver allocate its vring through the
granted DMAPool (bounce, or IOVA-backed when the probe verifies usable
hardware), learn the device-usable address for each ring, and program those
addresses into the queue-address registers over the slice-1 writable window –
under the landed fail-closed / scrub / quiesce / revoke discipline. The
networking-proposal Phase C prerequisites table and S.11.2 are satisfied by
the landed track, not deferred to it.
Bounded Slice Chain
- Userspace status + feature handshake to FEATURES_OK over a writable
selected-write common-config window (the next register in the accepted
notify-doorbell selected-write discipline). [DONE 2026-06-02.] The
cap::devicemmio_grant_source_prodsource stages the virtio-net common-config window as a writable selected-writeDeviceMmiogrant (stage_virtio_net_common_config); the userspace shim drives the handshake overDeviceMmio.read32/write32, the write admission (device_manager::stub::write_devicemmio_u32) admits only the four handshake registers (range-checked + read-back-asserted) and refuses queue-address writes; the unimplementedNicstub is inschema/capos.capnp. Proofmake run-cloud-prod-nic-driver-userspace-features-ok. Task record:docs/tasks/done/2026-06-02/cloud-prod-nic-driver-userspace-features-ok-local-proof.md. - Userspace-ownable vring + device-usable address export – reuses the
landed production DMA isolation (bounce policy +
dma_backendprobe + IOMMU IOVA-export, S.11.2 already enforced); the work is wiring it to the driver’s vring. [DONE 2026-06-03.] Undercloud_virtio_net_userspace_ownable_vring_proof(implies slice 1) the userspace shim co-receives the writable common-configDeviceMmiogrant and a bounce-bufferDMAPoolgrant on the same virtio-net function; it allocates its descriptor / available / used ring pages, learns each buffer’s opaque device-usable handle fromDMABuffer.info(deviceIova, scopebounce-handle), and programsqueue_desc/queue_driver/queue_deviceover the slice-1 window.device_manager::stub::write_devicemmio_u32(admit_virtio_queue_address_write) resolves each handle against the liveDMAPoolgrant ledger (resolve_virtio_vring_device_address) to the real bounce host-physical address, programs that address (never the handle), and read-back-asserts; queue-address reads (0x20..0x38) are refused so the host-physical address is never exposed, and out-of-grant / host-physical / stale-generation writes fail closed.queue_enablestays fail-closed. Proofmake run-cloud-prod-nic-driver-userspace-ownable-vring. Task record:docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-ownable-vring-local-proof.md. - Userspace queue-program + DRIVER_OK over the (now device-addressable)
vring. [DONE 2026-06-03.] Under
cloud_virtio_net_userspace_queue_enable_driver_ok_proof(implies slice 2) the userspace shim completes device bring-up: after slice 2’s queue-address programming it writesqueue_enable = 1(0x1c) for its programmed TX queue and setsDRIVER_OKover the already-writable device-status register.device_manager::stub::write_devicemmio_u32admits thequeue_enablewrite only when the active queue’s vring memory is live and page-fitting (selected_queue_ready_to_enable): it reads the activequeue_desc/queue_driver/queue_deviceback kernel-side and requires each to currently hold the host-physical address of a live grantedDMABuffer(a freed buffer’s stale address cannot arm a use-after-free DMA target), and requires the activequeue_sizeto fit every split-ring structure inside one granted bounce page; an enable of an unprogrammed, freed, or oversized queue fails closed, and the enable is read-back-asserted. Once enabled, the queue’s vring base registers are immutable – a queue-address repoint is refused so the driver cannot mutate the vring under a running device. TheDRIVER_OKdevice-status write is kernel-asserted: the kernel re-reads device-status and fails closed unless the device latched theACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OKbyte exactly (rejectingFAILEDandDEVICE_NEEDS_RESET). Queue-address reads stay refused; no host-physical is exposed; no new DMA isolation backend. Proofmake run-cloud-prod-nic-driver-userspace-queue-enable-driver-ok. Task record:docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-queue-enable-driver-ok-local-proof.md. 4a. Userspace RX queue 0 bring-up + buffer-identity binding + ring-buffer pinning, then the first real RX DMA from the shim-owned vring. Split into two landed/ready sub-slices because the real-DMA hybrid bring-up is large:- 4a-i [DONE 2026-06-03]. The shim brings up RX queue 0 over its own
vring (slices 1-3 brought up only the TX queue; the
queue_enableadmission is queue-agnostic).device_manager::stubretains each programmed queue’s vring physes + originatingDMABufferhandle identity onProductionDeviceRecord(admit_virtio_queue_address_write), bindsqueue_enableto that identity (a freed buffer’s stale handle, or a freed-then-reallocated frame at the same host-physical address, fails closed withdevicemmio-queue-enable-identity-mismatch), and pins the ring buffers againstfreeBuffer/ process-teardown release while the queue is enabled (dmabuffer-pinned-enabled-vring), releasing only on disable/reset with quiesce. This completes the vring buffer-lifetime binding slice 3 left point-in-time at the bring-up boundary. No device DMA. Proofmake run-cloud-prod-nic-driver-userspace-rx-bringup. Task record:docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-bringup-and-buffer-pinning-local-proof.md. - 4a-ii [DONE 2026-06-03]. The first real RX DMA from the shim vring:
the shim also brings up TX queue 1 over its own vring, posts one
device-writable RX receive buffer on queue 0 (
DMABuffer.submitDescriptor), and rings the productionDeviceMmio.notifyDoorbell @5(the previouslyErr(stale_handle)provider_notify_doorbell_write_for_cap, now live;cap::devicemmio_grant_source_prodmaps the notify region kernel-side and captures the per-queue notify slot offsets). The kernel (cap::virtio_net_userspace_rx_dma_proof) drives the RX publish over the shim’s retained RX physes + a kernel-half SLIRP TX ARP stimulus over the shim’s retained TX physes + one real device->host RX DMA (used_len > 0, observed EtherType0x0806), latches the used-ring index (int_injected = 0, noInterruptcap), and resets the device – quiescing the queues and releasing the ring-buffer pins – WITHOUT theNiccap. Proofmake run-cloud-prod-nic-driver-userspace-rx-bringup(extended). The deterministic freed-then-reallocated-frame identity negative is split to a follow-up: the next-fit frame allocator (capos-libFrameBitmap,free_framedoes not rewindnext_hint) never returns a just-freed frame on the next allocation, so a deterministic same-phys realloc – needed to reach the slice-4a identity gate rather than the slice-3 phys gate – requires an allocator reuse seam. The data path is cooperative-shim-safe (kernel authors the descriptor + avail inside the drive window, re-validates the posted buffer live + unmapped before publishing, resets on every post-publish path); the HOSTILE-shim residuals (kernel-exclusive/unmappable enabled vring rings + reset-failure payload quarantine) are closed in 4a-iv; the identity negative is closed in 4a-iii. Task records:docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-local-proof.md; follow-upsdocs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-identity-realloc-negative-local-proof.md,docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-hostile-shim-hardening-local-proof.md. - 4a-iv [DONE 2026-06-03]. HOSTILE-shim hardening over the 4a-ii data
path. The shim owns its vring ring buffers (granted bounce DMABuffers it can
DMABuffer.map), so the kernel cannot trust ring contents it did not author and lock. The closed gaps: (1) a programmed/enabled vring ring buffer is made kernel-exclusive –validate_dmabuffer_map_admissionrefuses a NEWDMABuffer.mapof it while the queue is enabled,selected_queue_identity_boundrefusesqueue_enablewhile a ring buffer still carries a live user mapping (a kept pre-enable VMA) and refuses arming a queue whose ring buffers are not pairwise-distinct pages – both within the queue and across the device’s other enabled queues (an aliased desc/driver/device, orrx.desc == tx.desc, would let the kernel-authored ring writes corrupt a descriptor into a non-bounce DMA target), andrecord_dmabuffer_user_mappingrefuses to record a mapping on a programmed/enabled ring (an in-flight SMP map), and the RX-DMA submit/drive admission refuses to publish an enabled ring buffer as the RX DMA payload (the shim cannot point the device at its own ring page); (2) atqueue_enablethe kernel WIPES the queue’s descriptor table slot 0, available ring, AND used ring (virtio_net_userspace_rx_dma_proof::sanitize_enabled_queue_rings), so a shim that pre-wroteavail.idx/ a tampered descriptor / a spoofedused.idxwhile the queue was disabled cannot pre-publish it into the enabled window – the device sees an empty queue with no pre-staged completion until the kernel-authored drive publishes and a real device DMA advancesused.idx; (3) a per-bdf RX-DMA payload drive pin / reset-failure quarantine (device_manager::stubbegin_rx_dma_drive_pin/clear_rx_dma_drive_pin/mark_rx_dma_payload_quarantine_permanent, consulted by themapadmission + the record path + thefreeBuffer/teardown detach) is set atomically with the live+unmapped re-validation under the device-table lock for the drive duration, cleared after a confirmed reset, and promoted to a permanent quarantine on the catastrophic reset-failure path (never downgraded by a later drive). The smoke (make run-cloud-prod-nic-driver-userspace-rx-bringup, extended) proves a still-mapped ring blocksqueue_enable, the post-enable map refusal on the descriptor + driver rings, and that a hostile pre-enable descriptor/avail.idxtamper does NOT survive (RX DMA still completes withused_id=0, ARP EtherType). The drive pin’s SMP map/free race and the reset-failure quarantine are structural fail-closed hardening: the single-CPU cloudboot proof cannot reach the race and QEMU virtio reset always succeeds, so they are not separately QEMU-observable. Residual [CLOSED 2026-06-03 in 4a-v]: two SMP-only, bounce-confined races betweenDMABuffer.mapand a device-authority state transition – the cap-sidemap_page_into_userinstalled the user PTE before the manager-side record (a manager-rejected map left a transient provisional PTE), and thequeue_enableno-live-mapping check was not atomic with the retainedenabledflag flip (a concurrent map could record a mapping in the enable window) – were not reachable by the single-CPU proof and are closed by slice 4a-v below. Predecessor task record:docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-hostile-shim-hardening-local-proof.md. - 4a-iii [DONE 2026-06-03]. The deterministic
freed-then-reallocated-frame
queue_enableidentity negative carved out of 4a-ii. A proof-only one-shot bounce-frame reuse seam (mem::frame::proof_try_alloc_specific_frame_zeroedconsumed by adevice_dmareuse hint armed on each bounce free, both gated behindcloud_virtio_net_userspace_rx_bringup_proof, never compiled into production) makes a same-host-physicalDMABufferrealloc reachable from the userspace harness despite the production next-fitFrameBitmap. The smoke programs a transient ring buffer into rxqueue_desc, frees it, relands a fresh-handle buffer on the same frame, and provesqueue_enablefails closed on the slice-4a identity gate (authority_result = devicemmio-queue-enable-identity-mismatch) – the recorded phys is live again so the slice-3 phys gate passes, and the marker flipsidentity_realloc_negative=enforced. The seam does not relax the next-fit policy or the identity gate. Proofmake run-cloud-prod-nic-driver-userspace-rx-bringup(extended). Task record:docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-rx-dma-identity-realloc-negative-local-proof.md. - 4a-v [DONE 2026-06-03]. Closes the two SMP-only, bounce-confined
DMABuffer.map-vs-device-authority races split out of 4a-iv. (1)kernel/src/cap/dma_buffer.rs::map_page_into_usernow takes the manager-side mapping record (record_dmabuffer_user_mapping, which acquires the device-table lock) BEFORE installing the user PTE, with the caller address-space lock held across the manager call (lock order address-space -> device-table -> quarantine, no reverse nesting): a manager-rejected map returns before any PTE is installed, so a concurrent SMP thread can never touch a transient provisional mapping at the deterministic auto-picked base. (2)kernel/src/device_manager/stub.rs::try_enable_selected_queuefolds thequeue_enableno-live-mapping identity check and a retainedenable_pinnedcommit pin into one device-table critical section, set before the MMIO arm write and rolled back on a readback mismatch, so a concurrent map cannot record a mapping in the enable window. The commit pin is kept distinct from the device-armedenabledbit (set after the MMIO write): map admission /freeBufferpinning consultenabled || enable_pinned, while the RX-DMA drive gates onenabledalone, so a concurrent drive cannot run – and reset-clear the retained state – inside the arm window. A per-queueenable_in_progressclaim serializes concurrentqueue_enabletransitions (enable vs enable, enable vs disable) so a racingqueue_enable = 0can never clear an in-flight enable’s commit pin, and the drive’s reset cleanup skips clearing while a transition is in flight so a stale drive cannot clear a newer enable’s pin. The two transitions now serialize on the device-table lock: one fails closed. The interleavings are not single-CPU-reachable and the live kernel statics are not host-testable, so the proof is an exhaustive Loom interleaving model (capos-config/tests/dmabuffer_map_enable_loom.rs, run viacargo test-dmabuffer-map-enable-loom) that asserts no schedule arms a queue with a live user PTE on its ring buffer or installs a PTE without an accepted record; the single-CPUmake run-cloud-prod-nic-driver-userspace-rx-bringupregression continues to pass unchanged. Residual [CLOSED 2026-06-03]: the follow-up fenced the drive’s post-reset cleanup on a per-queue transition generation.RetainedVringQueue::generationis bumped on each completed transition (mark_selected_queue_armedenable arm +finish_selected_queue_disabledisable finish); the drive reads it (retained_queue_generation) IMMEDIATELY beforereset_device– the reset boundary, not the far-earlierenabledsample – and the cleanup (mark_retained_vring_queue_disabled_if_epoch) clears the pins only while the queue is still in the epoch the reset quiesced. A disable + re-enable that fully completes AFTER the reset advances the generation, so the cleanup becomes a no-op instead of clearing the freshly armed epoch’s pins; one that completed BEFORE the reset is included in the captured generation, so its pins clear (no stale over-pin). The only residual is a transition completing in the tiny read->reset-MMIO gap whose arm loses the race: a fail-closed over-pin, never an under-pin. The Loom model gainedfenced_stale_drive_cannot_clear_a_completed_re_enables_pins, which fails with the fence removed (GENERATION_FENCE = false). Task record:docs/tasks/done/2026-06-03/cloud-prod-nic-driver-queue-enable-drive-reset-epoch-fence-local-proof.md. Predecessor (4a-v):docs/tasks/done/2026-06-03/dmabuffer-map-record-before-pte-install-ordering-local-proof.md. 4b.Nic-cap driver process, coupled TX/RX round-trip [DONE 2026-06-03]. Implements the slice-1Nicinterface stub as a liveCapObject(kernel/src/cap/nic_grant_source_prod.rs, granted via the newnicKernelCapSourceregistered incapos-config; clientNicClientincapos-rt). Over the SAME shim-brought-up device (RX queue 0 + TX queue 1 enabled by slices 1-3 + 4a), the cap drives the shim’s retained vring physes throughvirtio_net_userspace_rx_dma_proof::{nic_transmit, nic_receive, nic_quiesce}:receive()internally drives the coupled ARP-TX-stimulus + RX-poll and returns the received frame inline plus the observed EtherType;transmit()stages a frame into a manager-owned TX bounce page over the retained TX vring and rings the doorbell;macAddress()/linkStatus()read the kernel-mapped virtio-net device-config region. Frames cross the cap boundary as inlineDatacopied through manager-owned kernel bounce pages (host_physical_user_visible = 0; no host-physical / device-handle exposure); the device is left live for the cap’s lifetime and quiesced once on cap release (nic_quiesce: reset + queues-cleared + release the enabled-vring pins). Completion stays kernel-latched used-ring polled (int_injected = 0, noInterruptcap). The clean independent TX/RX split is deferred to slice 6; userspace IRQ ownership to slice 5. Proofmake run-cloud-prod-nic-driver-userspace-nic-cap-roundtripboots the device from userspace, round-trips two sequential frames through the typedNiccap (observed EtherType0x0806over QEMU SLIRP), and emits onecloudboot-evidence: nic-driver-userspace-nic-cap-roundtrip <token>marker withroundtrips=2. The same proof releases the parentDMAPoolcap and one pinned ringDMABuffercap beforeNicrelease, then showsNicquiesce replaying the blocked buffer detach and the pending parent pool detach completing after the remaining ring buffers are freed. Depends on 4a-ii (the shim-owned-vring data path). Task record:docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-nic-cap-roundtrip-local-proof.md.
- 4a-i [DONE 2026-06-03]. The shim brings up RX queue 0 over its own
vring (slices 1-3 brought up only the TX queue; the
- Userspace IRQ ownership [DONE 2026-06-03]. The userspace NIC driver’s
Interruptcap for its device RX route becomes real, replacing slice-4b’s kernel-latched used-ring polled completion (int_injected = 0). Undercloud_virtio_net_userspace_irq_ownership_proof(implies slice 4b) a newInterruptgrant source (kernel/src/cap/virtio_net_userspace_irq_ownership_proof.rs, replacing the admission-onlyinterrupt_grant_source_prodsource via theKernelCapSource::Interruptarm) programs the staged virtio-net function’s RX MSI-X route (entry 0) mask-first through the landed always-builtcap::interrupt_programmed::program_attach_arm_unmaskand issues a cap whose:waitblocks on a real interrupt dispatch through the route’s MSI-X / LAPIC dispatch slot (device_interrupt::wait_kernel_injected_dispatch;delivery_countadvances, soint_injectedflips from 0 – slice 4b had noInterruptcap on the data path at all). The wake is a bounded kernel-injected dispatch through the route’s real deferred-EOI machinery, not yet a device-autonomous MSI-X raise causally tied to a specific frame;Nic.receivestill reads the frame bytes from the used ring, so this slice delivers IRQ-lifecycle ownership (the driver drives the realwait/acknowledge/mask/unmask), not interrupt-coalesced RX completion.acknowledgeretires exactly one deferred LAPIC EOI throughdevice_interrupt::acknowledge_deferred_lapic_eoi_for_route(hardwareDispatchAckDelta = 1); andmask/unmasktoggle the route’s own MSI-X vector-control bit (mask-first per PCI 3.0 §6.8.2) plus the manager-attached route state, throughpci::set_msix_table_entry_mask+device_interrupt::{mask,unmask}_device_manager_attached_route. The route is torn down (interrupt_programmed::teardown) on cap release. The driver holds thisInterruptcap alongside the slice-4bNic/DeviceMmio/DMAPoolcaps on the same function; it brings the device up from userspace, then drives the owned RX-interrupt lifecycle and reads the completed frame back throughNic.receive. The PCI function-level MSI-X enable bit is not toggled and no device-autonomous raise is attempted (device_autonomous_raise=not-attempted,waiter_wake=kernel-injected-dispatch); the landed DMA isolation, the slice-2/3 vring grants, and the buffer-identity / ring-buffer pinning are reused unchanged (host_physical_user_visible = 0, queue-address reads still refused). No newInterruptinterface or method (the existingwait/acknowledge/mask/unmaskbecome real for this route). Proofmake run-cloud-prod-nic-driver-userspace-irq-ownershipemits onecloudboot-evidence: nic-driver-userspace-irq-ownership <token>marker. Task record:docs/tasks/done/2026-06-03/cloud-prod-nic-driver-userspace-irq-ownership-local-proof.md. - Clean TX/RX split [DONE 2026-06-03]. Decouples the coupled receive path
into independent TX and RX submission. Under
cloud_virtio_net_userspace_clean_tx_rx_split_proof(implies slice 5) theNiccap’sreceive @1dispatches to the newvirtio_net_userspace_rx_dma_proof::nic_receive_independentinstead of the couplednic_receive: it posts a manager-owned RX receive buffer on the retained RX vring, waits on the driver’s OWNED RX interrupt route (the slice-5device_interrupt::wait_kernel_injected_dispatchdispatch slot, resolved throughvirtio_net_userspace_irq_ownership_proof::owned_rx_route), retires the deferred LAPIC EOI, and reads the completed frame from the RX used ring – with no internal ARP-TX self-stimulus (it never submits to the TX vring). The RX frame is driven by an external stimulus: the consumer’s preceding independentNic.transmitof a real broadcast ARP request, which QEMU SLIRP answers; the inbound reply is held in the host net queue until the RX buffer is posted.Nic.transmitstays independent (submits the caller’s frame and rings the TX doorbell with no RX poll; surfacesrx_polls=0). The wake stays the bounded kernel-injected dispatch slice 5 owns (waiter_wake=kernel-injected-dispatch,device_autonomous_raise=not-attempted). Reuses the landed owned-vring / owned-IRQ / DMA-isolation unchanged: no new selected-write register, no new MSI-X surface, no newNic/Interruptmethod (make generated-code-checkgreen), no host-physical / handle exposure (host_physical_user_visible = 0, queue-address reads refused). The driver does an independenttransmitthen a separate independentreceive, neither performing the other’s submission (tx_independent=ok,rx_independent=ok,receive_self_stimulus=removed). Proofmake run-cloud-prod-nic-driver-userspace-clean-tx-rx-splitemits onecloudboot-evidence: nic-driver-userspace-clean-tx-rx-split <token>marker. Task record:cloud-prod-nic-driver-userspace-clean-tx-rx-split-local-proof. - Network-stack process + smoltcp relocation – a second userspace process
holding the
Niccap and a bounded time source, runningsmoltcp, implementing the socket caps while preserving thecap/network.rscontract. Slice 6 is now done, so this slice is decomposed into bounded increments rather than attempted as one step:- 7a (first increment, DONE 2026-06-03): network-stack-process skeleton. A
userspace process runs a minimal
smoltcpInterfaceover aphy::Deviceadapter backed by the landed independentNic.transmit/Nic.receive(slice 6), clocked by aTimercap, and drives one observable Ethernet exchange through SLIRP:smoltcp– not hand-rolled frame code – ARPs the gateway out throughNic.transmit, consumes the reply in throughNic.receive, and emits the queued IPv4/UDP datagram, so the neighbour cache observably advances. No socket caps, nocap/network.rsrelocation,virtio_stub.rsunchanged. Proofmake run-cloud-prod-network-stack-process-smoltcp-skeleton. Implementation note: the landedNiccap is not yet self-sufficient – itstransmit/receiveride on the userspace driver shim’s retained vring (the kernel does not own the vring), so the skeleton process performs the slice-1-6 bring-up itself (it also holds theDeviceMmio/DMAPool/Interruptcaps) before runningsmoltcp. Splitting the bring-up into a separate long-lived NIC-driver service, so the network-stack process holds onlyNic+Timer+Console, is folded into the 7c contract-relocation increment; it does not change the proven smoltcp-substrate claim. Task record:cloud-prod-network-stack-process-smoltcp-skeleton-local-proof. - 7b (socket layer, DONE 2026-06-03): socket caps over the userspace
smoltcp stack – a userspace
UdpSocketcap layer (UdpSocketCapLayer) implements theUdpSocketschema’ssendTo/recvFromsemantics over the 7aInterfaceand proves one bounded UDP request/response: a DNS A query forexample.comto SLIRP’s resolver at10.0.2.3:53viasendTo, then the decoded response viarecvFrom. smoltcp drives every frame through theNiccap (ARP reply + DNS reply both fetched throughNic.receive,host_physical_user_visible = 0preserved);Timerclocks the poll. The socket layer is in-process – it does not yet serve the socket interfaces as inter-process transferable capabilities, and it does not touchcap/network.rs(virtio_stub.rsstays fail-closed). Proofmake run-cloud-prod-network-stack-smoltcp-socket-caps(onecloudboot-evidence: network-stack-smoltcp-socket-caps <token>marker). Task record:cloud-prod-network-stack-smoltcp-socket-caps-local-proof. Serving the socket interfaces as inter-process caps (aNetworkManager-like broker) andTcpListener/TcpSocketare folded into the 7c contract relocation. - 7c (contract relocation): preserve the
cap/network.rscontract behind the userspace network-stack process so the production L4 entry points (virtio_stub.rs) stop returningDeviceUnavailablefor the armed manifest. This is the body of the whole-slice record (cloud-prod-userspace-network-stack-smoltcp-local-proof), decomposed because it bundles three independently-large pieces:- 7c-i (inter-process socket cap, DONE 2026-06-03): serve the slice-7b
userspace
UdpSocketCapLayeras a real inter-process transferable capability. A network-stack server process brings the device up, builds the userspacesmoltcpUdpSocketlayer, and serves theUdpSocketschema (sendTo/recvFrom/close) over an exportedEndpoint; a separate client process re-interprets the served cap as aUdpSocketand drives one bounded DNS A query/response through the productionUdpSocketClient, withsmoltcpstill moving every frame through theNiccap (host_physical_user_visible = 0). Proofmake run-cloud-prod-network-stack-smoltcp-udp-socket-cap-ipc. This is the prerequisite for the kernel contract relocation.cap/network.rs/virtio_stub.rsare unchanged. Task record:cloud-prod-network-stack-smoltcp-udp-socket-cap-ipc-local-proof. - 7c-ii (
cap/network.rsrelocation, DONE 2026-06-07): route the armed Phase C manifest’s production L4 socket entry points to the userspace network-stack service so the local proof no longer reachesvirtio_stub.rsDeviceUnavailable. This is itself decomposed (see “7c-ii Mechanism and Decomposition” below) into 7c-ii(a) – serveTcpListener/TcpSocketas inter-process caps with accept-returns-a-result-cap, an architecture-agnostic prerequisite (cloud-prod-network-stack-smoltcp-tcp-socket-cap-ipc-local-proof) – and 7c-ii(b) – the local serve-from-userspace proof. In 7c-ii(b) (DONE 2026-06-07), the network-stack process boots the non-qemucloudboot kernel, spawns an application client with onlyConsoleplus a userspace-servedTcpListenAuthority, returns a userspace-servedTcpListenerfromlisten, returns aTcpSocketresult cap fromaccept, and completes one hostfwd TCP request/response throughrecv/send. The selected architecture remains serve-from-userspace: applications consume the existing typed socket interfaces from a userspace network-stack process instead of extending the legacy kernel-routed socket owner. - 7c-iii (
TcpListener/TcpSocket, DONE 2026-06-04): the userspacesmoltcpstack now serves a realTcpListener/TcpSocketround trip over theNiccap, using the sustained-receiveNic.receivePoll @4(slice 7d) for the multi-frame TCP exchange. A single cloudboot service brings the device up from userspace, runs asmoltcpTCP socket listening on port 8080 driven by the non-resettingreceivePoll @4pump, and – against an external host TCP client over a QEMUhostfwdrelay – completes one bounded TCP handshake + request/response (asserting the received request equals the expected probe and echoing it back).smoltcp, not hand-rolled frame code, moves every frame; the device stays armed across the SYN/SYN-ACK/ACK/request/response/FIN exchange with no per-frame reset.host_physical_user_visible = 0; queue-address reads refused; the bounce RX pool quiesces + scrubs on teardown. This increment proved the generic TCP socket substrate in-process; 7c-ii(a) later servedTcpListener/TcpSocketinter-process with accept-returns-a-result-cap. Neither increment changescap/network.rs/virtio_stub.rs; that final production-manifest wiring is 7c-ii(b). Proofmake run-cloud-prod-network-stack-smoltcp-tcp-listener-roundtrip. Task record:cloud-prod-network-stack-smoltcp-tcp-listener-roundtrip-local-proof. Built on the prerequisitecloud-prod-nic-driver-userspace-sustained-receive-pool-local-proof.
- 7c-i (inter-process socket cap, DONE 2026-06-03): serve the slice-7b
userspace
- 7a (first increment, DONE 2026-06-03): network-stack-process skeleton. A
userspace process runs a minimal
- Kernel
smoltcp/ virtio-net removal (Phase C exit) – done 2026-06-08. The kernel no longer depends onsmoltcp, and the qemu-onlycap/network.rssocket entry points now fail closed instead of reaching an in-kernel TCP/UDP runtime. The retained virtio-net code is a lower-layer QEMU fixture for PCI/MMIO/virtqueue, ARP, ICMP, and descriptor-generation proofs; it is not the production cloud socket owner. Task record:cloud-prod-phase-c-kernel-smoltcp-virtio-net-removal.
Each slice is parented to the appropriate predecessor; slices 7-8 deliver the L4 socket reachability the scope decision sequenced after the cloud milestone.
7c-ii Mechanism and Decomposition
7c-ii is “preserve the cap/network.rs contract behind the userspace
network-stack process.” Before the serve-from-userspace path landed, the
non-qemu cloud kernel could still grant production L4 methods
(NetworkManager.createTcpListener, TcpListenAuthority.listen,
TcpListener.accept, TcpSocket.send/recv, the UDP methods) through
kernel-side CapObjects in kernel/src/cap/network.rs that call
crate::virtio::*. In that build, crate::virtio resolves to
kernel/src/virtio_stub.rs: creation entry points return
NetworkError::DeviceUnavailable, while existing-handle operations fail closed
with invalid-handle errors. Before Phase C exit cleanup, the working smoltcp
runtime existed only in the cfg(qemu) kernel/src/virtio.rs; it has now been
removed, and the qemu socket entry points match the same fail-closed shape.
Relocating the contract means production L4 methods are satisfied by the
userspace network-stack service instead of the stub, and non-qemu bootstrap
grants for the legacy kernel NetworkManager and TcpListenAuthority sources
now fail closed before those CapObjects are minted.
Mechanism constraint. A literal reading – the kernel keeps serving the L4
caps and forwards each call to the userspace service – requires the kernel to
originate a capability Call to a userspace-served Endpoint and complete the
original caller once the service returns. That kernel-as-client-of-a-userspace-
service path does not exist today: the ring/endpoint machinery
(kernel/src/cap/ring.rs, kernel/src/cap/endpoint.rs,
kernel/src/cap/transfer.rs) only dispatches SQEs from a userspace process’s
ring; the kernel never enqueues a Call against a userspace endpoint nor parks
waiting for the Return. Building that inversion is a new kernel IPC subsystem,
not a drive-by, and it adds kernel surface that the Phase C exit deliberately
avoids by retiring the kernel L4 owner instead.
7c-ii(b) Architecture Decision: Serve From Userspace
On 2026-06-07 the operator selected serve-from-userspace for the final
production-manifest wiring. The armed Phase C manifest receives a
userspace-served NetworkManager or TcpListenAuthority cap from the
network-stack service. The legacy kernel cap/network.rs / virtio_stub.rs
socket path is now fenced to cfg(qemu) fixture manifests and stale negative
paths: non-qemu manifests that request kernel network_manager or
tcp_listen_authority are rejected during bootstrap, so missing served
authority does not fall back to the old kernel socket owner.
The rejected alternative for this stage is kernel-brokered forwarding: keeping the kernel as the socket cap server while it calls into a registered userspace network-stack service. That route would add a new kernel-originated Endpoint/deferred-completion subsystem. The selected direction keeps the microkernel boundary cleaner by moving behavior to userspace where that does not compromise security.
Grounding for this decision:
kernel/src/cap/network.rsimplements the current kernel-servedNetworkManager,TcpListenAuthority,TcpListener,TcpSocket, andUdpSocketobjects, including result-cap transfer for listener and socket creation.kernel/src/virtio_stub.rsremains the non-qemunegative-result endpoint for stale kernel networking call sites, but bootstrap no longer grants itsNetworkManagerorTcpListenAuthoritycallers in production manifests.kernel/src/cap/ring.rs,kernel/src/cap/endpoint.rs, andkernel/src/cap/transfer.rsimplement userspace-originatedEndpointcalls, receive/return, and capability transfer. They do not currently implement a kernel-originated call into a userspace endpoint.- The 7c-i and 7c-ii(a) task records prove userspace-served socket caps over
the existing
Endpointand RETURN result-cap path:cloud-prod-network-stack-smoltcp-udp-socket-cap-ipc-local-proofandcloud-prod-network-stack-smoltcp-tcp-socket-cap-ipc-local-proof. docs/research/capability-systems-survey.mdanddocs/research/spritely-captp-ocapn.mdreinforce the local capOS rule used here: explicit object references are authority, and forwarding/proxying should make authority flow and lifetime explicit rather than hiding it behind ambient names.
Current constraints that both options must preserve:
- Existing caller contract.
NetworkManager.createTcpListener,TcpListenAuthority.listen,TcpListener.accept,TcpSocket.send/recv, and the UDP methods keep the same typed Cap’n Proto surfaces unless a later implementation task explicitly changes schema and regenerated bindings. - Manifest authority. Only the armed Phase C manifest receives the new production L4 path. Missing registration, missing served cap, stale service identity, or an unarmed manifest fails closed instead of falling back to a broad kernel escape hatch; the legacy kernel socket sources are qemu-only fixtures.
- No new packet authority. The socket service consumes the existing
Niccap and the landedreceivePoll @4sustained-receive path. This decision does not add DMA, MMIO, IRQ, queue-address, host-physical, GCE, public ingress, TLS, or certificate authority. - Private/local Web UI critical path. The first Web UI proof remains local and then private GCE reachability. Public ingress, TLS, firewall/DNS exposure, and live-cloud runs stay gated by their separate task records.
| Axis | Kernel-brokered forwarding (rejected for 7c-ii(b)) | Serve-from-userspace (selected) |
|---|---|---|
| Cap server identity | The kernel keeps serving cap/network.rs objects for the armed manifest and forwards each socket operation to a registered userspace network-stack service. | The armed manifest receives a userspace-served NetworkManager or TcpListenAuthority cap from the network-stack service, following the 7c-i / 7c-ii(a) serving pattern. |
| Kernel IPC delta | Adds a kernel-as-client forwarding path: service registration, endpoint identity validation, kernel-originated call construction, transfer handling, cancellation, and deferred completion back to the original caller. | Adds no kernel-originated endpoint call path. The existing userspace caller-to-userspace endpoint path carries the socket methods and result caps. |
| ABI and IPC risk | Higher. The existing ring/endpoint code accepts userspace SQEs and endpoint returns; it does not yet encode the kernel as an endpoint caller. The implementation must specify caller-session metadata, transfer rollback, service disappearance, cancellation, and result-cap insertion semantics for forwarded calls. | Lower kernel ABI risk. Schema can remain unchanged if the served cap implements the current socket interfaces. Risk shifts to manifest grant wiring, service startup ordering, endpoint lifetime, and making stale or missing service authority fail closed. |
| Production-surface compatibility | Preserves the literal “kernel-routed cap/network.rs surface” for the armed manifest, which minimizes caller-visible routing change. | Makes the production socket cap userspace-served for the armed manifest. Callers still use the same typed socket interfaces, but the authority source is the network-stack service rather than cap/network.rs. |
| Fail-closed behavior | Requires a bounded registration table and an explicit “no registered service” error path. Forwarding must not silently reach virtio_stub.rs except as a deliberate unarmed-manifest failure mode. | Uses manifest-level grant selection and service liveness checks. If the served cap is absent, stale, or not granted, the non-qemu manifest fails closed; virtio_stub.rs remains only a stale-call/fixture negative path. |
| Validation burden | Must prove forwarding with at least one socket operation, result-cap transfer through the forwarded path, service crash/cancel cleanup, missing-service failure, and no kernel-thread parking. Before Phase C exit, make run-net covered the old qemu-only socket path. | Must prove the armed manifest gets the served network cap, completes the socket round trip through the userspace service, rejects an unarmed or missing-service manifest, and preserves the existing inter-process socket-cap proofs. Phase C exit keeps only lower-layer QEMU virtio-net fixture coverage after the old kernel L4 owner is removed. |
| Phase C exit interaction | Leaves a new kernel forwarding subsystem after kernel smoltcp and virtio-net hot-path removal. Slice 8 would have had to decide whether that subsystem was permanent generic IPC infrastructure or Phase C-specific scaffolding. | Aligns with the Phase C exit direction that kernel networking becomes routing/grant setup while L4 service behavior lives out of kernel. Slice 8 removed kernel smoltcp/virtio-net L4 ownership and stale socket paths. |
Follow-up task-state changes after the selection:
cloud-prod-userspace-network-stack-smoltcp-local-prooflanded the local serve-from-userspace proof: manifest grant wiring, network-stack service startup/liveness, servedTcpListenAuthorityauthority, and the final socket round-trip proof. No kernel-originatedEndpointcall task is implied by this choice.cloud-prod-legacy-kernel-network-socket-path-retirementretires the armed-manifest route through the legacy kernel socket owner by rejecting non-qemukernelnetwork_manager/tcp_listen_authoritygrants, preserving userspace-served authority as the production path.cloud-prod-phase-c-kernel-smoltcp-virtio-net-removalis done. It removes the kernelsmoltcpdependency, retires the qemu-only kernel TCP/UDP runtime behind fail-closed socket entry points, and documents the remaining virtio-net coverage as lower-layer QEMU fixture evidence rather than production cloud socket ownership.network-socket-terminal-session-userspace-migrationis the existing follow-up for moving kernel-owned socket terminal byte handling out ofkernel/src/cap/network.rs.- Keep
cloud-prod-network-stack-dhcp-ipv4-config-local-proofas the done local DHCP/IPv4 config proof feedingcloud-prod-remote-session-web-ui-l4-local-proof, andcloud-gce-private-self-hosted-webui-proofwhile their remaining existing prerequisites land. Keep public ingress and TLS undercloud-gce-public-self-hosted-webui-ingress-tlsuntil separately authorized.
This decision unblocks 7c-ii(b) implementation. It preserves the fact that
7c-ii(a) was sequenced first because serving TcpListener/TcpSocket as
inter-process caps is useful under either architecture, and it now becomes the
direct serving substrate for the selected path.
Result-cap transfer needs no new ABI. The RETURN transfer-descriptor ABI
that lets a served method return a freshly-minted capability already exists: a
CAP_OP_RETURN carries xfer_cap_count transfer descriptors, the kernel inserts
them into the caller’s table and publishes CAP_CQE_TRANSFER_RESULT_CAPS
(kernel/src/cap/ring.rs::dispatch_return -> insert_prepared_transfer_caps ->
write_endpoint_return_result), and the client decodes the returned cap with
capos-rt’s CompletedCall::result_cap. The 7c-i UDP server already serves caps
over an exported Endpoint through exactly this RETURN path, so the
userspace-Endpoint-server side is proven; the kernel’s own
cap/network.rs::handle_accept (via insert_socket_result_cap) and the
telnet-gateway demo (listener.accept_wait; both since retired) proved the complementary
accept-returns-a-TcpSocket-cap shape and its client-consume side. What
7c-ii(a) adds is wiring – a userspace Endpoint server returning a TcpSocket
result-cap from TcpListener.accept – not a new ABI. Its hard part is the
smoltcp-pump-while-serving interleaving for a blocking multi-frame accept,
which is why 7c-iii deferred inter-process TCP serving.
Downstream self-hosted Web UI tasks consume slice 7 without making the Phase C exit cleanup a blocker for first GCE operator proof:
remote-session-self-served-full-ui-bundleis done and provides the reviewed boot-resource UI bundle.cloud-prod-remote-session-web-ui-l4-local-proofprovesremote-session-web-uiover the Phase C L4 path locally after the done DHCP/IPv4 configuration proof.cloud-gce-private-self-hosted-webui-proofproves the Web UI privately on GCE over the live NIC.cloud-gce-public-self-hosted-webui-ingress-tlsis the separate public ingress, TLS, and reviewed-auth posture step.
Network usability and post-smoltcp follow-ups are decomposed in
Network Usability and Post-smoltcp.
They do not change the Phase C critical path above: the local DHCP/IPv4
configuration proof is now done for the first GCE Web UI path, while the system
DnsResolver cap, POSIX getaddrinfo, ping/ping6 tools, packet tracing,
socket readiness policy, and transport tuning/status are follow-on usability or
diagnostics lanes.
Slices 7a-7c are smoltcp relocation: they run the selected smoltcp 0.13.0
build in userspace and preserve the socket contract, adding no new transport
mechanic. Transport policy/status — read-only transport status, keepalive/
timeout policy inputs, and the deferred congestion-control evaluation — is a
distinct control-plane lane decomposed in the backlog under
network-transport-policy-status-decomposition,
not part of the relocation slices. The userspace relocation track now has
landed UDP and TCP substrate proofs, and the TCP build still runs with
CongestionControl::None by build configuration; selecting Reno/CUBIC is a
build-feature flip, and any custom TCP mechanic requires separate workload
evidence and a reviewed task.
The Sustained-Receive Nic ABI (Former Prerequisite For 7c-iii)
7c-iii (TcpListener/TcpSocket) was blocked on the shape of the landed
Nic.receive. This section records the precise constraint, the
sustained-receive primitive that lifted it while keeping the settled DMA
isolation intact, and the ABI decision.
Status: landed. The Nic.receivePoll @4 method and the kernel-owned bounce
RX pool primitive designed below shipped in slice 7d
(cloud-prod-nic-driver-userspace-sustained-receive-pool-local-proof):
cap::virtio_net_userspace_rx_dma_proof::nic_receive_poll arms a pool of
NIC_RX_POOL_SIZE manager-owned bounce RX buffers and recycles them
individually (copy-out + scrub + slot-generation bump + re-post) with no
per-frame device reset; the receivePoll @4 dispatch arm lives in
cap::nic_grant_source_prod. The multi-frame proof
(make run-cloud-prod-nic-driver-userspace-sustained-receive-pool) drains more
than one frame with at least one non-resetting framePresent = false poll and
keeps the DMA-isolation assertions green (host_physical_user_visible = 0,
queue-address reads refused, quiesce + scrub at teardown). receive @1 is
unchanged. The rest of this section is the as-built design record.
The constraint, precisely (cite the real symbols)
The production Nic.receive @1 dispatches (under the frontier slice-6
cloud_virtio_net_userspace_clean_tx_rx_split_proof feature) to
virtio_net_userspace_rx_dma_proof::nic_receive_independent
(kernel/src/cap/virtio_net_userspace_rx_dma_proof.rs:1437, dispatched from the
receive @1 arm in kernel/src/cap/nic_grant_source_prod.rs:309). Each call:
- allocates one fresh kernel bounce frame (
frame::alloc_frame_zeroed), - authors one RX descriptor + avail entry pointing at it and rings the RX doorbell,
- waits on the driver’s owned RX interrupt route
(
device_interrupt::wait_kernel_injected_dispatch) and retires the deferred LAPIC EOI, - polls the RX used ring for one completion, copies the frame out, and
- frees that bounce frame (
frame::try_free_frame).
Two properties make this single-frame primitive unable to serve TCP:
- No pool stays armed between calls. Exactly one RX buffer is posted per
call and freed when the call returns; between
receivecalls the device has no posted RX buffer to master into. A frame that arrives outside a call has nowhere to land. - No non-resetting “no frame yet”. On a successful frame the independent
path keeps the device live (it does not reset – success arm at
virtio_net_userspace_rx_dma_proof.rs:1535), but an empty/timeout poll (RxDmaFailure::UsedRingPollExhausted) takes the error arm and quiesces the device (nic_quiesce_device:reset_device+ assert queues cleared + release pins). The predecessor proof pathdrive_rx_dma(virtio_net_userspace_rx_dma_proof.rs:331) resets on every outcome; the independent path narrowed that to reset-on-empty-poll, but the reset-on-empty remains. So a speculative “is there a frame?” poll with nothing waiting tears the device down.
smoltcp drives the opposite shape: its poll() loop calls the phy::Device
RX token speculatively and frequently, expecting “no frame yet” to be a cheap,
side-effect-free answer against a device that stays armed with multiple
posted buffers, so the asynchronous, multi-frame TCP exchange (SYN-ACK, data
segments, ACKs, retransmits arriving whenever the peer chooses, sometimes
several in quick succession) can be drained as frames arrive. That is why
7c-iii needed the since-landed sustained-receive ABI before the remaining
7c-ii(b) final-wiring task could stay on hold solely for the operator
architecture decision.
The reset-on-empty is not a bug: it is part of the settled DMA-isolation
model (docs/dma-isolation-design.md). Frames cross through manager-owned
bounce pages; userspace never sees a host-physical or device-usable address;
and after a buffer is reclaimed the device must be proven to have stopped
mastering it (“if in-flight DMA cannot be proven stopped, revocation escalates
to device reset”, docs/dma-isolation-design.md DMAPool Invariants ->
Reset). The single-frame path proves that the crude way – reset the whole
device. The design problem is to keep the isolation guarantee without
resetting the device on every empty poll or every reclaimed buffer.
The sustained-receive primitive
Keep the device armed with a kernel-owned bounce RX pool of N buffers and
recycle buffers individually instead of resetting the device:
- Arm. At first use (or at cap setup) the kernel allocates
Nmanager-owned bounce RX buffers from the driver’s grantedDMAPooland posts allNto the RX vring avail ring. The device masters only into these kernel-owned bounce pages; userspace still receives no host-physical or device-usable address (host_physical_user_visible = 0), exactly as the single-frame path proved. - Drain one arrived frame (per poll). The kernel reads the RX used ring. If
used.idxadvanced, a frame landed in bounce slotk; the kernel treats slotkas device-written untrusted input (docs/dma-isolation-design.md“Receive buffers are treated as device-written untrusted input until validated by the driver or stack”), copies the frame bytes out into the inlineDatareply bounded by the posted buffer length, then recycles slotk. - No frame yet. If
used.idxdid not advance, the call returns “no frame” with no reset and the device stays armed. This is the cheap speculative pollsmoltcpneeds. - Teardown.
on_release(and any unprovable-in-flight-DMA error) still quiesces:reset_device, assert both queues cleared, scrub the whole pool, release the enabled-vring pins, and drop the pool – identical to the existingnic_quiescediscipline. Reset remains the escalation path; it is simply no longer the per-frame path.
The per-buffer invariant that replaces “reset before reclaim”: a bounce
slot is re-exposed to the device only after its copy-out completes and its slot
ownership generation is bumped, with the slot scrubbed before the re-post. This
is the production handle-epoch slot identity (slot + slot_generation,
docs/dma-isolation-design.md Production Handle Epoch Invariants and
DMAPool Invariants: “Buffer operations additionally check the buffer slot and
slot generation before descriptor validation, completion accounting, free,
scrub, or reuse”) applied at buffer-recycle granularity instead of
device-reset granularity:
- the device wrote slot
kand signalled completion via the used ring (it is no longer mastering that slot – the per-buffer analogue of “in-flight DMA is proven stopped”); - the kernel copies the bytes out;
- the kernel scrubs slot
k(residual-state rule,docs/dma-isolation-design.mdResidual state) and bumps its slot generation, so a stale descriptor, free, or completion for the prior occupant fails closed; - only then does the kernel re-post slot
kto the avail ring (re-arm).
As built, the pool buffers are kernel-private frame::alloc_frame_zeroed pages
(never a userspace DMABuffer handle), so the slice does not go through the
single-frame device_dma begin_rx_dma_drive_pin drive-pin code path (that pin
guards a userspace-submitted DMABuffer’s live-unmapped re-validation, which
has no analogue for a manager-private frame). Instead the slice applies the
same per-buffer slot-identity discipline modeled on the production
handle-epoch slot identity (slot + slot_generation,
docs/dma-isolation-design.md) as kernel bookkeeping local to the pool: the
device masters only into kernel-owned pages, each slot is scrubbed and its
generation bumped before re-exposure, and teardown still quiesces (reset) before
reclaim. No new isolation backend, no new IOVA-export rule, no host-physical or
device-usable address exported (host_physical_user_visible = 0).
ABI decision: extend Nic with a non-resetting poll receive (option a)
Chosen: option (a) – add a non-resetting poll-receive method to the Nic
schema and keep receive @1 as the legacy single-shot:
interface Nic {
transmit @0 (frame :Data) -> (result, reason, sideEffect);
receive @1 () -> (frame :Data, observedEthertype :UInt16,
result, reason, sideEffect); # legacy single-shot
macAddress @2 () -> (addr :Data, result, reason, sideEffect);
linkStatus @3 () -> (up :Bool, result, reason, sideEffect);
# Sustained, non-resetting receive over the armed kernel-owned bounce RX
# pool. Returns the next arrived frame, or framePresent = false with no
# device reset when none has arrived. The device stays armed.
receivePoll @4 () -> (frame :Data, observedEthertype :UInt16,
framePresent :Bool,
result :Text, reason :Text, sideEffect :Text);
}
How receivePoll @4 reports its two outcomes without a reset:
| Outcome | framePresent | frame | result / reason / sideEffect |
|---|---|---|---|
| frame arrived | true | frame bytes inline | ok / frame-received / buffer-recycled (slot copied-out, generation bumped, re-posted) |
| no frame yet | false | empty | ok / no-frame / device-armed (no reset; pool still posted) |
| fail closed | false | empty | failed / <reason> / device quiesced (escalation only, not the empty-poll path) |
receive @1 semantics are unchanged (it still resets on empty poll), so the
7b UDP request/response proof stays green; only the new method is non-resetting.
Why not option (b) (in-stack sustained-receive loop against a kernel
primitive). Option (b) – the network-stack process drives a bounded
sustained-receive loop against a kernel primitive directly – was rejected
because it still requires the identical new kernel RX-pool machinery (the design
above) and drives it outside the typed Nic cap boundary, fragmenting the
device-facing authority the whole track funnels through one Nic cap (slices
4b-7). Option (a) keeps the network-stack process holding exactly the Nic
cap it already holds, maps the new method one-to-one onto the smoltcp
phy::Device RX token (“give me the next arrived frame or nothing”), and is the
direct extension of the inline-Data receive ABI the track already accepted.
The cost of (a) is bounded and already named in the abi hazard: one new schema
method, a make generated-code-check regeneration of the checked-in capnp
bindings, and updating every exhaustive Nic method match (receivePoll @4
arm in kernel/src/cap/nic_grant_source_prod.rs; the NicClient in
capos-rt). A receiveBatch returning List(Data) was considered and deferred:
the smoltcp RX token consumes one frame at a time, so single-frame poll is the
clean match and batching is a separately-justified later optimization.
Design Grounding
docs/proposals/network-reachable-datapath-scope-decision.md– the parent scope decision (Option A) that opened this track.docs/proposals/networking-proposal.md, Part 3 (Phase C) – the accepted decomposition, its prerequisites table and exit criteria, and theNicdraft this doc adopts (inlineData).docs/dma-isolation-design.mdS.11.2 – the DMA-isolation invariants and the driver-transition gate, already satisfied by the landed bounce / IOMMU-IOVA track that slice 2 reuses.kernel/src/cap/device_mmio.rs(notify_doorbell),kernel/src/cap/dma_buffer.rs(export labels),kernel/src/cap/virtio_net_polled_provider.rs(kernel-owned vring) – the primitives whose current limits define the cap-surface gap.
Real-Filesystem Decision: Role-Split, Not One Format
Decision
capOS does not adopt a single general-purpose on-disk filesystem. It adopts a role-split in which each storage role uses the format that fits it, behind the same capability interfaces:
- (A) capOS-managed data and state stays capnp-native. Evolve the existing
CAPOSWF1writable-filesystem andCAPOSST1persistent-store fixed layouts (kernel/src/cap/writable_fs.rs,kernel/src/cap/persistent_store.rs); do not replace them with a general-purpose format. These already have a crash-consistency proof in tree (make run-storage-writable-recovery), so a format swap would discard a tested durability story for no consumer benefit. - (B) Host-populated and interop images gain READ-ONLY FAT32. Add a
read-only FAT32
Directory/Filebacker over the existingBlockDevice, using thefatfsno_std crate. FAT32 is the one standard interop format with a maintained no_std read crate and zero licensing risk (the FAT long-name patents have expired;fatfsis MIT). It is already structurally part of the boot path – the EFI System Partition Limine reads is FAT32 (docs/backlog/hardware-boot-storage.md). - (C) Host tooling consolidates onto one capnp image tool. Retire the
per-format
tools/mkstorage-*.pybyte-offset scripts (each hand-encodes a fixed layout at literal offsets) in favor of one schema-driven image tool, so the on-disk layout has a single typed source of truth instead of N parallel offset hazards.
Why the Capability Layer Is Unchanged
The Directory, File, and Store interfaces in schema/capos.capnp are the
contract; the on-disk format lives below them as another CapObject backer, so
adding FAT32 adds no schema surface and no new caller-visible behavior. The
interfaces already model every operation a format backer must answer:
These kernel backers (readonly_fs.rs, writable_fs.rs, persistent_store.rs,
and the RAM file/directory/store/namespace caps) are proof/fixture
surface, not production storage routes – they are gated behind the qemu
feature (with storage_fat_read / cloud_*_over_nvme_proof variants) and fail
closed in the default production kernel. Production storage is userspace-served
by the demos/storage-fs-service, demos/storage-persist-service, and
demos/store-service services; see
Kernel Storage Cap Backers Are Fixtures.
The role-split below still governs which on-disk format sits beneath the cap
interfaces in those proofs and in any future userspace format backer.
Directory:open @0,list @1,mkdir @2,remove @3,sub @4,create @5,rename @6(schema/capos.capnp:1824).File:read @0,write @1,stat @2,truncate @3,sync @4,close @5(schema/capos.capnp:1793).Store:put @0,get @1,has @2,delete @3(schema/capos.capnp:1857).
A read-only backer answers the read/list/open/stat methods and fails closed on
every mutation, exactly as readonly_fs.rs does today
(kernel/src/cap/readonly_fs.rs:618 rejects mkdir/remove/sub/create/
rename). Attenuation is structural, not a rights bitmask: a read-only File is
a wrapper that rejects write/truncate/sync, per the schema comment at
schema/capos.capnp:1798.
Known caveat (partially lifted): stat/info timestamps were originally
stubbed to zero in every filesystem backer. The Slice 4 timestamp increments
lift this for the CAPOSWF1 writable filesystem only – it now persists real
created/modified timestamps in the node record, carries the corresponding
ClockProvenance label from the same WallClock source, and returns the
timestamp values from File.stat (proof make run-storage-writable). The
read-only CAPOSRO1 and persistent_store CAPOSST1 backers still expose
zero/unknown timestamp state, and FAT32 read can surface real FAT
directory-entry timestamps later; those remain named Slice 4 follow-ups.
Why Not ext4 / exFAT / littlefs / FAT-Write
- ext4-read: deferred under an explicit trigger. capOS reads no real
third-party filesystem today and does not need to for boot: Limine reads the
FAT32 ESP, the kernel image is
include_bytes!or read from ISO 9660 (kernel/src/iso/), and the cloud boot disk is a capOS-authored GPT + FAT-ESP, never a provider ext4 root. That collapses the usual “must read the provider’s ext4 root” argument. ext4-read is deferred behind a single explicit trigger: capOS must read a disk it did not format. Until that exists, ext4’s large read-only parser surface buys nothing. - ext4-write: rejected. It would be the first writable real-disk format and
has no crash-consistency story in tree; landing it without a recovery proof
regresses the durability bar
CAPOSWF1already meets. - exFAT: rejected. Patent surface, no role advantage over FAT32 for the host-interop slot.
- littlefs / SimpleFS: rejected. FFI plus vendoring cost with no winning role – managed state is already served by the capnp-native layouts, and host-interop wants a format the host actually writes (FAT32).
- FAT-write: rejected for now. No crash-consistency story; it would be the first writable format landing without a recovery proof. FAT32 stays read-only in this decision.
Decision Matrix
Axes: host-interop fit; no_std read/write implementation cost; crash-consistency story; capability/capnp fit; cloud-disk-read need today; licensing; available crates.
| Format | Host-interop | no_std read / write cost | Crash-consistency | capnp fit | Cloud-disk-read need | Licensing | Crates |
|---|---|---|---|---|---|---|---|
| FAT32 (read-only) | High (host writes it; ESP already FAT32) | Read: low (fatfs) / write: out of scope | n/a (read-only) | Backer below Directory/File | n/a (capOS authors its disks) | Clean (FAT patents expired; fatfs MIT) | fatfs no_std |
| exFAT | Medium | High / High | n/a | Same | n/a | Patent surface | None no_std mature |
| ext4-read | Low (no consumer today) | High (large parser) / — | n/a (read-only) | Same | None today (trigger only) | Clean | None mature no_std |
| ext4-write | Low | Very high / very high | None in tree | Same | None | Clean | None mature no_std |
| littlefs / SimpleFS | Low | Medium (FFI+vendor) / medium | Has its own story | Same | None | Clean | FFI/vendor |
capnp-native (CAPOSWF1/CAPOSST1) | None (capOS-only) | Already in tree | Proven (run-storage-writable-recovery) | Native | n/a | Clean | In tree |
Phased Plan
- Slice 0 (this doc). Record the role-split decision and the matrix.
- Slice 1 (landed 2026-06-02 20:59 UTC). Vendored
fatfs(withVENDORED_FROM.md,vendor/fatfs-no_std/) and added a read-only FAT32Directory/Filebacker over virtio-blk:kernel/src/cap/fat_fs.rs, aBlockStorageadapter over the virtio-blkBlockDevicedriving the vendoredfatfsread path. Host image built with realmkfs.fat+mcopy(2 files, one multi-cluster). Smokemake run-storage-fat-readreads the multi-cluster file back throughDirectory.open->File.readand asserts the bytes plus the fail-closed mutations. Grant-source realization deviation: the task text proposed a newfat_fs_rootKernelCapSource, butKernelCapSourceis aschema/capos.capnpenum (andcapos-configdecode) outside the task’swrite_scope. The backer is instead selected under a newstorage_fat_readkernel feature on the existingread_only_fs_rootsource – mirroring how that source already selects itsVirtiovs NVMe backend – so it needs no newKernelCapSourceand no schema change, keeping the conflict surface disjoint from the in-flight NVMe graduation (which editsreadonly_fs/writable_fs/persistent_store). Provenance map: FAT32 (read-only backer). Task record:cloud-prod-fat32-readonly-over-virtio-blockdevice-local-proof. - Slice 2 (landed 2026-06-03 01:44 UTC). FAT32 read over the NVMe
BlockDevicearm. Its prerequisite – the NVMe read-arm graduation (cloud-prod-nvme-storage-graduate-readarm-local-proof) – had landed, so the slice stacks on an always-built read arm rather than a per-proof feature: it added anNvmeBlockSourcevariant tofat_fs.rs(deferred mount viaFatMount, mirroringreadonly_fs’s NVMe arm) and proves a host-authoredmkfs.fatimage (the pre-populated NVMe medium content, no manager seed) read back over the graduated NVMe read arm behind the unchangedDirectory/Filecap contract. Selected by a new non-qemucloud_fat_read_over_nvme_prooffeature on the existingread_only_fs_rootsource (no newKernelCapSource, no schema change); its cap-waiterInterruptroute +provider-fat-read-over-nvmemarker come fromkernel/src/cap/fat_read_over_nvme_proof.rs. Because the FAT cluster-chain walk issues many single reads per boot, the proof raises the I/O queue depth to 64. Proof:make run-cloud-provider-fat-read-over-nvme. Task record:cloud-prod-fat32-readonly-over-nvme-blockdevice-local-proof. - Slice 3 (first increment landed 2026-06-03 03:36 UTC; second increment
landed 2026-06-03 04:08 UTC; third increment landed 2026-06-03 05:47 UTC;
fourth increment landed 2026-06-03 08:25 UTC; seeded installable writable
increment landed 2026-06-06 13:38 UTC at
ac0c5e2d; final fixture retirement; CAPOSST1 + empty/seeded co-located CAPOSWF1 + CAPOSRO1 + NVMe-writable CAPOSWF1). The host capnp image tool retired the hand-encoded capnp-layout Python fixtures one layout at a time. The first increment ported theCAPOSST1persistent-Storeimage producer from the retired byte-offset scripttools/mkstore-image.pyto a typed Rust host tool (tools/mkstore-image/, a standalone host crate built on the host target viacargo test-mkstore-image, liketools/mkmanifest/). Later increments added--writable,--readonly-fs,--writable-nvme, and seeded--writablemodes for the empty co-locatedCAPOSST1+CAPOSWF1image,CAPOSRO1read-only filesystem image, fixed-size (NVME_NAMESPACE_BLOCKS= 32768-block / 16 MiB) NVMe-writableCAPOSWF1namespace image, and installable-system seeded writable variants. The kernelCAPOSST1/CAPOSWF1/CAPOSRO1layouts (includingNVME_NAMESPACE_BLOCKS), theStore/Directory/Filecontracts, and the disk bytes the kernel reads are all unchanged: the earlier migration proved byte identity against the retired Python outputs, andcargo test-mkstore-imagenow pins the maintained Rust outputs with golden byte checks. The re-pointed reboot/recovery/read-only proofs stay green reading the tool-produced image. The host-authored FAT image path (tools/mkstorage-fat-read-image.py) stays on realmkfs.fat/mcopytooling — it is not a hand-rolled capnp byte-offset layout, so it is not a target for the typed capnp image tool. The Python capnp-layout builders have been retired; the Rust tool is the maintained capnp-native fixture path. - Slice 4 (decomposed; FAT and capnp-native increments landed in part). capnp-native
enhancements: real
stattimestamps and store compaction on the managed layouts. The first bounded increment landed – theCAPOSWF1writable filesystem now persistscreated/modifiedtimestamps in the node record’s reserved trailing bytes (no field moved, record stays 128 bytes, format version unchanged) and returns them fromFile.stat, sourced from theWallClocktimebase, with the on-disk layout and the forced-poweroff recovery proof held byte-stable (cloud-prod-fs-capnp-native-stat-timestamps-local-proof, proofsmake run-storage-writable/make run-storage-writable-recovery). The provenance increment threads the sameWallClocksource into the writable backer and uses the node-record provenance bytes to carry theClockProvenancelabel alongsidecreated/modified;File.statremains schema-stable and the local proof records the stored labels through the storage smoke log. The FAT increment now surfaces valid FAT directory-entrycreated/modifiedvalues from the host-authored read-only image through the same schema-stableFile.statfields over both virtio-blk and NVMe. The proof logs distinguishmetadata_provenance=fat-directory-entryfromCAPOSWF1’sWallClockprovenance and keep FAT’s timezone-free/two-second-modified-time limits explicit. The second bounded increment landedCAPOSST1persistent-Storecompaction: when a newputwould exhaust the entry table or data cursor and tombstones exist, the kernel rewrites live entries through a shadow generation before recommitting the canonical front generation;make run-storage-persistproves pre-compaction write, delete/tombstone, compaction-triggered write, reboot, post-reboot reads, and tombstone absence (storage-caposst1-store-compaction-local-proof). Remaining follow-ups: timestamps and timestamp provenance on the other managed/read-only layouts (CAPOSST1Store,CAPOSRO1). - Slice 5 (deferred). ext4-read, only once the explicit trigger (“must read a disk capOS did not format”) materializes.
Relationship to the NVMe Graduation
The NVMe BlockDevice graduation and real-FS work are stacked, not competing:
- The graduation sits below
BlockDevice– it moves the NVMe read/write/flush arms into always-built production behind fail-closed runtime probes (cloud-prod-nvme-storage-graduate-readarm-local-proof). - Real-FS sits above
BlockDevice– it adds newCapObjectbackers (fat_fs.rs) that read through whateverBlockDeviceprovides.
Slice 1 deliberately reads over virtio-blk and adds a new file, so its
conflict surface is disjoint from the graduation’s edits to the existing storage
modules. Slice 2 is the join point, sequenced after the graduation landed: it
consumes the always-built NVMe read arm (it does not modify it) by adding the
Nvme BlockSource arm to the same fat_fs.rs.
Design Grounding
kernel/src/cap/readonly_fs.rs– the read-onlyDirectory/FileoverBlockSourcepattern Slice 1 mirrors, including the fail-closed mutation arm.kernel/src/cap/writable_fs.rs,kernel/src/cap/persistent_store.rs– the capnp-native managed layouts (CAPOSWF1/CAPOSST1) the decision evolves rather than replaces.schema/capos.capnp– theDirectory/File/Storecontract the format backers serve.docs/backlog/hardware-boot-storage.md– the storage track and the FAT32 ESP/GPT boot-disk facts that collapse the ext4 argument.
Proposal: capos-service
Renamed from
libcapos-servicetocapos-serviceto keep the planned Rust framework crate name distinct from the C-substrate staticlib (libcapos.a, built from thelibcapos/crate). The two layers are unrelated:libcaposis the C ABI for C consumers, andcapos-serviceis the Rust framework userspace services link against.
Define a userspace service framework above capos-rt for long-running capOS
services. The library should provide common lifecycle, endpoint, readiness,
shutdown, context, metrics, and budgeting mechanics without adding a generic
kernel Service capability or a kernel-level phase machine.
Current State
Slice 1 is implemented. capos-service/ is a standalone no_std crate, not a
root workspace member, and depends on capos-rt without modifying the runtime.
It exposes a minimal ServiceMain/ServiceRuntime framework with ordered
initialize, dependency-wait, ready, run, drain, shutdown, and cleanup phases.
The first converted proof was demos/telnet-gateway: the gateway performed
CapSet validation and scoped listener setup through the lifecycle framework
and printed a capos-service readiness marker. That demo is since removed
with the kernel socket owner (make run-telnet is retired); the crate
currently has no in-tree consumer, and compile coverage comes from
make capos-service-check until the next service adopts the framework.
The initial crate deliberately does not add metrics, resource-budget hooks, endpoint serve-loop helpers, graceful handoff, or generic shutdown authority. Those remain later slices, grounded by Resource Accounting and Quotas and the error-boundary rules in Error Handling.
The immediate target is terminal/networking lifecycle: byte-stream terminal hosting, Telnet/TLS/SSH gateway plumbing, listener accept loops, shell launch, proxying, cleanup, and observable shutdown. HTTP/fetch services come later.
Problem
Current services duplicate the same shape:
- discover bootstrap caps;
- wait for dependencies;
- mark readiness through log output or implicit behavior;
- run accept or endpoint receive loops;
- spawn children or proxy byte streams;
- release result caps and temporary state;
- log or count failures;
- shut down after EOF, error, process exit, or supervisor request.
Duplicating that lifecycle is tolerable for proofs, but it is a poor foundation for production gateway, storage, agent, monitoring, and network services. Repeated hand-rolled loops are also where capability leaks, stuck children, incorrect close ordering, and hidden unbounded work appear.
Layering Decision
The stack remains:
schema/capos.capnp
stable authority-bearing interfaces
capos-rt
raw runtime and transport:
bootstrap, CapSet, ring client, typed handles, completion matching,
release flushing, exception decoding
capos-service
generic userspace service container:
lifecycle, endpoint loops, readiness, shutdown, background tasks,
metrics, context, resource hooks
domain libraries
HTTP/fetch, terminal host, storage, supervisor, agent tools
init/supervisors
compose services by passing capabilities, not global names
capos-service is not a new authority source. It wraps and narrows
capabilities the process already holds. The kernel still sees ordinary typed
capability calls and ordinary process lifecycle.
Core Surface
Initial framework pieces:
- Service lifecycle: initialize, dependency wait, ready, run, drain, shutdown, and final cleanup.
- Endpoint serve loops: generated or handwritten helpers for
RECV, decode, dispatch,RETURN, exception return, cancellation, and release. - Readiness handles: typed local handles or service-exported readiness caps, not global service names.
- Shutdown and drain: cancellable waits, child/process-handle cleanup, listener stop, in-flight request drain, bounded force-close.
- Background tasks: timers, periodic health checks, metrics export, and discovery loops with explicit cancellation.
- Request/session context: owned context object per request or session containing caller-session metadata, derived policy, resource reservations, transfer state, timing, and audit correlation.
- Metrics hooks: bounded counters and summaries; no unbounded per-user, per-cap-id, or per-method labels by default.
- Resource budgeting: reservation/donation hooks that call into the relevant ledger owner; the framework records what was reserved and releases it on every exit path.
- Error boundary: preserve the error-handling split from
error-handling-proposal.md: CQE status for transport/kernel dispatch failure,CapExceptionfor capability infrastructure failure, and schema result unions for normal domain outcomes. - Graceful handoff hooks: transfer or drain listeners, endpoint loops, child handles, background tasks, and in-flight request state during upgrade or supervisor-directed replacement. Handoff must be explicit; silent cloning of authority or abandoning in-flight work is a bug.
First Target: Terminal And Networking
The first useful slice should be:
TerminalSessionFromByteStream/ byte-stream terminal host.- Lifecycle wrapper around accept, session minting, proxying, and cleanup.
- Request/session context and metrics hooks.
- Network service container for listener-backed services.
- HTTP/fetch lifecycle only after terminal/networking proves the cleanup and authority model.
This ordering deliberately exercises the hard lifecycle edges before adding HTTP convenience: authenticated session creation, shell spawn, bidirectional byte proxying, EOF/close/error ordering, repeated connect/disconnect, and release of terminal/session/process result caps.
Authority Rules
- The framework must not accept ambient service names, raw global handles, or stringly typed service discovery.
- Hooks receive narrow capabilities, not ambient process authority.
- Request/session context is lifecycle-owned and cannot outlive the request/session that created it.
- Background tasks are budgeted, cancellable, and observable during shutdown.
- Retry policy must encode side-effect safety through idempotency, operation ids, or a domain-specific no-retry rule.
- Pool keys for reusable resources include every authority and identity field that changes policy: target, protocol, TLS identity, cap/object epoch, caller/session reference, namespace, tenant, and transformation policy.
- Cache keys must include tenant, session, and authority dimensions where those dimensions affect disclosure or correctness.
- Protocol parsers must drain or close before stream reuse.
- Readiness means the service can actually accept authorized work; config parse success is not enough.
- Shutdown must either drain, cancel, or explicitly transfer all in-flight work.
Non-Goals
- No generic kernel
Servicecapability. - No kernel callback registry or phase machine.
- No plugin ABI that passes
phase_idand bytes through a single generic cap. - No global service discovery namespace.
- No HTTP-first framework that delays terminal/networking lifecycle cleanup.
- No replacement for
capos-rttransport primitives.
Implementation Sequence
- Implemented: draft shared
ServiceMain/ServiceRuntimeshape for one process and convert the plaintext Telnet gateway to prove the lifecycle wrapper without changing its QEMU behavior. - Factor byte-stream terminal host lifecycle around
TerminalSessionFromByteStream. - Convert another focused terminal or gateway proof only after the byte-stream terminal host split is ready.
- Add request/session context and bounded metrics hooks.
- Add readiness and shutdown/drain helpers.
- Add endpoint serve-loop helpers that preserve typed schema authority.
- Add resource reservation/donation hooks.
- Consider HTTP/fetch domain library only after terminal/networking proofs pass.
Verification
Initial proof gates:
make docs
make run-terminal
make run-telnet or qemu-telnet-harness
focused close/reconnect proof
hidden password behavior remains byte-identical
child shell receives no raw network/spawn/listener authority
gateway cleanup releases terminal/session/process handles on EOF/error/shutdown
Later endpoint-helper gates should add targeted tests for exception return, result-cap release, cancellation, and resource rollback.
Related
- Service Architecture defines the
capability-based service composition, authority-at-spawn, and service graph
policy that
capos-serviceconsumers must respect; the framework wraps capabilities granted through that model rather than minting new authority. - Cloud Deployment describes the cloud VM
surface (provider storage/NIC drivers, cloud clocking, instance bootstrap)
that future
capos-servicelistener and gateway services will run on top of once the userspace DeviceMmio/DMAPool/Interrupt authority gate exists. - Pingora research records the framework precedent and rejects importing Pingora’s HTTP proxy model into the kernel.
- Telnet over TLS Shell and SSH Shell Gateway define the terminal factory and remote-ingress boundaries.
- Error Handling defines the three error layers that generated clients and service helpers must preserve.
- Resource Accounting and Quotas defines the ledger vocabulary for budgeting/donation hooks.
Proposal: Capability-Based Binaries, Language Support, and Compatibility Adapters
How userspace binaries receive, use, and compose capabilities, from the native Rust runtime through future language runtimes and compatibility adapters.
Current State
The init binary (init/src/main.rs) and smoke services are no_std Rust
binaries over capos-rt. The runtime owns _start, fixed heap initialization,
CapSet parsing, exit/cap_enter syscall wrappers, typed clients, result-cap
adoption, queued release flushing, and panic output. Init reads the BootPackage
manifest, validates the metadata-only service graph, spawns child services
through ProcessSpawner, waits on ProcessHandles, and exits. The former raw
bootstrap syscall and demo-support runtime shims are historical; demo support
now keeps only low-level transport helpers for intentionally malformed SQE/CQE
smokes.
Userspace now has a checked-in targets/x86_64-unknown-capos.json custom
target that exposes target_os = "capos" while preserving the current static
ELF, soft-float, no_std baseline. The kernel remains on the repository default
x86_64-unknown-none target. init, demos, shell, and the capos-rt
smoke binary build through custom-target Cargo aliases, and checked-in CUE
manifests embed userspace from target/x86_64-unknown-capos/release paths.
The remaining future work is hardening this target contract into a broader
toolchain and packaging interface rather than treating it as a probe.
The kernel-side roadmap provides the capability ring (SQ/CQ shared memory plus
cap_enter, implemented), scheduling, and IPC. This proposal covers the
userspace half: what binaries look like, how they are built, and how existing
software can be adapted to a system with no ambient authority.
Part 1: Native Userspace Runtime (capos-rt)
The Historical Problem
Before capos-rt, every userspace binary had to:
- Define
_startand a panic handler - Set up an allocator
- Construct raw syscall wrappers
- Manually serialize/deserialize capnp messages
- Know the syscall ABI (register layout, method IDs)
That was acceptable for one proof-of-concept binary. It does not scale to
dozens of services, and the current tree has moved those mechanics into
capos-rt.
Solution: A Userspace Runtime Crate
capos-rt is a no_std + alloc Rust crate that every native capOS binary
depends on. It provides:
1. Entry point and allocator setup.
#![allow(unused)]
fn main() {
use capos_rt::{Console, ConsoleClient, Runtime};
fn service_main(mut runtime: Runtime) -> i64 {
let console = match runtime.capset().get_typed::<Console>(b"console") {
Ok(cap) => cap,
Err(_) => return 1,
};
let mut ring = match runtime.ring_client() {
Ok(ring) => ring,
Err(_) => return 2,
};
let mut client = ConsoleClient::new(console);
match client.write_line_wait(&mut ring, "Hello from capOS", u64::MAX) {
Ok(()) => 0,
Err(_) => 3,
}
}
capos_rt::entry_point!(service_main);
}
2. Syscall layer. Raw syscall asm wrapped in safe Rust functions.
The entire syscall surface is 2 calls – new operations are SQE opcodes, not
new syscalls:
sys_exit(code)– terminate the current thread; the process exits when this was its last live thread (syscall 1)sys_cap_enter(min_complete, timeout_ns)– flush pending SQEs, then wait until N completions are available or the timeout expires (syscall 2)
The accepted in-process threading contract preserves this two-syscall surface:
thread exit is available through both the raw terminal syscall and the typed
ThreadControl.exitThread capability call.
Capability invocations go through the per-process SQ/CQ ring. capos-rt
provides helpers for writing SQEs and reading CQEs:
#![allow(unused)]
fn main() {
/// Submit a CALL SQE to the capability ring and wait for the CQE.
pub fn cap_call(
ring: &mut CapRing,
cap_id: u32,
method_id: u16,
params: &[u8],
result_buf: &mut [u8],
) -> Result<usize, CapError> {
ring.push_call_sqe(cap_id, method_id, params);
sys_cap_enter(1, u64::MAX);
ring.pop_cqe(result_buf)
}
}
3. Cap’n Proto integration. The current runtime uses handwritten typed
clients over schema-defined method ids and message shapes. Shared generated
schema bindings live through capos-config; broad generated client bindings
for capos-rt remain future work. The runtime owns transport lifetime and
completion matching, while each typed client owns its interface-specific
message encoding.
4. CapSet – the initial capability environment.
At spawn time, the kernel writes the process’s initial capabilities into the
read-only CapSet page and passes its address to _start. capos-rt parses
this into a typed lookup surface over name, local CapId, and interface id.
#![allow(unused)]
fn main() {
struct CapEntry {
cap_id: u32, // authority-bearing slot in the process CapTable
interface_id: u64, // Cap'n Proto interface TYPE_ID for type checking
}
impl CapSet {
/// Get a typed capability by manifest name.
pub fn get_typed<T: CapabilityType>(
&self,
name: &[u8],
) -> Result<Capability<T>, CapSetError> { ... }
/// Iterate manifest-order entries for diagnostics and shell inspection.
pub fn iter(&self) -> impl Iterator<Item = CapSetEntryRef> { ... }
}
}
interface_id is not a handle. It is metadata carrying the Cap’n Proto
TYPE_ID for the interface expected by the typed client. The handle is
cap_id. A typed client constructor must check that
entry.interface_id == T::TYPE_ID, then store the local CapId. Normal CALL
SQEs do not need to repeat the interface ID because each capability table entry
exposes one public interface. The ring SQE keeps fixed-size reserved padding
for ABI stability, not a required interface field for the system transport.
This matters for the system transport because several capabilities can expose
the same interface while representing different authority: a serial console, a
log-buffer console, and a console proxy all have the Console TYPE_ID, but
different CapId values.
Crate Structure
capos-rt/
Cargo.toml # no_std + alloc, depends on capnp
build.rs # userspace linker arguments
src/
lib.rs # type markers, owned handles, entry_point! macro
entry.rs # _start, Runtime, bootstrap validation
syscall.rs # raw asm syscall wrappers
capset.rs # CapSet lookup and iteration helpers
client.rs # handwritten typed clients
ring.rs # single-owner ring client and completion matching
alloc.rs # userspace heap allocator setup
capos-rt is NOT a workspace member (same as init/ – needs different
target/linker handling from the kernel). It’s a path dependency for userspace
crates.
Init On The Current Runtime
init/src/main.rs is already a capos-rt user. Its init_main(Runtime) entry
is registered with capos_rt::entry_point!, obtains typed bootstrap caps from
the runtime CapSet, reads the BootPackage manifest, validates the service graph,
resolves spawn grants, launches children through ProcessSpawnerClient, waits
on ProcessHandleClient, and reports failures through the Console client.
Part 2: Capability-Based Binary Model
Binary Format
ELF64, same as now. The kernel’s ELF loader (kernel/src/elf.rs) already
handles PT_LOAD segments. No changes to the binary format itself.
What changed from the early prototype to the current runtime baseline is the ABI contract between kernel and binary:
| Aspect | Historical prototype | Current capos-rt baseline |
|---|---|---|
| Entry point | crate-local _start(), no args | runtime-owned _start(ring_addr, pid, capset_addr) |
| Syscall ABI | ad-hoc (rax=0 write, rax=1 exit) | SQ/CQ ring + sys_cap_enter + sys_exit |
| Capability access | none | read-only CapSet page validated by capos-rt |
| Serialization | none | Cap’n Proto messages encoded by typed clients |
| Allocator | none or crate-local | runtime-owned fixed heap |
Initial Capability Passing
The kernel communicates bootstrap state through _start arguments and fixed
userspace mappings. The implemented shape is:
ring_addr: the process capability ring, expected to equalRING_VADDR.pid: the process identifier for diagnostics/runtime bookkeeping.capset_addr: read-only bootstrap CapSet page populated from the manifest and spawn grants.
Earlier options considered:
Option A: Well-known page. Kernel maps a read-only page at a fixed virtual
address (e.g., 0x1000) containing a capnp-serialized InitialCaps message:
struct InitialCaps {
entries @0 :List(InitialCapEntry);
}
struct InitialCapEntry {
name @0 :Text;
id @1 :UInt32;
interfaceId @2 :UInt64;
}
Option B: Register convention. Pass pointer and length in rdi/rsi at
entry. Simpler, but the data still needs to live somewhere in user memory.
Option C: Stack. Push the cap descriptor onto the user stack before iretq.
Similar to how Linux passes auxv to _start.
Option A is cleanest – the page is always there, no calling-convention dependency, and it naturally extends to passing additional boot info later.
Service Binary Lifecycle
1. Kernel loads ELF, creates address space, populates cap table
2. Kernel maps InitialCaps page at well-known address
3. Kernel enters userspace at _start
4. capos-rt _start:
a. Initialize heap allocator
b. Parse InitialCaps page into CapSet
c. Call user's main(CapSet)
5. User main:
a. Extract needed caps from CapSet
b. Do work (invoke caps, serve requests)
c. Optionally export caps to parent once ProcessHandle export lookup exists
6. On return from main (or sys_exit):
a. Kernel destroys process
b. All caps in process's cap table are dropped
c. Parent's ProcessHandle receives exit notification
Part 3: Language Support Roadmap
The current manual status page for this subject is Programming Languages. This proposal owns the longer roadmap and should not be read as implemented support for every language listed below.
Implemented Baseline: Rust (no_std + alloc)
Rust is the only implemented booted language path. Native services use
#![no_std], alloc, capos-rt, static ELF binaries, and the
targets/x86_64-unknown-capos.json userspace target. This fits the current
kernel because it does not require a libc, dynamic linker, process environment,
global filesystem, or ambient socket namespace.
Rust remains the default implementation language for core capOS services until the runtime, schema, and packaging contracts are stable. That is a project priority, not a rule that every future service must be written in Rust.
Future: Rust std
Rust std support is not implemented. It requires an operating-system backend
for filesystem, networking, threads, time, standard I/O, process, environment,
and synchronization APIs. On capOS those APIs must get authority from granted
capabilities such as Directory, File, TcpSocket, Timer,
ThreadSpawner, ThreadControl, ParkSpace, StdIO, and ProcessSpawner.
The project has not selected whether Rust std should be implemented directly
over native capOS capabilities, through a POSIX compatibility adapter, or in a
hybrid form. Until that decision is made, native no_std + alloc Rust over
capos-rt remains the supported Rust path.
C via libcapos
The C substrate is in tree at Phase 0. The libcapos/ crate compiles to
libcapos.a, a thin Rust staticlib that exposes the capos-rt syscall, ring
CALL, CapSet lookup, and global allocator under an extern "C" ABI. C
binaries link statically against the archive, share the userspace ELF layout
used by Rust demos, and run inside the existing capos-rt _start chain.
make run-c-hello boots a C main() that calls Console.writeLine,
Timer.now, EntropySource.fill, and VirtualMemory wrappers through
libcapos and exits cleanly. make run-c-pipe boots a second native C smoke
that creates a kernel pipe through the typed ProcessSpawner.createPipe
wrapper, writes and reads a marker through typed Pipe wrappers, closes the
writer, observes EOF, and exits cleanly.
The current substrate is intentionally narrow: capability primitives,
hand-written typed wrappers (capos_console_write_line, capos_timer_now,
capos_entropy_fill, the capos_virtual_memory_{map,unmap,protect} trio,
capos_process_spawner_create_pipe, and capos_pipe_{read,write,close}),
raw syscalls, and the heap shim. The Pipe wrapper is a typed bridge over the
existing transferred-result-cap path; it does not make capos_cap_call() a
general transfer ABI, which still refuses transfer-bearing completions with
CAPOS_E_TRANSFER_NOT_SUPPORTED. Anything POSIX-shaped (errno, fd table,
open/read/write, signals, fork/exec, sockets) belongs in the separate
libcapos-posix layer above libcapos.
Generated typed wrappers for the remaining capabilities (NetworkManager,
Endpoint, etc.), a stable C ABI for cap-transfer (today the v0 surface
refuses transfer-bearing completions with CAPOS_E_TRANSFER_NOT_SUPPORTED),
and per-thread runtime routing are also future work. Until that routing or a
POSIX pthread layer lands, libcapos v0 is fail-closed for C-created capOS
threads: capos_cap_call rejects bootstrap ThreadSpawner capabilities with
CAPOS_E_THREADING_UNSUPPORTED, and concurrent or re-entrant runtime borrows
return CAPOS_E_RUNTIME_BUSY.
The target libcapos shape is a static library providing:
#include <capos.h>
// Ring-based capability invocation (synchronous wrapper around SQ/CQ ring)
int cap_call(cap_ring_t *ring, uint32_t cap_id, uint16_t method_id,
const void *params, size_t params_len,
void *result, size_t result_len);
// Typed wrappers (generated from .capnp schema)
int console_write(cap_t console, const void *data, size_t len);
int console_write_line(cap_t console, const char *text);
// CapSet access
cap_t capset_get(const char *name);
uint64_t capset_interface_id(const char *name);
// Syscalls (the entire syscall surface -- 2 calls total)
_Noreturn void sys_exit(int code); // terminate current thread
uint32_t sys_cap_enter(uint32_t min_complete, // flush SQEs + wait
uint64_t timeout_ns);
Implementation: libcapos is Rust compiled to a static .a with a C ABI
(#[no_mangle] extern "C"). The capnp message construction happens in Rust
behind the C API. This avoids requiring a C capnp implementation.
C binaries would link against libcapos.a and use the same static userspace
ELF model as Rust binaries. Startup, allocator setup, CapSet access, and ring
submission should be owned by libcapos, not repeated in every C program.
Future: C++
C++ support waits on the C substrate and explicit ABI decisions: exceptions, RTTI, TLS, allocator behavior, unwind policy, static initialization, and the scope of any standard-library subset. A freestanding arena/container subset is plausible earlier than hosted C++.
The previously inspected pg83/std library remains a later experiment, not a
shortcut to full C++ support. Its low-level arena/container pieces are relevant;
its hosted/POSIX assumptions still require the same capOS adapter work as other
C++ libraries.
Future: Go (GOOS=capos)
Go is the next high-priority runtime after regular Rust. It needs in-process threading, futex-like wait/wake, TLS/runtime metadata support, GC integration, and a network poller mapped to capOS capabilities. See Go Runtime for the dedicated plan.
Go has higher priority than C++ because it unlocks CUE and a large practical tooling/runtime ecosystem. Go via WASI may be useful for CPU-bound CUE evaluation before native Go exists, but it is not a substitute for native Go network services or full runtime behavior.
Future: Python
Python is not implemented on booted capOS. It has three plausible paths:
- Native CPython through a POSIX compatibility adapter. This depends on the C/libc substrate plus file, stdio, timer, networking, and process adapters. It is the likely path for trusted system scripts and Python tools that need capOS storage or networking.
- MicroPython through the native C substrate. This is a smaller early scripting option with less runtime surface than CPython.
- WASI or Emscripten-hosted Python. This is useful for sandboxed or compute-oriented Python. It still runs a Python interpreter; WebAssembly is the sandbox and host ABI, not a way to avoid Python runtime work.
As of this review, upstream CPython support helps only the WebAssembly path:
PEP 11 lists
wasm32-unknown-wasip1 as Tier 2 and wasm32-unknown-emscripten as Tier 3,
and PEP 776 records Emscripten support
for Python 3.14. Those facts do not provide native capOS bindings for files,
sockets, threads, process launch, or capabilities.
Future: Lua
Lua is a future capability-scoped scripting runner. The dedicated
Lua Scripting proposal defines capos-lua as an
ordinary userspace process with exact grants, curated standard libraries,
unforgeable capability userdata, and no raw CapIds exposed to scripts. Upstream
PUC Lua is a C implementation, so the native path waits on the C/libcapos
substrate unless the project uses a pure-Rust Lua-like VM as a bootstrap proof.
Future: JavaScript / TypeScript
JavaScript support means running an engine as an ordinary capOS process. A small QuickJS-style native runner is the likely first experiment after C support. V8 or SpiderMonkey are much larger C++ runtime ports. TypeScript is normally compiled before execution and should not imply a kernel or base-system TypeScript compiler.
Partially landed: WASI and WebAssembly
The WASI host adapter Phase W.4 closed 2026-05-07 20:09 UTC
(docs/proposals/wasi-host-adapter-proposal.md,
docs/proposals/wasi-host-adapter-proposal.md). Languages that compile to WASI
Preview 1 can now run on capOS through the wasm-host process
(capos-wasm/, vendored wasmi 1.0.9), with imports backed by
granted capOS capabilities. The current Preview 1 surface covers
stdout/stderr writes, manifest-granted argv, bounded manifest-granted
environment entries through initConfig.init.wasiEnv,
monotonic clock time/resolution, no-op sched_yield, stdio fd
metadata, stdio seek refusal as ERRNO_SPIPE, clean shutdown, and
random_get when the manifest grants EntropySource. The regression
smokes are make run-wasi-hello-rust (Rust wasm32-wasip1 payload),
make run-wasi-hello-c (C wasm32-wasi payload),
make run-wasi-cli-args, make run-wasi-env, make run-wasi-random,
make run-wasi-random-ungranted, and make run-wasi-stdio-fd.
Filesystem (W.5), sockets (W.6), and Preview 2 / Component Model
(W.7+) remain future phases; make run-wasi-preview1-refusals
keeps proving representative blocked storage/socket imports return
ERRNO_NOSYS = 52 without authority.
Important distinction: WASI works differently for compiled vs. interpreted languages:
- Compiled languages (Rust, C) compile directly to
.wasm— no interpreter in the loop. WASI is a clean, efficient execution path. - Interpreted languages (Python, JS, Lua) still need their interpreter
(CPython, QuickJS, etc.) — it’s just compiled to
.wasminstead of native code. The stack becomes: script → interpreter.wasm → WASI runtime → kernel. You pay for a wasm sandbox layer on top of the interpreter you’d need anyway.
For interpreted languages, WASI sandboxing is valuable when running untrusted plugins or user-submitted scripts. For trusted system scripts, native CPython, QuickJS, or Lua over a POSIX or capability-native adapter may be simpler and faster once the native C substrate exists.
Future: Managed Runtimes
Languages with large managed runtimes such as Java and .NET need their runtime ported or a WASI-style host path. This is large effort and low priority.
Part 4: POSIX Compatibility Adapter
Status note: the full design lives in POSIX Adapter proposal and the implementation decomposition in POSIX Adapter, which are the canonical source for phase status. Phases P1.1 (libcapos C-substrate v0 + C hello smoke, closed
2026-05-05 13:28 UTC), P1.2 Phase A (UDP cap surface +capos-rtUdpSocketClient, closed2026-05-05 18:02 UTC), P1.2 Phase B (kernel UDP path,libcapos-posixcrate,dns.cvendoring, demo + manifest, closed2026-05-05 21:21 UTC), and P1.3 (Pipe cap + recording-shim fork-for-exec +posix_spawnsuccessor, closed2026-05-07 09:55 UTC) have landed. The remaining open phase is the dash port successor (Task 4). The Namespace + File cap surface from Storage and Naming proposal has landed far enough for the v0 smoke; current POSIX-adapter work is now dash vendoring/patching, the multi-translation-unit C build, and therun-posix-shell-smokeharness. The signal/time stub slice is closed bymake run-posix-signal-time. The sketch below remains for context; the dedicated proposal and plan are the source of truth for FdTable shape, supported-function matrix, and open questions.
Why POSIX at All?
capOS is not POSIX and doesn’t want to be. But:
-
Existing software. Most useful software assumes POSIX. A DNS resolver, an HTTP server, a database – all speak
open()/read()/write()/socket(). Without an adapter, every piece of software must be rewritten. -
Developer familiarity. Programmers know POSIX. A compatibility adapter lowers the barrier to writing capOS software, even if native caps are better.
-
Gradual migration. Port software first with POSIX-shaped APIs, then incrementally convert to native capabilities for tighter sandboxing.
The goal is not full POSIX compliance. It is a pragmatic adapter that maps selected POSIX concepts to capabilities so existing software can run with bounded modification while preserving capability-based authority.
Architecture: libcapos-posix
Application (C/Rust, uses POSIX APIs)
│
│ open(), read(), write(), socket(), ...
│
v
libcapos-posix (POSIX-to-capability adapter)
│
│ Maps fds to caps, paths to granted directory/namespace lookups
│
v
libcapos (native capability invocation)
│
│ SQ/CQ ring + cap_enter syscall
│
v
Kernel (capability dispatch)
libcapos-posix is a static library that provides POSIX-like function
signatures over granted capabilities. It is not an authority source and should
not be described as “Linux compatibility.” A process without file/directory
authority cannot open files; a process without socket authority cannot create
sockets; a process without launcher or spawner authority cannot create
children.
Current v0 surface (shipped as libcapos-posix.a alongside
libcapos.a; see libcapos-posix/ and the canonical
POSIX Adapter proposal):
- Static-array fd table with a 32-fd cap (P1.2 Phase A decision §5).
- Single-thread
__errno_location()TLS cell (P1.2 Phase A decision §4). socket(AF_INET, SOCK_DGRAM, 0)/sendto/recvfrom/closeover the kernelUdpSocketcapability (P1.2 Phase B).pipe/read/write/dup/dup2/closeover the kernelPipecapability viaProcessSpawner.createPipe(P1.3).fork/execve/waitpid/_exit/posix_inherit_stdiovia the recording-shimProcessSpawner.spawnMove-grant path (P1.3 §6 decision: Variant A).fork()returns 0 unconditionally and opens a TLS recording window;dup2()/close()between fork and execve record into the window;execve()drains the recording intostdio_<N>spawn grants and returns the synthetic child pid (a deliberate v0 deviation from POSIX).- Direct
posix_spawn/posix_spawn_file_actions_init/_destroy/_adddup2/_addcloseover the same Move-grant action-replay code path;argv/envpare accepted but ignored until aLaunchParameterssurface lands. open/read/write/close/lseekover the bootstrap rootDirectoryand mintedFilecaps;opendir/readdir/closedirover mintedDirectorycaps.- Console/Terminal stdio adoption, focused
printf/ string / ctype helpers, manifest-backedgetenv/setenv/putenv/unsetenv, and single-identitygetpid/getuid/getgidstubs. clock_gettime(CLOCK_MONOTONIC, ...)/gettimeofday(&tv, NULL)/time/nanosleep/sleepover the kernelTimercapability.signal/sigactionstore handlers without delivery;killandraisefail closed until typed process-control authority exists.
C headers ship under libcapos-posix/include/capos/posix/ (errno.h,
dirent.h, fcntl.h, signal.h, spawn.h, stdio.h, stdlib.h,
string.h, sys/socket.h, sys/wait.h, time.h, unistd.h, and focused
subsets such as ctype.h). libcapos-posix reuses libcapos’s installed
Runtime through the renamed extern crate libcapos_::runtime::with(...) to
avoid colliding with libcapos’s C-side capos_* exports.
Not yet implemented for the dash-port successor: file metadata/remove
calls such as stat / fstat / access / unlink, TCP socket wrappers,
select / poll / epoll, real asynchronous signal delivery, job control,
chdir / cwd-relative path resolution, and broad FILE * stream semantics.
These remain on the dash port successor track (Task 4 of
docs/proposals/posix-adapter-proposal.md) or later typed-authority work.
File Descriptor Table
POSIX programs think in file descriptors. capOS has capabilities. The
implemented v0 translation is a fixed 32-slot per-process fd table inside
libcapos-posix. Slots may be backed by Console, UDP socket, Pipe, File,
Directory, TerminalSession, or a moved-out sentinel used by the recording-shim
execve() path.
Fd 0/1/2 are initialized only from explicit authority:
stdio_<N>Pipe grants seeded by a parent spawn action take precedence.- A bootstrap
TerminalSessioncap may adopt empty stdio slots when the program callsposix_inherit_stdio(). - A bootstrap
Consolecap fills empty fd 1 and fd 2 for simple smokes. - Fd 0 stays closed unless the process received pipe or terminal input authority.
Path Resolution
POSIX open("/etc/config.toml", O_RDONLY) becomes:
libcapos-posixlooks up the bootstrap-granted rootDirectorycap namedroot.- It rejects relative paths,
.., and non-UTF-8 or oversized path segments. - It walks intermediate components with
Directory.sub(). - It opens the leaf with
Directory.open()orDirectory.sub(). - It installs a File or Directory fd slot with per-fd position / iteration state.
The future Namespace + Store resolver remains documented in the POSIX adapter
proposal, but the shipped v0 dash-port proof uses the RAM-backed root
Directory capability because that is the implemented kernel authority.
Supported POSIX Functions
Grouped by what capability backs them:
Console cap -> stdio:
| POSIX | capOS translation |
|---|---|
write(1, buf, len) | console.write(buf[..len]) |
write(2, buf, len) | console.write(buf[..len]) (or log cap) |
read(0, buf, len) | Pipe or TerminalSession-backed stdin when granted |
Directory + File caps -> file I/O:
| POSIX | capOS translation |
|---|---|
open(path, flags) | root Directory walk -> Directory.open() -> fd |
read(fd, buf, len) | File.read(offset, len) using per-fd position |
write(fd, buf, len) | File.write(offset, bytes) using per-fd position |
close(fd) | drop/release the backing cap slot |
lseek(fd, off, whence) | update per-fd file position |
opendir/readdir/closedir | Directory.list() plus per-fd iteration |
Pipe + ProcessSpawner caps -> subprocess I/O:
| POSIX | capOS translation |
|---|---|
pipe(fds) | ProcessSpawner.createPipe() -> two Pipe-backed fds |
fork() + execve() | recording shim -> ProcessSpawner.spawn() |
posix_spawn() | direct action replay -> ProcessSpawner.spawn() |
waitpid(pid, &status, 0) | ProcessHandle.wait() |
UdpSocket caps -> networking:
| POSIX | capOS translation |
|---|---|
socket(AF_INET, SOCK_DGRAM, 0) | NetworkManager.createUdpSocket() -> fd |
sendto / recvfrom | UdpSocket.sendTo() / UdpSocket.recvFrom() |
close(fd) | release the owned UdpSocket cap |
Timer + local stubs:
| POSIX | capOS translation |
|---|---|
clock_gettime / gettimeofday / time | Timer.now() |
nanosleep / sleep | Timer.sleep() |
signal / sigaction | store handler locally, never deliver |
kill / raise | validate signal number, then fail closed |
Not supported or still partial:
| POSIX | Why not |
|---|---|
bare fork() state cloning | No address space cloning; only fork-for-exec is recorded |
in-place exec() replacement | Spawn creates a fresh process |
| real signal delivery / job control | Needs typed process-control and terminal authority |
chmod/chown | No permission bits. Authority is structural |
mmap(MAP_SHARED) | No shared memory yet (future: SharedMemory cap) |
ioctl | No device files. Use typed capability methods |
ptrace | No debugging interface yet |
select/poll/epoll | Requires async cap invocation (Stage 5+). Initial version is blocking only |
Process Creation Compatibility
capOS process creation is spawn-style, not fork/exec-style. A new process is a
fresh ELF instance selected by ProcessSpawner, with an explicit initial
CapSet assembled from granted capabilities. The parent address space is not
cloned, and an existing process image is not replaced in place.
posix_spawn() is the compatibility primitive for subprocess creation.
libcapos-posix (P1.3, closed 2026-05-07 09:55 UTC) maps it to
ProcessSpawner.spawn(), translates posix_spawn_file_actions into
fd-table setup and Move-grant stdio_<N> capability grants on the
spawn ABI. argv / envp are accepted but ignored until a
LaunchParameters surface lands. make run-posix-spawn-smoke is the
end-to-end proof.
Full fork() is intentionally not a native kernel primitive. Supporting it
would require copy-on-write address-space cloning, parent/child register return
semantics, fd-table duplication, a per-capability inheritance policy, safe
handling for outstanding SQEs/CQEs, and defined behavior for endpoint calls,
timers, waits, and process handles that are in flight at the fork point.
Threaded POSIX processes add another constraint: only the calling thread is
cloned, while locks and async-signal-safe state must remain coherent in the
child.
P1.3 also shipped a narrow recording-shim fork() for the common
fork-for-exec pattern that does not require general address-space
cloning. fork() returns 0 unconditionally and opens a TLS recording
window; dup2() / close() between fork and execve record into the
window without mutating the parent fd table; execve() drains the
recording into Move-grant stdio_<N> spawn grants and returns the
synthetic child pid as its own return value. The pseudo-child branch
is still the parent process, so a failed execve() MUST NOT call
_exit() – it must surface the error to the parent’s normal error
path. The user pattern is pid_t child = fork(); if (child == 0) { dup2(); close(); child = execve(...); } /* parent flow */. Earlier
iterations used x86_64 setjmp/longjmp to fake fork-return-twice;
that was replaced because longjmp back into fork()’s already-
returned stack frame was undefined behaviour. make run-posix-pipe-smoke
is the end-to-end proof.
make run-posix-dns-smoke exercises socket(AF_INET, SOCK_DGRAM, 0) /
sendto / recvfrom against the kernel UdpSocket capability through
a hand-rolled DNS A query in demos/posix-dns-resolver/. The current
smoke does not compile the vendored dns.c whole because the v0
libcapos-posix POSIX surface is narrower than dns.c expects
(poll.h, netinet/in.h, arpa/inet.h, netdb.h, sys/select.h,
sys/un.h); widening that surface is follow-on work on the dash port
track.
Security Model
The POSIX compatibility adapter does not weaken capability security. Every POSIX call translates to a capability invocation on caps the process was actually granted:
open("/etc/passwd")fails if the process lacks a bootstraprootDirectorycap or that directory tree does not containetc/passwd– not because of permission bits, but because no granted authority resolves the path.socket(AF_INET, SOCK_DGRAM, 0)fails if the process was not granted aNetworkManagercap; TCP stream wrappers remain future work.fork()only opens the recording window for the supported fork-for-exec pattern; bare address-space cloning remains unsupported.
A POSIX binary on capOS is more constrained than on Linux, not less. The compatibility adapter provides familiar function signatures, not familiar authority.
Building POSIX-Compatible Binaries
my-app/
Cargo.toml # depends on capos-posix (which depends on capos-rt)
src/main.rs # uses libc-style APIs
Or for C:
#include <capos/posix/fcntl.h> // open, O_RDONLY
#include <capos/posix/sys/socket.h> // socket, sendto, recvfrom
#include <capos/posix/unistd.h> // read, write, close
int main() {
// Works -- stdout is mapped to Console cap
write(1, "hello\n", 6);
// Works -- if the process was granted a root Directory cap
int fd = open("/config.toml", O_RDONLY);
char buf[4096];
ssize_t n = read(fd, buf, sizeof(buf));
close(fd);
// Works -- if NetworkManager cap was granted; TCP is not in v0
int sock = socket(AF_INET, SOCK_DGRAM, 0);
close(sock);
}
The linker pulls in libcapos-posix.a -> libcapos.a -> startup code.
Same ELF output, same kernel loader.
musl as a Base (Optional, Later)
For broader C compatibility (printf, string functions, math), libcapos-posix
can be layered under musl libc. musl has a clean
syscall interface – all system calls go through a single __syscall() function.
Replacing that function with capability-based dispatch gives you full libc on
top of capOS capabilities:
// musl's syscall entry point -- we replace this
long __syscall(long n, ...) {
switch (n) {
case SYS_write: return capos_write(fd, buf, len);
case SYS_open: return capos_open(path, flags, mode);
case SYS_socket: return capos_socket(domain, type, protocol);
// ...
default: return -ENOSYS;
}
}
This is the same approach Fuchsia uses with fdio + musl, and Redox OS uses
with relibc. It works and it gives you printf, fopen, getaddrinfo, and
most of the C standard library.
Priority: after native capos-rt and libcapos are stable. musl integration is a significant engineering effort and should only be done when there’s actual software to port.
Part 5: WASI Host Adapter
Note: the full design lives in WASI Host Adapter proposal and the implementation decomposition in WASI Host Adapter. The sketch below remains for context; the dedicated proposal is the source of truth for runtime selection (wasmi for v0; wasmtime / WAMR as W.7+ migration), capability-mapping surface, per-instance CapSet plumbing, phase decomposition, and open questions.
Why WASI Fits capOS Better Than POSIX
WASI (WebAssembly System Interface) was designed from the start as a capability-based system interface. Its concepts map almost directly to capOS:
| WASI concept | capOS equivalent |
|---|---|
fd (pre-opened directory) | Namespace cap |
fd (socket) | TcpSocket/UdpSocket cap |
fd_write on stdout | Console.write() |
| Pre-opened dirs at startup | CapSet at spawn |
| No ambient filesystem access | No ambient authority |
path_open scoped to pre-opened dir | namespace.resolve() scoped to granted prefix |
WASI programs already assume they get no ambient authority. A WASI binary compiled for capOS still needs a host adapter, but the security model is closer to capOS than POSIX because preopened handles are explicit.
Architecture: Wasm Runtime as a capOS Service
WASI binary (.wasm)
│
│ WASI syscalls (fd_read, fd_write, path_open, ...)
│
v
wasm-runtime process (Wasmtime/wasm-micro-runtime, native capOS binary)
│
│ Translates WASI calls to capability invocations
│ Each wasm instance gets its own CapSet
│
v
libcapos (native capability invocation)
│
v
Kernel
The wasm runtime is itself a native capOS process. It receives caps from its parent and partitions them among the wasm modules it hosts. This gives you:
- Language independence. Any language with a useful WASI target can be evaluated through the same host adapter.
- Extra sandboxing. Wasm memory isolation combines with capOS capability scoping.
- Less porting effort for software that already targets WASI, assuming its required imports are implemented by the host adapter.
- Density. Multiple wasm modules in one process, each with different caps
WASI vs Native Performance
Wasm adds overhead: bounds-checked memory, indirect calls, and host-call marshalling. For foundational system services, native Rust remains the default choice until there is a concrete reason to choose otherwise. For application code and portable tools, the sandboxing and reuse may be worth the overhead.
WASI Implementation Phases
The current shipped state is owned by WASI Host Adapter and WASI Host Adapter proposal; the phase status summary below is a pointer, not the source of truth.
Phase W.0 (planning, closed): runtime decision recorded as wasmi
for v0; WAMR / wasmtime are W.7+ migration candidates. The earlier
“wasm-micro-runtime as a C binary via libcapos” sketch is superseded
by wasmi-as-a-Rust-crate inside the standalone capos-wasm/ package.
Cross-cutting Open Questions §1 (per-instance vs per-process) and §3
(poll_oneoff semantics over the capOS ring) resolved
2026-05-13 16:46 UTC: one wasm instance per capos-wasm process,
and poll_oneoff stays ERRNO_NOSYS in v0 with subscription kinds
extended one at a time through W.5/W.6 against a single blocking
cap_enter.
Phase W.1 (host scaffold, closed 2026-05-05 19:12 UTC):
capos-wasm/ standalone userspace crate over vendored wasmi 1.0.9
(vendor/wasmi-no_std/wasmi-1.0.9/); make capos-wasm-build.
Phase W.2 (Preview 1 stdout-only, closed 2026-05-07 10:53 UTC):
wasm-host userspace binary, empty-instantiation smoke
(make run-wasm-host), Preview 1 stdout-only import resolver
(args_get / environ_get empty, clock_time_get(MONOTONIC),
proc_exit, fd_write(1, …) / fd_write(2, …); everything else
including random_get returns ERRNO_NOSYS), manifest-payload load
path through an optional BootPackage cap, Rust hello, wasi
(make run-wasi-hello-rust), and C hello, wasi
(make run-wasi-hello-c).
Phase W.3 (per-instance argv grant, closed
2026-05-07 18:25 UTC): bounded initConfig.init.wasiArgs text
grant on top of the existing manifest CapSet, validated against
WASI_ARGS_MAX_COUNT = 32, WASI_ARGS_MAX_ARG_BYTES = 4096, and
WASI_ARGS_MAX_TOTAL_BYTES = 8192. The wasm-host installs the bundle
on HostState before instantiation, and Preview 1 args_get /
args_sizes_get reflect it. make run-wasi-cli-args is the
end-to-end proof. A 2026-05-13 follow-up adds the same bounded-text
shape for initConfig.init.wasiEnv (WASI_ENV_MAX_COUNT = 32,
WASI_ENV_MAX_ENTRY_BYTES = 4096, WASI_ENV_MAX_TOTAL_BYTES = 8192)
with make run-wasi-env and make wasi-env-negative-check.
Phase W.4 (random_get production + clocks production-ready,
closed 2026-05-07 20:09 UTC): Preview 1 random_get routed
through the kernel EntropySource cap when the manifest grants it,
chunked at the cap’s MAX_ENTROPY_FILL_BYTES = 64 ceiling and capped
per Preview 1 invocation at RANDOM_GET_MAX_BYTES = 65_536 bytes;
ungranted variant refuses with ERRNO_NOSYS = 52.
make run-wasi-random and make run-wasi-random-ungranted are the
granted/ungranted proofs. clock_time_get(CLOCKID_REALTIME) keeps
returning ERRNO_NOSYS until a typed WallClock cap exists.
A 2026-05-13 compatibility-import slice promotes authority-free
Preview 1 imports (clock_res_get(MONOTONIC), sched_yield, stdio
fd_fdstat_get metadata, stdio fd_seek returning ERRNO_SPIPE)
through make run-wasi-stdio-fd. make run-wasi-preview1-refusals
keeps representative blocked storage and socket imports failed closed
with ERRNO_NOSYS = 52.
Phase W.5 (filesystem against Namespace / File / Store,
blocked): waits on the storage cap surface from
Storage and Naming proposal. Until
then, make run-wasi-preview1-refusals is the refusal evidence.
Phase W.6 (sockets against TcpSocket / UdpSocket, blocked):
waits on a userspace network stack process (or an interim
Fetch / HttpEndpoint shim) from
Networking proposal. Same refusal evidence
as W.5 in the interim.
Phase W.7 (Preview 2 / Component Model + wasmtime migration,
blocked): waits on the std-userspace decision (same blocker as
the capnp-rpc remote-session rewrite). When it lands, WIT resources
map to typed OwnedCapability<T> slots in the host adapter and the
schema gains the Component Model resource bridging variants.
Phase W.8 (TinyGo / Go-on-WASI CUE evaluator, blocked): waits on
the same std-userspace decision; native GOOS=capos remains the path
for full Go runtime semantics.
Part 6: Putting It All Together – Porting Strategy
Spectrum of Integration
Most native Most compatible
| |
v v
Native Rust C with libcapos POSIX adapter WASI binary
(capos-rt) (typed caps) (libcapos-posix) (wasm runtime)
- Best perf - Good perf - Familiar API - Any language
- Full cap - Full cap - Auto sandboxing - Auto sandboxing
control control via cap scoping via wasm + caps
- Most work - Moderate work - Less rewrite - Less rewrite
to write to write for existing C for WASI targets
Example: Porting a DNS Resolver
Native Rust: Rewrite using capos-rt. Receives UdpSocket cap, serves
DNS lookups as a DnsResolver capability. Other processes get a
DnsResolver cap instead of calling getaddrinfo(). Clean, typed, minimal
authority.
C with POSIX adapter: Take an existing DNS resolver (e.g., musl’s
getaddrinfo implementation or a standalone resolver). Compile against
libcapos-posix. Give it a UdpSocket cap and a Namespace cap for
/etc/resolv.conf. It calls socket(), sendto(), recvfrom() – all
translated to cap invocations. Works with minimal changes, but can’t export
a typed DnsResolver cap (it speaks POSIX, not caps).
WASI: Compile a Rust DNS resolver to WASI. Run it in the wasm runtime. Same capability scoping, but through the wasm sandbox.
Recommended Approach for capOS
-
Foundational services: native Rust by default. Drivers, network stack, store, and init are the foundation and should use capabilities natively unless a concrete reviewed reason justifies another runtime.
-
First applications: native Rust. While the ecosystem is young, applications should use
capos-rtdirectly. This validates the cap model. -
C compatibility: when porting specific software. Do not build the POSIX adapter speculatively. Build it when there is a specific C program to port (e.g., a DNS resolver, an HTTP server, a database). Let real porting needs drive which POSIX functions to implement.
-
WASI: as the general-purpose application runtime. Once the native runtime is stable, the wasm runtime becomes the “run anything” answer. Lower priority than native Rust, but higher priority than full POSIX/musl compat, because WASI’s capability model is a natural fit.
Part 7: Schema Extensions
New schema types needed for the userspace runtime:
# Extend schema/capos.capnp
struct InitialCaps {
entries @0 :List(InitialCapEntry);
}
struct InitialCapEntry {
name @0 :Text;
id @1 :UInt32;
interfaceId @2 :UInt64;
}
interface ProcessSpawner {
spawn @0 (name :Text, binaryName :Text, grants :List(CapGrant)) -> (handleIndex :UInt16);
}
struct CapGrant {
name @0 :Text;
capId @1 :UInt32;
interfaceId @2 :UInt64;
}
interface ProcessHandle {
wait @0 () -> (exitCode :Int64);
}
These definitions now live in schema/capos.capnp as the single source of
truth. spawn() returns the ProcessHandle through the ring result-cap list;
handleIndex identifies that transferred cap in the completion. The first
slice passes a boot-package binaryName instead of raw ELF bytes so spawn
requests stay inside the bounded ring parameter buffer; manifest-byte exposure
and bulk-buffer spawning remain later work. kill, post-spawn grants, and
exported-cap lookup are deferred until their lifecycle semantics are
implemented.
Implementation Status And Future Phases
Implemented Baseline: capos-rt
capos-rt/exists as a standaloneno_std + allocruntime crate.capos-rtowns_start, heap initialization, panic output, raw syscall wrappers, bootstrap validation, CapSet parsing, the entry-point macro, the single-owner ring client, typed clients, result-cap adoption, and owned handle release.init/,shell/,demos/, and the runtime smoke binary build fortargets/x86_64-unknown-capos.json.- QEMU proofs cover typed Console calls, exception decoding, spawn/wait, runtime VirtualMemory, Timer, ThreadControl, ThreadSpawner, ThreadHandle, terminal sessions, and release behavior.
Deliverable: completed. See Userspace Runtime and Programming Languages for current validation.
Future Phase: broader generated/native clients
- Add generated clients after the schema surface stabilizes.
- Preserve the existing split where
capos-rtowns transport lifetime and interface-specific wrappers own message encoding. - Establish the out-of-tree service-binary packaging pattern once the internal userspace target contract is stable.
Deliverable: ordinary native capOS services can depend on generated typed clients without copying runtime transport logic.
libcapos for C – Phase 0 closed
extern "C"API exposingcapos_cap_call,capos_capset_get,capos_sys_exit,capos_sys_cap_enter,capos_console_write_line,capos_timer_now,capos_entropy_fill,capos_virtual_memory_*,capos_process_spawner_create_pipe,capos_pipe_read,capos_pipe_write,capos_pipe_close, andmalloc/free/calloc/reallocheap shims over the capos-rt global allocator.- Public header at
libcapos/include/capos/capos.h. - Build system:
make libcaposproduceslibcapos/target/x86_64-unknown-capos/release/libcapos.a;make c-helloandmake c-pipelink native C smokes with clang + lld using the shareddemos/linker.ld. - C “hello world” smoke at
demos/c-hello/main.ccallsConsole.writeLinethroughcapos_console_write_line, exercises Timer, EntropySource, and VirtualMemory typed wrappers, verifiescapos_cap_callrejects a bootstrapThreadSpawnercap locally, and exits cleanly.make run-c-hellobootssystem-c-hello.cueand the smoke greps for the[c-hello] hello from c-hello, entropy, VM, and ThreadSpawner rejection markers plus the kernelprocess N exited with code 0line. - Native C pipe smoke at
demos/c-pipe/main.cusescapos_process_spawner_create_pipe, writes and readsnative-c-pipe-markerthrough typed Pipe wrappers, closes the write end, observes EOF, and exits cleanly.make run-c-pipebootssystem-c-pipe.cueand checks the create, read, EOF, and clean-exit markers.
Deliverable: complete – C binary boots, calls Console.writeLine, and
exits cleanly through capos_sys_exit.
Deferred to later libcapos phases: generated typed wrappers per
interface, transferred result-cap propagation across the C ABI,
per-thread routing of the runtime ring, and a libcapos-posix layer.
Future Phase: POSIX compatibility adapter
- Implement FdTable and path resolution
- Start with file I/O (open/read/write/close over Namespace + Store)
- Add socket wrappers when networking is userspace
- Optionally integrate musl for full libc
Deliverable: an existing C program (e.g., a simple HTTP server) runs on capOS with minimal source changes.
WASI runtime (partially landed)
The WASI host adapter is its own track owned by
docs/proposals/wasi-host-adapter-proposal.md and
docs/proposals/wasi-host-adapter-proposal.md. Phase decomposition:
- W.1 (host scaffold; landed
2026-05-05 19:12 UTC):capos-wasm/standalone crate over vendored wasmi 1.0.9 (vendor/wasmi-no_std/wasmi-1.0.9/),make capos-wasm-build. - W.2 (Preview 1 stdout-only; closed
2026-05-07 10:53 UTC): wasm-host userspace binary,make run-wasm-hostempty-instantiation smoke, Preview 1 stdout-only import resolver, manifest-payload load path, Rusthello, wasismoke (make run-wasi-hello-rust), and Chello, wasismoke (make run-wasi-hello-c). Capabilities backing the host imports today: Console + Timer + BootPackage. v0 chose wasmi-as-Rust-crate overwasm-micro-runtime-as-C-binary; wasmtime / WAMR remain W.7+ migration candidates. - W.3 (per-instance CapSet plumbing + LaunchParameters) closed
2026-05-07 18:25 UTC. - W.4 (
random_getagainst the in-treeEntropySourcecap, plus clocks production-ready) closed2026-05-07 20:09 UTC. - 2026-05-13 compatibility/refusal smokes:
make run-wasi-stdio-fdproves promoted authority-free imports no longer returnERRNO_NOSYS;make run-wasi-preview1-refusalskeeps storage and socket imports failed closed without authority. - W.5 (filesystem against
Namespace/File/Store), W.6 (sockets againstTcpSocket/UdpSocket), and W.7+ (Preview 2 / Component Model) remain future phases.
Deliverable status: hello.wasm runs on capOS today (both Rust
and C payloads), argv and entropy grants are implemented, and
authority-free stdio fd compatibility imports are covered by a direct
smoke. Filesystem/socket phases are queued behind their authority
surfaces.
Open Questions
-
Allocator strategy. Should the userspace heap be a fixed-size region (simple, but limits memory), or should it grow by invoking a FrameAllocator cap (flexible, but every allocation might syscall)? Likely answer: fixed initial region + grow-on-demand via cap.
-
Async I/O. The SQ/CQ ring is inherently asynchronous (submit SQEs, poll CQEs), but the initial
capos-rtwrappers provide blocking convenience (submit one CALL SQE +cap_enter(1, MAX)). Real services need batched async patterns. Options:- Submit multiple SQEs, poll CQEs in an event loop (io_uring style)
- Runtime green threads or tasks multiplexed through one ring dispatcher;
the 7.1 threading contract keeps at most one blocked
cap_enterwaiter per process ring until a sharded or per-thread ring ABI exists - Userspace executor (like tokio) driving the ring
-
Cap passing in the POSIX adapter. POSIX has
SCM_RIGHTSfor passing fds over Unix sockets. Should the POSIX adapter support something similar for passing caps? Or is this native-only? -
Dynamic linking. Currently all binaries are statically linked. Should capOS support shared libraries? Probably not initially – static linking is simpler and the binaries are small. Revisit if binary size becomes a concern.
-
WASI component model integration. WASI preview 2 components have typed imports/exports that could map to capnp interfaces. Should the wasm runtime auto-generate capnp-to-WIT adapters from schemas? This would let wasm components participate natively in the capability graph.
-
Build system. How are userspace binaries packed into the boot image? Currently the Makefile builds
init/separately. With multiple service binaries, need a more scalable approach (build manifest that lists all binaries, Makefile target that builds and packs them all).
Relationship to Other Proposals
- Service architecture proposal – defines what services exist and how they compose. This proposal defines how those service binaries are built, what runtime they use, and how non-Rust software fits in.
- Storage and naming proposal – the POSIX
open()/read()/write()translation targets the Store and Namespace caps defined there. - Networking proposal – the POSIX socket translation targets the TcpSocket/UdpSocket caps from the network stack.
Proposal: Native Shell and POSIX Shell
How interactive operation should work on capOS without reintroducing ambient authority through a Unix-like command line.
Problem
capOS deliberately avoids global paths, inherited file descriptors, ambient network access, and process-wide privilege bits. A conventional shell assumes all of those. If capOS copied a Unix shell model directly, the shell would either be mostly useless or become an ambiently privileged escape hatch around the capability model.
The system needs two related, but distinct, shell layers:
- Native shell: schema-aware capability REPL and scripting language.
- POSIX shell: compatibility personality for existing programs and scripts.
Both must be ordinary userspace processes. Neither should receive special kernel privilege. The kernel and trusted capability-serving processes remain the enforcement boundary.
Model-driven interaction on top of the native shell is a separate concern and is defined in Language Models and Agent Runtime. The model runs as its own service with no session authority; the native shell (in “agent mode”) is the runner: it holds the session caps, exposes them to the model as typed tool descriptors with per-tool permission modes, executes tool calls on behalf of the model, streams results back, and keeps the user in the loop.
The first boot-to-shell milestone is text-only: local console login/setup and, later in the same family, a browser-hosted terminal gateway. Graphical shells, desktop UI, compositors, and GUI app launchers are a later tier. See Boot to Shell.
Design Principles
- A shell starts with only the capabilities it was granted.
- A shell command compiles to typed capability calls, not stringly syscalls.
- Child processes receive explicit grants. There is no implicit inheritance of the shell’s full authority.
- Elevation is a capability request mediated by a trusted broker, not a flag inside the shell.
- Shell startup is a workload launch from a
UserSession, service principal, or recovery profile. Session metadata informs policy and audit; it is not authority. - Default interactive cap sets are broker-issued session bundles, not hard-coded shell privileges.
- POSIX behavior is an adapter over scoped
Directory,File, socket factory, and process capabilities. It is not the native authority model.
User identity and policy sit above this shell model. A shell session may be
associated with a human, service, guest, anonymous, or pseudonymous principal,
but the session’s capabilities remain the authority. RBAC, ABAC, and mandatory
policy decide which scoped caps a broker may grant; they do not create a
kernel-side uid, role bit, or label check on ordinary capability calls. See
User Identity and Policy.
Federated sessions (OIDC-authenticated principals, service accounts using
OAuth2 workload identity) are one input shape for this model. OAuth scopes
and OIDC claims from a session’s issuer feed AuthorityBroker as ABAC
attributes. They never authorize capability calls directly, and raw bearer
tokens never appear in shell state. The token-typed capabilities,
OAuthClient, OidcIdentityProvider, and the broker-side token handling
are defined in
OIDC and OAuth2.
Layering
flowchart TD
Input[Login, guest, anonymous, or service request] --> SessionMgr[SessionManager]
SessionMgr --> Session[UserSession metadata cap]
Session --> Broker[AuthorityBroker / PolicyEngine]
Broker --> Bundle[Scoped session cap bundle]
Bundle --> Native[Native shell]
Bundle --> Posix[POSIX shell]
Posix --> Compat[POSIX compatibility runtime]
Native --> Ring[capos-rt capability transport]
Compat --> Ring
Ring --> Kernel[Kernel cap ring]
Ring --> Services[Userspace services]
Native --> Approval[Approval client cap]
Approval --> Broker
Broker --> Services
Broker --> Audit[AuditLog]
The native shell is the primitive interactive surface. The POSIX shell is a
compatibility consumer of capOS capabilities, not the model other shells are
built on. A language-model service, when present, is invoked through a
LanguageModel cap from the native shell running in “agent mode”; the
shell is the tool runner, not the model. That flow is defined in
Language Models and Agent Runtime and is not expanded
in this diagram.
A shell may display a principal name, profile, role set, label, or POSIX UID,
but those values are descriptive unless a trusted broker uses them to return a
specific capability. Losing a home, logs, launcher, or approval cap
cannot be repaired by presenting the same session ID back to the kernel.
Native Shell
The native shell is a typed capability graph operator. Its job is to inspect, invoke, pass, attenuate, release, and trace capabilities.
Current implementation status as of 2026-05-16 21:36 UTC: capos-shell is
the standalone no_std crate at shell/ and ships the anonymous-first
interactive flow. Focused shell/login manifests still launch it directly as
initConfig.init; the default make run manifest now runs it as an
init-started service under standalone init, together with the chat /
adventure binaries and the remote-session CapSet gateway. On boot the shell
mints an anonymous UserSession via SessionManager.anonymous() and
receives an empty-allowlist anonymous bundle from AuthorityBroker.
login and setup commands use
CredentialStore/SessionManager/AuthorityBroker to verify or create the
password, mint an operator session, request the operator shell bundle, and
swap session/launcher in place. Login prompts for a username as well as a
password through a username-aware SessionManager.login() request that
carries method, selector, proof, and source metadata. A guest command
mints a guest session via SessionManager.guest() and swaps to a
broker-issued guest bundle (guest sessions require an explicit manifest seed;
no broad authority is granted to guest profiles). Shell exit calls
UserSession.logout() to clean up the session context. The default make run manifest includes the native shell, chat/adventure binaries, terminal,
console, stdio, chat, adventure, creds, sessions, audit,
broker, and system_info caps; its MOTD shows the concrete spawn / run
commands for the adventure demo. The current command set is help, caps,
binaries, motd, inspect <name>, session, login, setup, guest,
spawn, blocking run, wait, and exit, with a launcher-backed
binaries command that lists binaries available to the current session
(anonymous and guest launcher policies return an empty list).
The session-scoped TerminalSession substrate now exists behind
make run-terminal, and the bounded SSH terminal-host proof can launch
capos-shell over a socket-backed TerminalSession with a public-key
UserSession through RestrictedShellLauncher. The generic
call @cap.method(...) REPL, schema reflection, richer daily shell profiles,
and the full OpenSSH gateway remain future work.
Example init or development session with explicit spawn authority:
capos:init> caps
log Console
spawn ProcessSpawner
boot BootPackage
vm VirtualMemory
capos:init> call @log.writeLine({ text: "hello" })
ok
capos:init> spawn "tls-smoke" with {
log: @log
} -> $child
started pid 12
capos:init> wait $child
exit 0
Values
Native shell values should include:
@name: a named capability in the current shell context.$name: a local value, result, promise, or process handle.- structured values: text, bytes, integers, booleans, lists, and structs.
- result-cap values returned through the capOS transfer-result path.
- trace values representing CQE and call-history slices.
The shell should preserve interface metadata with every capability value. A method call is valid only if the target cap exposes the method’s schema.
Commands
Initial commands should be small and explicit:
caps
binaries
inspect @log
methods @spawn
call @log.writeLine({ text: "boot complete" })
spawn "ipc-server" with { log: @log, ep: @serverEp } -> $server
wait $server
run "ipc-client" with { log: @log, ep: client @serverEp }
release @temporary
trace $server
bind scratch = @store.sub("scratch")
derive readonly = @home.sub("config").readOnly()
inspect should show the interface ID, label, transferability, revocation
state when available, and callable methods. It should not imply that two caps
with the same interface ID are the same authority.
The current prototype intentionally does not yet provide the generic
call @cap.method(...) REPL. Until the schema registry and structured value
parser exist, native-shell exposes only narrow typed commands and should make
that gap visible through planning docs rather than accepting raw method IDs and
opaque byte blobs.
Syntax
The syntax should be structured rather than shell-token based. A CUE-like or Cap’n-Proto-literal-like shape fits capOS better than POSIX word splitting:
spawn "net-stack" with {
log: @log
nic: @virtioNic
timer: @timer
}
The shell can still provide abbreviations, but the executable representation
should be an ActionPlan object with typed fields.
Composition
Native composition should pass typed caps or structured values, not inherited byte streams by default:
pipe @camera.frames()
|> spawn "resize" with { input: $, width: 640, height: 480 }
|> spawn "jpeg-encode" with { input: $, quality: 85 }
|> call @photos.write({ name: "frame.jpg", data: $ })
If a byte stream is desired, it should be explicit through a ByteStream,
File, or POSIX adapter capability. This keeps the “pipe” operator from
silently turning every interface into untyped bytes.
Namespaces
There is no global root. A native shell may have a current Directory or
Namespace capability, but that is just a default argument:
capos:user> ls @config
services
network
capos:user> cd @config.sub("services")
capos:@config/services> ls
logger
net-stack
The shell cannot traverse above a scoped directory or namespace unless it holds another capability that names that authority.
Session Context
A session-aware shell may hold a self or session cap for UserSession.info()
and audit context. That cap is metadata. It can identify the principal, auth
strength, expiry, quota profile, and audit identity, but it cannot widen the
shell’s CapSet or authorize kernel operations by itself.
The launcher or supervisor starts the shell with a CapSet returned by
AuthorityBroker(session, profile). For interactive work, that bundle should
usually include scoped terminal, home, logs, launcher, status, and approval
caps. For service accounts, guest sessions, anonymous workloads, and recovery
mode, the broker returns different bundles under explicit policy profiles.
Shell-launched children inherit only the caps named in the spawn plan. A child
may receive a UserSession or session badge for audit, per-client quotas, or
service-side selection, but object access still comes from the scoped object
caps passed to that child.
Interactive Command Surfaces
Application-specific interactions must stay out of the native shell command
set. A chat client, adventure client, or other interactive application should
run as an ordinary shell-spawned application or resident service session, not
as a builtin such as chat or play adventure.
The near-term target is a prototype bridge, not the final app protocol:
capos-shell launches clients with spawn or run, grants them explicit
endpoint clients such as stdio: client @stdio, and services StdIO while
waiting. That proves exact grants, process handles, child completion, and the
terminal bridge without giving a child the shell’s move-only TerminalSession.
Legacy badge N syntax is retired from normal client @... grants; delegated
client endpoints preserve their service identity by default, and service object
capabilities replace badged chat/adventure identity. Explicit selector fixtures
remain only in low-level and hostile-path tests.
That StdIO bridge is intentionally limited. It is acceptable for focused
QEMU smokes and textual compatibility, but it is the wrong long-term semantic
boundary for capOS-native applications. If an adventure client receives a line
from StdIO and parses go north, take key, or say hello internally,
capOS has only moved string command parsing out of the shell and into the app.
That is still weaker than typed capability invocation.
Native interactive applications should expose a command surface:
path=["go"], args={direction:"north"}
path=["take"], args={item:"brass-key"}
path=["say"], args={text:"hello there"}
path=["chat","join"], args={channel:"#lobby"}
The user may still type familiar command <args> forms. The shell or terminal
host parses them through generic command metadata, including nested
subcommands, argument kinds, completions, and redaction rules. The app receives
a structured invocation and converts it to typed service calls. The shell does
not hardcode application verbs, and the application does not parse unstructured
terminal text for normal operations.
StdIO remains an explicit text I/O capability for transcript output, simple
programs, POSIX compatibility, and test harnesses. It should not be the primary
command interface for native chat/adventure-style applications. The focused
design is in
Interactive Command Surfaces.
Remote Session CapSet Clients
Not every remote interaction should become a shell session. A regular host
application – CLI, native GUI, Tauri backend, webapp gateway, or service
client – should be able to authenticate to capOS, receive a broker-issued
remote view of its session CapSet, and call the capabilities it was granted
over Cap’n Proto RPC. That path is a programmatic peer of the native shell:
both consume a session bundle from AuthorityBroker, but only the shell adds
command parsing, terminal state, and child-process workflow.
The remote client must not receive the kernel’s local CapSet page, local
cap-table indexes, endpoint selectors, result-cap indexes, or global session
identifiers. It receives typed RPC object references backed by a capOS
per-session worker. Chat, Paperclips, Adventure, command sessions, and future
service APIs should therefore be callable by generated clients without routing
through capos-shell. The owning design is
Remote Session CapSet Clients.
That proposal also covers bidirectional UI composition for web/Tauri/GUI
sessions: services can propose task-specific panes or command surfaces through
explicit UI caps, but cannot take arbitrary control of the host UI.
Terminal Host Separation
The shell should not be the terminal host forever. The component that owns a UART, web socket, GUI pane, line editing, history, paste handling, resize state, and render policy can be a separate terminal host process. The shell then runs against a terminal entity and can be reused unchanged from local console, GUI, web, and scripted hosts.
TerminalSession remains the foreground text-session authority, but it is an
interface between terminal host and shell, not proof that the shell implements
the terminal. Shell-spawned applications should normally receive command
sessions or explicit StdIO adapters, not the shell’s move-only
TerminalSession.
Remote text transports follow the same rule. The Telnet Shell Demo in
Networking is a demo-only plaintext
terminal host: it accepts a host-loopback QEMU-forwarded TCP connection and
gives the shell a socket-backed TerminalSession. The kernel-side socket
terminal silently consumes IAC option negotiation in its line discipline, so
no userspace pre-handoff recv is required. It must not turn the shell login path into a
raw ByteStream, raw TcpSocket, or StdIO substitute, because password
entry, echo policy, cancellation, and shell launch authority are defined at the
TerminalSession boundary. The QEMU harness for that demo binds the host
forward to 127.0.0.1:2323 only and runs caps to prove the child shell did
not receive raw NetworkManager, ProcessSpawner, TCP, or unknown capability
interfaces. The gateway itself remains a trusted demo bootstrap service until
scoped listener and manifest-declared shell-launch grants exist; production
remote CLI shell access waits for the SSH gateway layer. The SSH path is
specified separately in SSH Shell Gateway: it
keeps the same TerminalSession and broker-issued shell-bundle boundary, while
adding SSH host authentication, encrypted transport, public-key user
authentication, channel policy, and remote-session audit. Its initial schema
stubs name the terminal construction and authority surfaces as
SshTerminalFactory, TcpListenAuthority, and RestrictedShellLauncher; they
now have focused QEMU proofs for scoped listen authority, public-key session
minting, restricted shell launch, and a bounded plain-TCP terminal-host handoff.
A focused development-only host-key proof grants an explicitly labeled
non-production SshHostKey cap in QEMU that performs bounded fixture
exchange-hash signing. The full runnable OpenSSH gateway still waits on
encrypted transport, SSH packet/channel handling, persistent production
key-management-backed signing, and the final run-ssh-shell host harness.
Agent Mode
Model-driven interaction is defined in
Language Models and Agent Runtime. This proposal does
not describe a separate “agent shell” process. The native shell, running
in “agent mode”, is the tool runner: it holds the session cap bundle,
exposes caps to a LanguageModel service as typed ToolDescriptor
values with per-tool permission modes (auto / consent / stepUp /
forbidden), executes the model’s tool calls against its own caps,
streams results back into the conversation, and keeps the user in the
loop through consent prompts, streaming, and interrupt. There is no
separate PlannerAgent or ActionPlan pipeline.
Long-lived OpenClaw-like hosted agents, swarms, background tasks, external channel ingress, agent-maintained memory/wiki stores, and MCP/A2A-style interoperability are intentionally separate from the shell surface; see capOS-Hosted Agent Swarms. The shell can launch, inspect, approve, or cancel hosted tasks, but it should not own the hosted-agent control plane.
Approval and Authentication
Elevation belongs in a trusted broker service that the shell can consult but cannot impersonate.
Conceptual interfaces:
interface ApprovalClient {
request @0 (
reason :Text,
plan :ActionPlan,
requestedCaps :List(CapRequest),
durationMs :UInt64
) -> (grant :ApprovalGrant);
}
enum ApprovalState {
pending @0;
approved @1;
denied @2;
expired @3;
escalated @4;
}
interface ApprovalGrant {
state @0 () -> (state :ApprovalState, reason :Text);
claim @1 () -> (caps :List(GrantedCap));
cancel @2 () -> ();
}
interface AuthorityBroker {
request @0 (
session :UserSession,
plan :ActionPlan,
requestedCaps :List(CapRequest),
durationMs :UInt64
) -> (grant :ApprovalGrant);
}
ActionPlan is the structured description of the work the request will
perform. Free-form text it carries is for the approval UI; the broker
decides authority from the typed step list, never from the summary string.
struct ActionPlan {
# Brief, redactable, human-readable summary. Used by the approval UI;
# not used as an authority input by the broker.
summary @0 :Text;
# Structured action steps. The broker decides whether each step is
# representable for the bound session/profile; an unrepresentable step
# fails the whole request.
steps @1 :List(ActionStep);
# True if any step modifies durable state, terminates a service,
# releases storage, sends external traffic, or is otherwise hard to
# reverse. Brokers may require step-up authentication and longer
# review windows when this is set.
destructive @2 :Bool;
# Stable identifier the requester sets so it can correlate the resulting
# grant or queue entry. Brokers must not interpret this as authority.
requestId @3 :Data;
}
struct ActionStep {
union {
spawn :group {
# Manifest entry name or trusted launcher alias. The broker
# resolves the alias to a binary identity before grant.
target @0 :Text;
# Cap names the spawned process needs from the launcher's
# advertised set. Each name maps to a concrete `CapRequest`
# in the enclosing `ActionPlan.requestedCaps`.
capNames @1 :List(Text);
}
serviceControl :group {
service @2 :Text;
verb @3 :ServiceVerb;
}
storageOpen :group {
namespace @4 :Text;
path @5 :Text;
mode @6 :StorageMode;
}
# Free-form structured payload describing a step the broker
# recognises by name. Lets new step kinds land without re-issuing
# the schema; brokers refuse unknown `kind` values.
custom :group {
kind @7 :Text;
payload @8 :Data;
}
}
}
enum ServiceVerb {
start @0;
stop @1;
restart @2;
reload @3;
}
enum StorageMode {
read @0;
readWrite @1;
append @2;
}
CapRequest describes a single capability the plan needs. The broker
matches each request against the principal’s role bundle and ABAC
context; the response either narrows the request and mints the cap, or
denies. There is no widening path.
struct CapRequest {
# Capability interface name advertised by the broker
# (`ServiceSupervisor`, `Directory`, `TcpProvider`, ...). The broker
# refuses unknown interfaces.
interface @0 :Text;
# Identifier of the target object inside that interface. For
# `ServiceSupervisor` this is the service name; for `Directory` it
# is the namespace path; for `TcpProvider` it is an address-policy
# selector. The broker validates the target against policy.
target @1 :Text;
# Per-cap maximum duration. The grant returns the lesser of this and
# the plan-level `durationMs` after policy narrowing. Zero means
# "use plan-level default".
maxDurationMs @2 :UInt64;
# Optional attenuation hints (subdirectory, method allow-list,
# address filter). The broker may further narrow these but must
# never widen them.
attenuation @3 :Data;
}
GrantedCap is the same transport-level result-cap concept used by
ProcessSpawner – a typed reference to an attenuated, leased
capability the broker has minted. It is not a separate authority
encoding; reading the granted cap is the only way to use the granted
authority.
The native shell holds only a session-bound ApprovalClient. It does not
submit arbitrary PrincipalInfo, role, UID, label values, or authentication
proofs as authority. The ApprovalClient forwards the bound UserSession
and typed request to AuthorityBroker. The broker or a consent service
wrapping it holds powerful caps, drives any trusted consent or step-up
authentication path, and mints attenuated temporary caps after policy and
authentication checks.
The conceptual API intentionally has no authProof argument on the
shell-visible path. If a proof is needed, it is collected by
SessionManager, the broker, or a trusted approval UI and reflected back
to the shell only as pending, approved, denied, expired, or
escalated.
Approval Inbox
Synchronous approval is not always available. Step-up authentication, a dual-control destructive action, or a deferred review (for example a service-restart change-window) all need a durable queue: the request must be listable later, persistent across reconnects, and triageable in batch.
The broker exposes that queue through an ApprovalInbox cap minted
into the session bundle of whoever may approve. The inbox is not a
shell cap; the native shell uses ApprovalClient to submit requests,
and a separate principal (a security operator, the same operator under
step-up, or a multi-party reviewer set) holds the inbox cap that
decides them. Remote workspaces (the CapSet UI) treat
ApprovalInbox as the canonical pending-actions surface, which lets a
browser session show “you have pending approvals” without granting the
browser any of the requested authority.
interface ApprovalInbox {
# List entries currently awaiting decision. Bounded; the broker
# enforces a per-inbox visible-window cap and may return fewer than
# `limit` rows. `truncated` distinguishes "broker capped this page"
# from "no further rows".
list @0 (
cursor :Data,
limit :UInt32
) -> (
entries :List(ApprovalEntry),
nextCursor :Data,
truncated :Bool
);
# Look up a specific entry by id. Useful when a UI deep-links to
# an entry past the listed window.
entry @1 (entryId :Data) -> (entry :ApprovalEntry);
# Approve, deny, or escalate a single entry. `approve` returns the
# `ApprovalGrant` minted by the broker; `deny` and `escalate`
# transition the entry without minting caps. The decider's reason
# text is bounded and recorded in audit.
decide @2 (
entryId :Data,
decision :ApprovalDecision,
reason :Text
) -> (grant :ApprovalGrant);
# Bulk-decide entries that share shape (same requester principal,
# same plan summary fingerprint, same destructive flag). The broker
# rejects mixed shapes with an explicit diagnostic instead of
# silently approving heterogeneous requests.
batchDecide @3 (
entryIds :List(Data),
decision :ApprovalDecision,
reason :Text
) -> (grants :List(ApprovalGrant));
# Subscribe to inbox change events. The listener cap is held by
# the broker; logging out of the inbox session revokes the
# subscription.
watch @4 (listener :ApprovalListener) -> ();
}
enum ApprovalDecision {
approve @0;
deny @1;
escalate @2;
}
struct ApprovalEntry {
# Broker-minted opaque id, stable across reconnects.
entryId @0 :Data;
# Opaque audit-only principal id of the requester.
requesterId @1 :Data;
# Display name; not authoritative.
requesterName @2 :Text;
plan @3 :ActionPlan;
requestedCaps @4 :List(CapRequest);
durationMs @5 :UInt64;
state @6 :ApprovalState;
# Last decider reason or denial detail; bounded.
reason @7 :Text;
createdAtMs @8 :UInt64;
expiresAtMs @9 :UInt64;
escalation @10 :EscalationInfo;
}
struct EscalationInfo {
# Number of additional reviewers the broker has notified. Zero when
# the entry has not been escalated.
reviewerCount @0 :UInt32;
# Role names of the additional reviewers; never principal ids.
reviewerHints @1 :List(Text);
}
interface ApprovalListener {
appended @0 (entry :ApprovalEntry) -> ();
decided @1 (entryId :Data, state :ApprovalState) -> ();
expired @2 (entryId :Data) -> ();
}
The ApprovalClient itself does not change shape: a request that the
broker cannot decide synchronously still returns an ApprovalGrant
immediately, with state == pending and a stable handle. The broker
adds an entry to the corresponding inbox; the requester polls or
watches its grant; the inbox holder drives the decision. When the
inbox holder calls decide(approve), the existing grant transitions
to approved and claim returns the minted caps – the requester
does not learn an entry id, and the inbox does not learn the
requester’s ApprovalGrant cap. The two surfaces meet only at the
broker.
Inbox entries are durable across reconnects because entryId is
broker-minted and the inbox cap is session-bound rather than
transport-bound. Closing a transport does not delete entries;
re-presenting the same session-scoped inbox cap rebinds the listener
without losing pending state. Entries expire on the broker timer at
expiresAtMs and produce an expired listener event; expired
entries remain visible to entry() for a bounded audit window
defined by broker policy, after which they move to the audit log
only.
Elevation Flow
User request (typed directly, or produced by agent-mode tool-use as an
ActionPlan before invoking the broker):
restart the network stack
Requested action presented to the broker:
- stop service "net-stack"
- spawn "net-stack"
- grant: nic, timer, log
- wait for health check
Missing authority:
- ServiceSupervisor(net-stack)
Requested duration:
- 60 seconds
Broker decision:
- Which
UserSessionand profile is this request bound to? - Is that principal/profile allowed to restart
net-stack? - Is the requested binary allowed?
- Are the requested grants narrower than policy permits?
- Do mandatory confidentiality and integrity constraints allow the grant?
- Is there fresh user presence?
- Does this require step-up authentication?
If approved, the broker returns a narrow leased capability:
supervisor: ServiceSupervisor(service="net-stack", expires=60s)
It should not return broad ProcessSpawner, BootPackage, or
DeviceManager authority when a scoped supervisor cap can do the job.
Authentication
Authentication proof should be consumed by the SessionManager or broker
boundary, not exposed as a secret to the shell. Suitable mechanisms include:
- password or PIN for medium-risk local actions.
- hardware key or WebAuthn-style challenge for administrative actions.
- TPM-backed local presence for device or boot-policy operations.
- OIDC step-up: broker requests a fresh ID token from the session’s IdP
with
prompt=login,max_age, or strongeracr_valuesbefore returning a leased cap. The IdP andSessionManagerdrive the user interaction; the shell sees onlypending→approved/denied. - multi-party approval for destructive policy, storage, or recovery actions.
The shell should never receive raw tokens (including OAuth access or refresh
tokens), private keys, recovery codes, or full environment dumps. When the
broker must delegate outbound authority to a session — for example, “read
from this company’s HR API” — it returns a wrapper capability that holds
the AccessToken internally; the shell invokes the wrapper without seeing
the bearer string.
Shell Hardening
The shell must treat files, logs, web pages, service output, model output, and CQE payloads as untrusted data. They are not instructions.
Required behavior:
- show an executable typed plan before authority-changing actions.
- keep elevated caps leased, narrow, and short-lived.
- release temporary caps after the plan finishes or fails.
- audit every approval request, grant, cap transfer, and release.
- require exact targets for destructive actions.
- refuse broad phrases such as “give it everything” unless a trusted policy explicitly allows a named emergency mode.
- keep any model-derived context separate from secrets and authentication proofs; see the LLM/agent-runtime proposal for the model-service side.
The enforcement rule is simple: users and models may propose, explain, and request. Capabilities decide what can happen.
POSIX Shell
The POSIX shell is a compatibility layer for existing software and scripts. It should be useful, but it should not define native capOS administration.
The C-ABI substrate for porting POSIX programs (including a POSIX shell) is
specified separately in
POSIX Adapter. libcapos exposes the
capability ring, CapSet, raw syscalls, and heap to C; libcapos-posix layers
the POSIX shape (fd table, errno, pipe / read / write / dup / dup2,
fork / execve / waitpid / _exit, posix_spawn and the file-action
shims, clock_gettime, UDP socket calls, console-backed stdio) on top. Phases
P1.1, P1.2, and P1.3 of that proposal are landed; the C-substrate, pipe cap,
recording-shim fork-for-exec, direct posix_spawn path, and Console-backed
stdio are proven by QEMU smokes (make run-c-hello, make run-posix-dns-smoke,
make run-posix-pipe-smoke, make run-posix-stdio-smoke). The POSIX shell port
itself depends on Namespace and File caps, which are tracked in that
proposal as gating work after the current phases close.
Mapping
POSIX concepts map onto granted capabilities:
| POSIX concept | capOS backing |
|---|---|
/ | synthetic root built from granted Directory or FileServer caps |
| cwd | current scoped Directory cap |
| fd | local handle to File, ByteStream, pipe, terminal, or socket cap |
| pipe | ByteStream pair or userspace pipe service |
PATH | search inside the synthetic root or a command registry cap |
exec | ProcessSpawner or restricted launcher cap |
| sockets | socket factory caps such as TcpProvider or HttpEndpoint |
uid, gid, user, group | synthetic POSIX profile derived from session metadata |
$HOME | path alias backed by a granted home directory or namespace cap |
/etc/passwd, /etc/group | profile service view, scoped to the compatibility environment |
| env vars | data only; never authority by themselves |
If a POSIX process has no network cap, connect() fails. If it has no
directory mounted at /etc, opening /etc/resolv.conf fails. If it has no
device cap, /dev is empty or synthetic.
A POSIX shell is launched with both a CapSet and compatibility profile metadata. The profile controls what legacy APIs report. The CapSet controls what the process can actually do.
Compatibility Limits
Exact Unix semantics should not be promised early.
- Prefer
posix_spawnover fullforkfor the first implementation. forkwith arbitrary shared process state can be emulated later if needed.setuidcannot grant caps. At most it asks a compatibility broker to replace the POSIX profile or launch a new process with a different broker-issued cap bundle.- Mode bits and ownership metadata do not create authority.
chmodcan modify filesystem metadata exposed by a filesystem service, but it cannot grant caps outside that service’s policy./procis a debugging service view, not kernel ambient introspection.- Device files exist only when a capability-backed adapter deliberately exposes them.
This is enough for many build tools and CLI programs without making POSIX the security model.
POSIX Session Caps
A normal POSIX shell session might receive:
terminal TerminalSession
session UserSession metadata
profile POSIX profile view
root Directory or FileServer synthetic root
launcher restricted ProcessSpawner/command launcher
pipeFactory ByteStream factory
clock Timer
Optional caps:
tcp scoped socket provider
home writable user Directory
tmp temporary Directory
proc read-only process inspection tree
Administrative caps still require broker-mediated approval.
Recovery Shell
A recovery shell is a separate policy profile, not the normal interactive shell with hidden extra privileges. It may receive a larger cap set, but only after strong local authentication and with full audit logging. Guest and anonymous profiles must not fall into recovery authority by omission.
Possible recovery bundle:
console
boot package read
system status read
service supervisor for critical services
read-only storage inspection
scoped repair caps
approval client
Destructive recovery operations should still go through exact-target approval. The recovery shell should be local-only unless a separate remote recovery policy explicitly grants network access.
Required Interfaces
This proposal implies several service interfaces beyond the current smoke-test surface:
UserSession/SessionManager: principal/session metadata, audit context, and guest or anonymous profile creation (user identity proposal).TerminalSession: session-scoped interactive terminal I/O. The first boundary is line-orientedwrite,writeLine, and boundedreadLinewith per-call echo control andsubmitted/cancelled/closedoutcomes; resize and paste framing can layer on later.StdIO: explicit text I/O capability serviced by the shell, a test harness, a web gateway, or another UI adapter. It has namedstdout,stderr, andstatusstreams plusline,block, andhiddenread modes; it does not imply inherited POSIX file descriptors and should not be the semantic command interface for native interactive applications.CommandSession: generic interactive command surface for native applications. It describes command paths, nested subcommands, argument shapes, completions, prompts, redaction metadata, render events, and typed invocation results.TerminalHost/ terminal entity: process and session object owning raw terminal transport, line discipline, presentation state, history, resize, and GUI/web framing while granting a foreground session to the shell.SchemaRegistry: maps interface IDs to method names and parameter schemas.CommandRegistry: optional registry of native command capabilities.SystemStatus: read-only process and service status.LogReader: scoped log access.ServiceSupervisor: restart/status authority for one service or subtree.AuthorityBroker/ApprovalClient: session-bound base bundles, plan-specific leased grants, and policy/authentication mediation.CredentialStore,ConsoleLogin, andWebShellGateway: boot-to-shell authentication services for password-verifier setup, passkey registration, federated OIDC login, and text terminal launch (boot-to-shell proposal).OAuthClient,OidcIdentityProvider,TokenVerifier,WorkloadIdentityFederation: OAuth2/OIDC primitives for federated login, outbound service authentication, and inbound resource-server token validation (OIDC and OAuth2 proposal).SshGateway,SshHostKey,AuthorizedKeyStore,SshTerminalFactory,TcpListenAuthority, andRestrictedShellLauncher: production remote CLI terminal ingress, SSH host-key proof, public-key login mapping, scoped TCP listen authority, shell-only launch authority, and SSH-backedTerminalSessionlaunch. The current development host-key proof exposes non-production public metadata and performs bounded fixture signing in QEMU; production host keys still require persistent key management (SSH shell proposal).AuditLog: append-only record of plans, approvals, grants, and releases.POSIXProfile/ compatibility broker: synthetic UID/GID, names,$HOME, cwd, and profile replacement without treating POSIX metadata as authority.ByteStream/ pipe factory: explicit byte-stream composition for POSIX and selected native pipelines.
These should be ordinary capabilities. A shell only sees the subset it has been granted.
Implementation Plan
-
Native serial shell
- Built on
capos-rt. - Lists initial CapSet entries.
- Invokes typed methods on the capabilities it was actually granted,
including
TerminalSessionfor ordinary interactive sessions. - When launched with a restricted launcher or other scoped spawn authority,
spawns and waits on exact-grant children without assuming broad
BootPackageorProcessSpawneraccess. - Provides
caps,inspect,call,spawn,run,wait,release, andtrace. - Runs interactive applications as ordinary spawned commands or resident
command sessions.
StdIOrequests may be serviced for text-stream programs, but native app commands should flow through structured command surfaces.
- Built on
-
Session-aware shell profile
- Use the
SessionManager -> UserSession metadataandAuthorityBroker(session, profile) -> cap bundlesplit. - Add
self/sessionintrospection without making identity metadata authoritative. - Start with guest, local-presence, and service-account profiles before durable account storage exists.
- Use the
-
Structured native scripting
- Add typed variables, result-cap binding, and plan serialization.
- Add schema registry support for method names and argument validation.
- Add a generic command-surface parser so
command <args>and nested subcommands compile to typed invocations without app-specific shell matches. - Add explicit byte-stream adapters for commands that need text streams.
-
Approval broker
- Define
ActionPlan,ActionStep,CapRequest,ApprovalClient,ApprovalInbox,ApprovalEntry, and leased grant records. - Add local authentication and audit logging.
- Make administrative native-shell operations request scoped caps through the broker instead of running from a permanently privileged shell.
- Wire
ApprovalInboxinto the operator session bundle so deferred, stepped-up, and multi-party approvals have a durable triage surface instead of relying on synchronous return-from-request.
- Define
-
Boot-to-shell integration
- Add local console login/setup in front of the native shell.
- Require a configured password verifier when one exists.
- Enter setup mode when no console password verifier exists.
- Treat guest as an explicit local profile and anonymous as a separate remote/programmatic profile, not as missing-password fallbacks.
- Support passkey-only web terminal setup through local/bootstrap authority, not unauthenticated remote first use.
- The local console login/setup half of this step is landed; the full boot-to-shell flow (durable multi-verifier accounts, passkey paths, federated OIDC login, web text shell gateway, production SSH shell gateway) is tracked in Boot to Shell.
-
Agent mode (out of scope here)
- Defined in Language Models and Agent Runtime:
no separate “agent shell” process. The native shell, running in
“agent mode”, is the tool runner: it gains a
LanguageModelclient cap plus a per-tool permission table (auto/consent/stepUp/forbidden), exposes its own session caps as typedToolDescriptorvalues to the model service, executes the model’s tool calls against those caps, streams results back into the conversation, and keeps the user in the loop through consent prompts and interrupts. There is noPlannerAgentor staticActionPlanpipeline.
- Defined in Language Models and Agent Runtime:
no separate “agent shell” process. The native shell, running in
“agent mode”, is the tool runner: it gains a
-
POSIX shell
- Implement after
Directory/File,ByteStream, and restricted process launch exist. - Start with
posix_spawn, fd table emulation, cwd, scoped root, pipes, and terminal I/O, plus synthetic POSIX profile metadata. - Add broader compatibility only as real workloads demand it.
- Implement after
Non-Goals
- No global root namespace.
- No shell-owned root/admin bit.
- No model-visible secrets.
- No default inheritance of all shell caps into children.
- No authorization from
PrincipalInfo, UID/GID, role, or label values alone. - No promise that POSIX scripts observe exact Unix behavior without a compatibility profile that grants the needed caps.
Open Questions
- Should the native shell syntax be CUE-derived, Cap’n-Proto-literal-derived, or a smaller custom grammar?
- How should schema reflection be packaged before a full runtime
SchemaRegistryexists? - How should later
TerminalSessionextensions such as resize and paste framing fit without exposing raw transport authority to ordinary shells? - How should the broker fingerprint plans for
ApprovalInbox.batchDecideshape-equivalence? A direct hash ofActionPlan.stepsis enough for identical plans submitted by the same requester profile, but near-identical plans differing only inrequestIdor summary text must still batch; near-identical plans differing in step targets or attenuation must not. The broker design needs an explicit fingerprinting rule beforebatchDecidecan be enabled. - How should audit logs be stored before persistent storage exists?
- How should interactive terminal UX scale beyond the planned
“one typed capability per command” native-shell surface? The current
prototype only exposes narrow typed commands; the questions below apply
to the proposed surface, not just what already runs. Several concrete
pain points are open:
- Cap management is manual. A shell user holds a CapSet and must
inspect, name, attenuate, pass, andreleasecaps explicitly per command. That is the right model for trust, but it is hostile for everyday work compared with a Unix prompt where$PWD,$PATH, open fds, and ambient credentials disappear from the user’s mind. The question is what affordances (named bindings, scoped session “workspaces”, broker-issued bundles bound to a task, auto-release on plan completion, undo/redo on cap moves, a visible “current authority” indicator) the shell should provide so the typical user is not hand-curating a cap graph for every line. None of this should re-introduce ambient authority; the goal is ergonomics over an already typed graph, not hiding it. - No agreed convention for passing parameters to programs. The
manifest currently launches binaries with a named CapSet and no
positional
args, noargv, no environment block, and no structured parameter struct (seesystem.cueandSystemManifestinschema/capos.capnp); init’sProcessSpawner-driven children inherit only the caps named in the spawn plan. Shellspawn ... with { ... }syntax is similarly cap-only. That is consistent, but it leaves “what does this program need to know besides its caps?” unanswered: where do free-form values (a chat channel name, an adventure save slot, a resize width) live? Options range from a typedLaunchParameterscapnp struct passed through the spawn plan, to a convention that every program declares a parameter schema discovered viaSchemaRegistry, to letting parameters always travel as fields on the first method call against aCommandSession/service cap rather than at launch time. The proposal should pick a single shape and describe how the manifest, shellspawn/run, native applications, and POSIXargvadapters all map onto it. - No replacement for Unix pipes. The native composition example uses
|>but defers byte-stream semantics toByteStream/StdIO, which is a strictly weaker pipe and not a data-processing model. Real workloads on Unix lean on text streams precisely because they are cheap and structured-enough; capOS can do better with typed records. The open question is whether to standardize a higher-level data-processing primitive — for example, YTsaurus-style map/reduce operators where each stage declares input and output schemas (RecordStream<T>?), the runtime negotiates a wire format (capnp records, framed JSON, columnar, raw bytes) at the boundary, and the shell’s|>becomes a pipeline planner rather than a byte pump. That would give native shell pipelines first-class typed composition without making every interface look likeByteStream. The question is whether this belongs in shell scope, in a separate data-processing proposal, or as aRecordStreamcapability in the schema registry that the shell merely consumes. - No story for ordinary shell programming constructs. The proposed
surface is one typed call per line plus
|>; the prototype is even narrower. Real interactive and scripted use needs conditionals (branch on a cap call result, onCapExceptionkind, on a value field), loops (iterate aList, fold aRecordStream, retry-with-backoff against a Timer), local variables and assignment beyond the implicit$from|>, user-defined functions/procedures that take typed parameters and capability arguments, early-return / break, and structured error handling that distinguishes transport-levelCapExceptionfrom application-level result variants. Each of these has capability-graph consequences that POSIX shells never had to face: does a function body close over the caller’s CapSet by reference or by an explicit captured set, are caps bound inside a loop iteration auto-released at the end of that iteration, does atry/recoverblock release leased broker grants on the failure path, can a function be saved and re-invoked across sessions (i.e. does it become a persistentActionPlantemplate), and how does the shell present a partial failure mid-pipeline without leaving orphan caps. The proposal should decide whether the native shell language defines these constructs itself, borrows them from a host language (CUE, a small embedded Rust-like DSL, an existing scripting runtime exposed as a capability), or stays deliberately non-Turing-complete and forces non-trivial control flow into spawned programs that expose typedCommandSessioninterfaces back to the shell. - No environment-variable concept, and no clear replacement. Unix
$VAR/exportdoes three jobs at once: ambient configuration inherited by every child, a per-process key-value scratchpad, and a side channel for caller-supplied tweaks (PATH,LANG,TZ,HTTP_PROXY,XDG_*). capOS deliberately has none of this — the manifest passes only a CapSet, and the shell does not synthesize a process-wide string-keyed table. There is also no obvious immediate need: configuration that should be authoritative belongs in aConfigcapability, locale/timezone are policy state on a session or service cap, and per-invocation tweaks fit the still-undecided parameter-passing convention above. The open question is whether capOS ever needs an explicit environment-like primitive (e.g. aKeyValueScopecapability bound to a session, an inheritable structured “ambient context” attached to a spawn plan, or a typedConfigOverlaychannel) for the cases where Unix would have used an environment variable, or whether each historical use case should instead be replaced by a dedicated capability (Locale,Clock,ProxyPolicy,XdgPaths,LogLevel) and the absence of an environment table treated as a feature rather than a gap. POSIX compatibility still has to exposegetenv/environ, but that is a separate per-process synthetic view inside the POSIX profile, not a native-shell concept.
- Cap management is manual. A shell user holds a CapSet and must
Proposal: Remote Session CapSet Clients
Let a regular host application connect to a capOS instance, authenticate through the same session machinery as shells and gateways, receive a broker-issued remote view of its CapSet, and invoke the granted capabilities over standard Cap’n Proto RPC. The first proof can be a Linux Rust CLI because it is easy to script, but the design is for host applications generally: native GUI apps, Tauri apps with Rust backends, server-side webapp gateways, desktop tools, and agent runners can all consume the same remote session CapSet model.
The important correction is that this is not a special “remote chat client” and not another shell transport. Chat, Paperclips, Adventure, system-info, command surfaces, and future service APIs should be ordinary capabilities in a remote session bundle. A shell is one possible client of that bundle; it is not the universal protocol.
Current State
The tree has several local interop and UI proofs:
demos/capnp-chat-interopruns inside capOS, accepts one scoped TCP connection, decodes a schema-framedChat.sendparameter message, calls the resident chat endpoint, returns a schema-framed result, and exits.- The host harness uses a Linux Python script plus the pinned
capnptool to encode/decode request and result messages. demos/remote-session-capset-gatewayruns inside capOS, listens through a manifest-scopedTcpListenAuthorityon guest port2327, authenticates a remote session throughSessionManager, returns a broker-shaped remote CapSet view, calls session/system-info DTO operations, and proves wrong-interface, unknown-cap, and stale-session denials. It derives login source metadata from the accepted socket and a gateway-generated connection event id.tools/remote-session-clientis a regular Linux Rust client crate. Its library is UI-neutral so the same client logic can back a CLI harness, native GUI, Tauri backend, or trusted web gateway.remote-session-uiis a trusted loopback web bridge in that crate. Its Rust backend holds the TCP connection and remote session state, serves a browser UI, and exposes only view models, call results, denial diagnostics, and redacted transcript rows to browser JavaScript. The focusedmake run-remote-session-capset-uiharness drives that UI against a gateway-only QEMU fixture.remote-session-web-uiis a capOS-served browser UI backend. Defaultmake runstarts it on guest port8080with loopback host forwarding, andmake run-remote-session-self-served-web-uiproves the full boot-resource UI bundle is served from the capOS-owned origin while preserving the same browser-safe view-model boundary. This remains local/QEMU evidence; the cloudboot L4, private GCE, and public ingress proofs are separate tasks.
Those proofs are useful because they show external Cap’n Proto data can cross the QEMU TCP boundary and reach capOS-hosted services through narrowed listener caps. The remote-session proof is the first target-shaped slice, but it is not the final RPC API. It still lacks:
- standard
capnp-rpcmessage transport; - live typed RPC proxy objects rather than DTO-mediated gateway operations;
- live endpoint-backed proxy objects beyond the current authenticated
per-session DTO worker slices for
Chat.send, Adventurestatus/look/inventory/go(direction), and the Paperclips Path B bridge-internalinitial/command/status/projectssynthesis from cachedserviceLaunchstate; - Paperclips service-runner launch on the default
make runmanifest (Path B wires the gateway worker, bridge dispatch, UI launch slot, and thesystem-remote-session-paperclips.cuefocused manifest now declares its AuthorityBroker launch policy, but default-manifest Paperclips launch wiring remains future work); - the on-wire Paperclips control-plane (Path C): extending
RemoteGatewayRequest/RemoteGatewayResponsewith paperclips arms so the bridge no longer synthesizes responses from cached launch state and the gateway worker drivesPaperclipsGameClientover a real DTO arm rather than the manifest-staticgameendpoint fallback; - rich Adventure/Paperclips client controls and broader service-specific worker/client implementations beyond the current Chat, Adventure, and Path B Paperclips slices;
- complete object lifetime and exception behavior;
- broader revocation and object-drop propagation beyond the current kernel-backed DTO logout and connection-teardown path;
- TLS/mTLS and expanded auth adapters beyond password, anonymous, and guest;
- resource accounting for remote references, in-flight calls, and result sizes.
Goals
- Support a normal host client built and run outside capOS. A Linux Rust CLI is the smallest harness; native GUI and Tauri/webapp-backed clients should not need a different capOS protocol.
- Authenticate through capOS session/admission services, not through an application-specific service secret.
- Support multiple admission methods: local password where policy enables it, public-key signatures, OIDC/OAuth browser or device flows, passkey/WebAuthn through the web gateway path, mTLS client identity, guest/anonymous profiles where explicitly enabled, and future service/workload credentials.
- Return a live remote CapSet view whose entries are typed RPC client objects, not serialized local cap-table slots.
- Let the client call any granted remote-proxyable capability by name and expected interface ID.
- Let a host UI discover broker-approved service profiles, start allowed game server processes through a restricted service-runner, and attach the capabilities those processes export or receive without exposing local spawn authority.
- Support bidirectional session UI composition: a host UI can call capOS capabilities, and capOS-side services or agents can propose bounded changes to the host session’s panes, command palette, visualizations, density, theme, and workflow-specific controls through explicit UI capabilities.
- Keep local-only authority local: cap IDs, endpoint generations, receiver selectors, session-global identifiers, and kernel result-cap indexes never become portable remote authority.
- Preserve session-bound invocation context. Remote calls run under the gateway/worker session created for that remote client.
- Make logout, disconnect, transport breakage, session expiry, policy revocation, and object drop observable and fail closed.
Non-Goals
- General network transparency across arbitrary capOS hosts.
- OCapN compatibility or third-party handoffs.
- Browser JavaScript receiving capOS capability objects directly. A webapp may be a front end, but a trusted server, gateway, or Tauri Rust backend holds the remote CapSet.
- Letting capOS services execute arbitrary host UI code, inject unreviewed JavaScript/CSS, spoof trusted browser/desktop chrome, or persist UI changes outside the granted session UI scope.
- Replacing SSH, WebShellGateway, native shell, or interactive command surfaces.
- Exposing raw
ProcessSpawner, raw process handles, endpoint owner caps, local cap ids, result-cap slots, raw network factories, broad storage roots, key material, or browser-held capOS capability objects as a default remote bundle. Process handles stay backend-local. - Treating a browser or webview as a capOS capability host. Browser code sees view models, launch forms, command descriptors, user events, diagnostics, and rendered results; the trusted Rust/backend side holds the remote session and any remote capability proxies.
- Treating password authentication as the only or preferred remote path.
- Serializing the kernel CapSet page or local cap table to the client.
UI Scope And Architecture
This section is the single-page synthesis future contributors should read
before changing anything in tools/remote-session-client/ or the gateway.
The detailed mechanics live in the rest of this proposal, the backlog
(docs/backlog/remote-session-capset-client.md), and the plan
(docs/backlog/remote-session-capset-client.md); this section captures
what the UI is for, what it must hold, and how the pieces decompose.
Goal
A remote operator, after authenticating to a capOS gateway, can drive every remote-proxyable capability the broker grants their session – directly, with typed UI, without a shell, without webview-held capOS handles, without leaking session-id hex, cap slots, or process handles to the browser. The CapSet UI is not a shell, not a generic API explorer, and not a browser; it is a peer client of the same broker bundle a shell would consume, over TCP/RPC instead of the ring page, with a backend-held authority boundary and a typed UI on top.
What the UI is for
Grouped by intent, not by panel. Each item is constrained by the corresponding section later in this proposal.
- Sign in to a remote capOS host. OS-style login surface with a
visible username field, secondary endpoint/auth controls, no full
persistent technical header. The gateway advertises the auth methods
the system makes available (narrowed only by explicit manifest
policy); disabled methods stay listed and clearly marked so the
protocol is not password-shaped. The web UI’s username field is
empty by default – the bridge does not pre-fill from
CAPOS_REMOTE_SESSION_USER, hostUSER, or any other host-side identity hint, because a pre-fill leaks operator/account hints to anything observing the page before authentication. The CLI may take--useras an explicit operator override; the web UI does not. Denials surface with explicit codes, never as silent transport errors. - Understand who/what the operator is. Session view: principal,
profile, auth method, auth strength, freshness/expiry, logout.
Redacted session-id only. Lifecycle states observable:
live/logged_out/ futureexpired/revoked/recovery_only. Stale-call attempts must visibly fail closed. (See## Invocation Contextanddocs/proposals/session-bound-invocation-context-proposal.md.) - Discover what was granted. CapSet view as the inspection surface
(name, interface id, transfer policy, lease expiry, get-by-name+id);
service catalog view as the task-oriented surface (broker- and
launcher-advertised runnable profiles, required grants, exported
descriptors, launch/probe/status). See
## Service Catalog And Game Server Launch. - Use what was granted. For every cap the broker bundles, the UI
must offer at least a generic invocable form – not just inspection.
Service-specific rich clients (Adventure rich client, real Chat
panel, Paperclips client, future agent-shell-services) layer on top
of the same backend-held caps. Where a service exposes a typed
CommandSurface(seedocs/proposals/interactive-command-surface-proposal.md), the UI renders typed buttons/inputs/selectors driven by that surface’s metadata rather than hand-coded controls. Where a service exposes text/audio/video surfaces, the UI consumes them through the Chat substrate (docs/proposals/chat-multimedia-substrate-proposal.md): listener caps for incoming text/audio/video, capnp-> streammethods for outgoing media, capability-mediated peer/channel granting, and a WebRTC mapping for the browser-to-backend audio/video path. The CapSet UI never holds the listener caps directly; the trusted Rust backend owns them and emits redacted view-model events plus WebRTC handles for the browser. - Host a terminal panel when granted. The CapSet UI is not
defined as a terminal emulator and works without one. But when
the broker grants a
TerminalSessioncap – for a native shell, a POSIX shell, or any StdIO-based service that expects a terminal on the other side – the UI may host a terminal panel for that cap. The boundary stays: terminal bytes flow through a backend-heldTerminalSession; the browser renders frames it receives, never opens a raw shell or holds aProcessSpawner. - Surface agent-shell-exposed capabilities as first-class. The
CapSet UI does not contain the LLM loop, model client, or
tool-execution runner – those live in the agent shell process (see
docs/proposals/llm-and-agent-proposal.md). But agent-shell-exposed services (e.g. “send message to running agent”, “approve queued action”, “audio stream to/from agent”) are services the broker can bundle. When bundled, the CapSet UI exposes them through the same per-session worker / typed view-model pattern as Chat or Adventure. Action-approval queues are the canonical capability-driven UI surface here – the policy engine asks, the operator sees a queue and approves/denies per item. - Launch services where policy allows. Service-runner launch flow:
select profile → see required grants → side-effect-free probe →
confirm → backend launches restricted server graph (e.g.
adventure-server+ NPC companions) → backend attaches/retains exported descriptors in the backend-held remote CapSet. Browser sees launch form, status, denials, descriptors – never rawProcessSpawneror process handles. - Diagnose / audit. Low-level probes (denied-chat, stale-call,
system MOTD, session-summary diff) live in a Diagnostics or Session
panel, not interleaved with normal service use. Redacted transcript
export in its own view; redaction status visible; raw authority
material absent. UI smoke checks for forbidden markers
(
processhandle,capabilitymanager,capslot, …). - Bidirectional UI composition (later). A capOS service may, only
when granted a
RemoteUiSurfacecap, propose bounded layout/theme/command/visualization patches and receive typed user events back. Cannot inject JS/CSS, spoof login chrome, persist UI state without a separate settings cap, or exceed quota/size bounds. See## Bidirectional UI Composition.
Design invariants the UI must hold
The proposals don’t specify pixel layout; they specify a small number of hard invariants. Every UI design choice has to fit these:
- Authority boundary. Trusted Rust backend holds: TCP connection, remote session state, per-session worker proxies, capOS cap references, broker bundle policy, raw snapshots used to compute view models. Browser holds: view models, command descriptors, launch forms, redacted transcript rows, theme state.
- Session-bound invocation. Every post-auth call runs under the
immutable
SessionContextof the per-session worker. The browser cannot select identity by request field; the backend cannot construct a freshSessionContextfrom request bytes. Logout, disconnect, expiry, revocation must break all session-bound proxies and fail closed before result bytes reach the caller. - Privacy-preserving disclosure. Default endpoint metadata is
opaque (
scoped_ref+ freshness). Subject fields (principal, profile, auth strength) appear in the UI only because the broker policy explicitly disclosed them for that service. - Capability = invoke gate; UI surface = render gate. A button on the screen is not what authorizes a call. The cap held in the backend is. UI controls that aren’t currently invocable must say “planned / not remote-proxyable yet” rather than imply they work.
- Interface = permission. Method-level access lives in the schema, not in a per-cap rights bitmask. Narrowing what a remote client can do means a narrower wrapper cap from the broker – not a flag on the same cap.
- Side-effect-free probes are real. A probe response that says “supported / required grants accepted / message” did not spawn anything, allocate endpoint owners, or attach caps.
- Redaction is structural, not after-the-fact. Sensitive fields are dropped or redacted on the way into view models, not stripped from logs after the fact. Backend tests assert browser envelopes never contain raw session-id hex or password material.
- UI smoke fails if any visible button is unexercised. This prevents the UI from accumulating decorative controls.
- Theme/layout state is local UI state, not capOS state. Persistence requires an explicit settings cap.
Architecture decomposition
flowchart LR
subgraph host[Host machine]
subgraph browser[Browser / webview / Tauri webview]
js[Browser JS - view models, forms, results, redacted transcript, theme state]
end
subgraph rust[Trusted Rust backend - tools/remote-session-client]
bridge[HTTP bridge - /api/* endpoints]
app[AppState - session VM, caps VM, snapshots, transcript, automation]
tcp[Gateway TCP connection - schema-framed DTOs today, capnp-rpc planned]
lib[remote-session-client lib - protocol, frame, session_diff, transcript]
end
cli[CLI binary - same lib backend]
end
subgraph capos[capOS guest in QEMU or future hardware]
subgraph gw[Remote-session gateway process]
tcplisten[TcpListenAuthority on guest port 2327]
authflow[Auth flow - password, anonymous, future adapters]
sm[SessionManager.login -> UserSession]
broker[AuthorityBroker.remoteClientBundle]
end
subgraph workers[Per-session RPC workers]
chatw[Chat worker - holds Chat client facet]
advw[Adventure worker - holds Adventure endpoint]
futurew[Future workers per service - terminal, agent, voice...]
end
subgraph services[Backing services]
cs[chat-server]
ad[adventure-server + NPCs]
pc[paperclips-server - future]
end
kernel[Kernel - SessionManager, CapTable, Endpoints, ring, audit]
end
js -- HTTP JSON --> bridge
bridge --> app --> lib --> tcp
cli --> lib
tcp -- TCP / DTO today / capnp-rpc planned --> tcplisten
tcplisten --> authflow --> sm --> broker
broker -- backend-held descriptors / caps --> app
app -- worker spawn requests --> broker
broker --> workers
chatw --> cs
advw --> ad
workers <--> kernel
Key seams:
-
Gateway boundary (
demos/remote-session-capset-gateway/): scopedTcpListenAuthority,SessionManager,AuthorityBroker, narrowly approved backend launch authority. No rawNetworkManager, rawProcessSpawner, broad endpoint authority. -
Per-session worker boundary (
demos/remote-session-chat-worker/,demos/remote-session-adventure-worker/, future workers): each endpoint-backed remote method runs in a worker that holds the live session-bound caller context. Worker spawn is validated; logout/connection-close tears down workers; release flushing happens on shutdown. -
Trusted Rust backend boundary (
tools/remote-session-client/src/): theAppStatekeepsgateway: Option<GatewayConnection>,current_snapshot: RemoteSessionSnapshot(raw), and view-model fields (redacted). The HTTP bridge’s/api/*surface is the only path the browser has into capOS authority. -
Browser boundary (
tools/remote-session-client/ui/): pure client of/api/stateview models,/api/call/*typed calls,/api/capset/*,/api/probe/*,/api/transcript/*. JS state is presentation: theme, active tab, login form values, click coverage report. -
Transport evolution. Today: bespoke schema-framed Cap’n Proto DTOs, length-prefixed frames, request/response sequence numbers. Planned: standard
capnp-rpcwith live proxy objects, exception mapping, release/drop, promise pipelining. The backend boundary stays the same; the wire shape changes.Standard
capnp-rpc(thecapnp-rpcRust crate, v0.25 at the time of writing) isstd-only and requires a futures executor; the QEMU-side gateway is#![no_std]#![no_main]with a synchronousloop { accept; loop { recv_frame; handle; send_frame } }shape (demos/remote-session-capset-gateway/src/main.rs). The wire-level replacement is therefore gated on either bringing an async runtime to capOS userspace or shipping a sync-friendly capnp-rpc adapter. Until then, transport-lifetime / exception behavior carries the contract documented next, which the eventual rewrite must preserve.Runtime decision for the first proxy layer: use a temporary dual-stack. The Linux host backend now has a local
capnp-rpcChatfacade/proxy layer because that side already hasstdand can run a futures executor. The facade translates backend-held typed proxy calls into the existingRemoteGatewayRequest/RemoteGatewayResponseDTO transport, so the guest gateway remains synchronous and#![no_std]. This proves host-backend proxy semantics, denial/disconnect mapping, and browser-safe view-model integration; it does not claim standardcapnp-rpcframing or live RPC vats inside capOS. Gateway-wire replacement waits for the userspace runtime decision above, and the dual-stack must be removed after the reviewed guest-side RPC path carries live service traffic.
Transport lifetime and exception contract
The bespoke transport’s lifetime contract is what the future
capnp-rpc proxy layer has to preserve. The host-side test module
in tools/remote-session-client/src/bin/remote_session_ui.rs pins
each rule end-to-end:
- Connection close mid-call clears state, returns
gatewayDisconnected. A TCP FIN observed during a request surfaces as503 gatewayDisconnectedwithview.lastResult.code = "gatewayDisconnected",view.connected = false,session = null, emptycaps/services/launchers, and adisconnecttranscript row scoped to the operation that failed. Covered byauthenticated_gateway_close_during_call_clears_view_with_reconnect_guidance,oversized_gateway_response_during_call_clears_view_with_reconnect_guidance,password_denial_then_closed_tcp_resets_before_retry,http_password_denial_then_closed_tcp_preserves_backend_error_and_clears_view. - Half-open transport (write succeeds, read stalls) times out
cleanly. The bridge’s
read_timeout(endpoint.io_timeout()) must fire and surface the samegatewayDisconnectedshape; no hang or partial-state leak. Both the post-request stall case and the partial-frame-header stall case are covered:half_open_response_read_times_out_as_disconnect,partial_response_header_then_stall_treated_as_disconnect. - Protocol-level decode errors (sequence mismatch, malformed
payload) yield
500 internalwithout tearing down the connection. This documents current behavior; the future capnp-rpc rewrite is expected to tighten this to a connection- level abort once the proxy layer is in place. Covered byresponse_with_wrong_seq_yields_internal_error,malformed_response_payload_yields_internal_error. - Immediate re-login after transport failure succeeds. No
retry / cooldown gate; the recovered session must not echo the
prior call’s failure as
lastResult. Covered byimmediate_relogin_after_mid_call_close_succeeds. disconnectrows survive into the operator-visible exported transcript (GET /api/transcript/redacted) scoped to the operation that failed and free of stream-level metadata (peer addresses, frame sizes, rawos errorstrings, secrets). Covered bydisconnect_recorded_in_exported_transcript_after_mid_call_close.- Gateway-side teardown calls kernel
UserSession.logouton both the explicit-logout DTO path and the connection-close path. Verified by the QEMU-driven harness intools/qemu-remote-session-capset-smoke.sh, which asserts thatUserSession.logout cap call succeeded; remote session staleandconnection teardown UserSession.logout cap call succeededboth appear during the multi-cycle interop run. - Post-logout calls fail closed. The bridge keeps the gateway
socket alive after logout so a stale-call probe gets an explicit
staleSessiondenial rather than a transport failure. Covered byrepeated_stale_calls_after_logout_remain_fail_closedand the worker-targetedstale_chat_proxy_after_logout_returns_typed_denial. - Worker/proxy lifetime failures preserve the same split.
Worker-targeted
Chat.sendtransport loss and oversized worker responses clear backend gateway/session state and surfacegatewayDisconnectedwith reconnect guidance, while post-logout worker calls remain typedstaleSessiondenials on the still-open gateway socket. The backend-onlycapnp-rpcfacade maps transport breakage toErrorKind::Disconnected, and maps DTO denials or unexpected worker/proxy responses toFailedCapException-like errors rather than panics or silent broader authority. Covered bychat_worker_transport_breakage_clears_state_and_redacts_export,oversized_chat_worker_response_maps_to_disconnect_without_frame_leak,generated_chat_client_transport_breakage_maps_to_disconnected_exception,generated_chat_client_dto_denial_maps_to_failed_cap_exception_like_error, andgenerated_chat_client_unexpected_worker_response_maps_to_failed_exception. - Revoked leases are not yet separately observable. The current
DTO surface carries
leaseExpiresAtMson cap entries, but it has no explicit revoke/lease-expired call path or denial code that can distinguish a revoked lease fromstaleSessionormethodDenied. Tests must not fake this coverage; add it with the standard RPC object lifetime path or a reviewed DTO denial shape. - Redacted transcript export does not expose exception/lifetime internals. Worker-targeted disconnect, oversized response, and stale-session exports are asserted free of raw socket addresses, OS error strings, frame-size diagnostics, local cap ids, result-cap labels, proxy table positions, raw session-id hex, passwords, and host endpoint hints.
Resource and revocation bounds
Each per-session resource class has an explicit named ceiling and maps over-cap conditions to a typed denial diagnostic that reuses the transport-error envelope from above. Operators tuning these bounds should re-audit the per-session memory budget and the operator-multitool scenario before changing them; raw observed counters are not exposed to browser-facing view models.
| Resource | Constant | Default | Where enforced | Denial code |
|---|---|---|---|---|
| Outstanding worker calls per session | MAX_OUTSTANDING_WORKER_CALLS_PER_SESSION | 4 | tools/remote-session-client/src/bin/remote_session_ui.rs::transact (gates Adventure / Chat-shaped requests before submission) | tooManyWorkerCalls (HTTP 503) |
| Transcript ring per session | TRANSCRIPT_ROWS_CAP (4096), TRANSCRIPT_DETAIL_BYTES_CAP (1 MiB) | row + byte caps | AppState::push_transcript / enforce_transcript_caps in the same file | drop-oldest plus a single audit "transcript truncated; ..." row |
| Backend cap holders per session | MAX_BACKEND_CAP_HOLDERS_PER_SESSION (64), MAX_BACKEND_SERVICE_CATALOG_ENTRIES (64), MAX_BACKEND_LAUNCHER_CATALOG_ENTRIES (32) | per-Vec entry caps | capset_list / service_catalog / launcher_catalog in the same file | tooManyCapHolders (mirrors transport-error envelope) |
| Browser-session owner slot | one tentative or authenticated owner | first-wins bridge owner | login-route preflight reserves before gateway authentication; success finalizes on cookie rotation, failure releases the reservation | sessionAlreadyInUse (HTTP 409) |
| Local HTTP request parser | request line 8 KiB, header line 8 KiB, 96 headers, aggregate headers 32 KiB, body 64 KiB, fixed read/write timeout | loopback bridge input bounds | read_http_request and handle_connection reject before route dispatch, JSON parsing, auth, or gateway I/O | httpLineTooLong, tooManyHeaders, headersTooLarge, requestBodyTooLarge, requestTimeout |
| Local HTTP handler slots | MAX_HTTP_HANDLER_THREADS (32) | concurrent request handlers | accept loop acquires a bounded slot before spawning a handler thread | handlerLimitExceeded (HTTP 503) |
| Concurrent gateway logins per principal | MAX_CONCURRENT_LOGINS_PER_PRINCIPAL (4), PRINCIPAL_TABLE_SLOTS (32) | per-principal counter, distinct-principal table ceiling | demos/remote-session-capset-gateway/src/lib.rs::PrincipalLoginTable::try_admit, called from both password and anonymous login paths | serviceUnavailable with “per-principal concurrent-session cap reached…” |
The bridge-side bounds are exercised by host tests in
remote_session_ui.rs::tests (transcript_row_count_cap_drops_oldest_with_truncation_marker,
transcript_byte_cap_drops_oldest_with_truncation_marker,
transcript_at_exact_row_cap_does_not_truncate,
capset_list_at_max_holders_bound_stores_all_entries,
capset_list_over_max_holders_returns_typed_denial,
service_catalog_at_max_entries_bound_stores_all_entries,
service_catalog_over_max_entries_returns_typed_denial,
launcher_catalog_at_max_entries_bound_stores_all_entries,
launcher_catalog_over_max_entries_returns_typed_denial,
outstanding_worker_calls_at_bound_still_allow_one_more_after_completion,
outstanding_worker_calls_over_bound_returns_typed_denial,
concurrent_first_wins_login_reservations_allow_one_post_login_owner,
failed_login_reservation_releases_for_later_owner,
http_parser_rejects_oversized_request_line_before_route_work,
http_parser_rejects_oversized_header_line,
http_parser_rejects_too_many_headers,
http_parser_rejects_aggregate_headers_too_large,
http_parser_rejects_oversized_body_from_content_length,
http_parser_times_out_incomplete_request_line,
handler_slots_bound_concurrent_request_threads).
The gateway-side bound is exercised by host tests in
demos/remote-session-capset-gateway/src/lib.rs::tests
(admits_up_to_max_concurrent_logins_per_principal,
rejects_over_cap_admission_with_typed_denial,
release_reopens_a_slot_for_the_same_principal,
distinct_principals_have_independent_counters,
release_to_zero_drops_the_slot,
release_unknown_principal_is_a_noop,
table_full_admission_does_not_grow_past_slot_ceiling).
Two contracts the future capnp-rpc rewrite must preserve:
fail-closed bound exhaustion never panics or leaks raw counters into
browser envelopes (only typed denial codes plus a backend audit row);
and operator-visible audit material (bound-exhausted transcript
rows, drop-oldest truncation markers) is recorded backend-side
through the existing redacted-transcript path, not surfaced through
new untyped error channels.
Layer map for future iterations
| Layer | Owner | Today | Heading toward |
|---|---|---|---|
| Wire | gateway ↔ backend | length-prefixed schema-framed DTOs | standard capnp-rpc over TCP, then TLS/mTLS |
| Auth | gateway | password, anonymous, guest; disabled methods advertised | + public key, OIDC (device-code + PKCE), passkey, mTLS, service credential |
| Bundle | broker | shell-bundle-shaped wrapper for remote | first-class remoteClientBundle profile shape |
| Worker | per-session | Chat.send, Adventure status/look/inventory/go | broader Adventure verbs, real Chat panel, Paperclips worker, generalized lifecycle, terminal-session host, agent-shell services |
| Backend (Rust) | trusted | AppState, snapshot, view models, transcript, automation, first-wins BrowserSession ownership, local HTTP parser/handler bounds, per-session resource bounds (worker-calls, transcript rows + bytes, cap holders, gateway logins per principal) | live RPC proxy state, RemoteUiHost cap holder |
| Browser | untrusted UI | login + Services / CapSet / Diagnostics / Transcript / Session SPA | richer service-specific clients, generic CommandSurface-driven forms, agent-shell mode, terminal panel for granted TerminalSession, RemoteUiSurface rendering |
| Host packaging | trusted | CLI, make remote-session-ui, make remote-session-tauri check/dev wrapper | distributable Tauri package sharing the same Rust backend |
Self-served capOS web UI boundary
The first self-served browser UI is a capOS-hosted application service, not the
host remote-session-ui development bridge moved into the guest. A new
capOS userspace service, remote-session-web-ui, owns the HTTP listener,
serves the UI bundle, runs the authenticated web-session backend, holds the
remote session CapSet/proxy state, and projects browser-safe view models.
Static assets are boot-package resources. The implementation should reuse the
reviewed host UI asset source or a smaller reviewed subset, but the served copy
is an immutable, fixed-name bundle embedded in the capOS boot package and
granted by manifest resource name with a pinned digest or equivalent build-time
integrity label. remote-session-web-ui serves only that bundle and a small
generated bootstrap document; it does not expose a host directory, capOS
storage root, asset traversal, or development hot-reload path.
The first listener surface is HTTP/1.1 on a manifest-scoped
TcpListenAuthority for a dedicated UI port such as guest port 8080.
HTTP serves static assets plus same-origin JSON API routes. WebSocket,
server-sent events, and terminal/media streaming remain later extensions that
need separate route-level bounds; the first proof should avoid them so the
authority and validation surface is small.
The manifest grants for remote-session-web-ui are narrow: scoped
TcpListenAuthority for the UI port, SessionManager, AuthorityBroker, the
immutable UI asset bundle, and the same restricted remote-client
service-runner/backend-launch authority needed to expose approved service
descriptors. It must not receive raw NetworkManager, raw socket factories,
broad storage roots, raw ProcessSpawner, shell launcher authority, endpoint
owner caps, or arbitrary endpoint creation authority.
The service is the trusted backend and holds remote CapSet/proxy state
server-side. Browser JavaScript receives only view models, launch forms,
user-event commands, typed results, denial diagnostics, and redacted transcript
rows. It never receives raw capOS caps, raw ProcessSpawner, process handles,
endpoint owner authority, local cap IDs, result-cap slots, session-global
identifiers, remote CapSet handles, host usernames, host environment variables,
host paths, or QEMU-forwarding identity hints.
Login remains session-manager shaped. The browser submits credentials or
guest/anonymous intent to the capOS-served JSON endpoint; the service derives
source metadata from its accepted socket and service-generated event id, asks
SessionManager for a UserSession, asks AuthorityBroker for the
remote-client bundle, and only then exposes disclosed session/service fields
as browser-safe models. The browser cannot select principal, profile, worker
session context, or backend cap holder by replaying a request field.
Gate 1B is now an evidence ladder rather than a single proof name. The landed local/QEMU layer is:
remote-session-self-served-web-ui: a focused manifest bootsremote-session-web-ui, browser automation loads assets from the capOS-owned origin, logs in, calls at least one granted capability through the service-held backend state, proves logout/stale failure stays closed, and checks forbidden authority markers are absent from browser-visible envelopes and transcripts.remote-session-self-served-web-ui-default-run: defaultmake runstarts the capOS-served UI on guest port8080and forwards it to a loopback host port for local operator use.remote-session-self-served-full-ui-bundle: the capOS service now serves the reviewed fixed-name boot-resource bundle, including the operator workspace assets and/bundle/manifest.json, with explicit content types, no directory traversal, and digest-pinned build evidence.
Those proofs do not close the selected GCE Web UI path by themselves. The local
service proof
cloud-prod-remote-session-web-ui-l4-local-proof
is done: it runs remote-session-web-ui through the non-qemu cloudboot socket
path using the Phase C userspace network stack and configured IPv4 route, not
the older QEMU-only kernel socket fixture or the host remote-session-ui
bridge.
After that, cloud-gce-private-self-hosted-webui-proof
proves private GCE reachability over the live NIC without public IP or public
firewall exposure.
cloud-gce-public-self-hosted-webui-ingress-tls
is the later public operator-access task; it remains on hold for explicit
public-ingress/TLS authorization even though the ingress policy design is
recorded.
Rollback is manifest/build-target selection: remove the focused target and the
remote-session-web-ui listener/asset grants while keeping the host-served
make remote-session-ui bridge and the remote-session CapSet gateway
unchanged.
Architecture
flowchart TD
Client[Host app: CLI, GUI, Tauri, or web gateway] -->|TCP/TLS + capnp-rpc| Gateway[RemoteSessionGateway]
Gateway --> Auth[Auth adapters]
Auth --> Sessions[SessionManager]
Gateway --> Broker[AuthorityBroker]
Broker --> Worker[Per-session RPC worker]
Broker --> Catalog[Remote service catalog]
Catalog --> Runner[Restricted service runner]
Runner --> GameServers[Game server processes]
Worker --> RemoteCapSet[RemoteCapSet]
RemoteCapSet --> Proxies[Remote capability proxies]
GameServers --> Proxies
Proxies --> LocalCaps[capOS capabilities]
Worker --> Audit[AuditLog]
The remote listener is a trusted gateway. In the final RPC shape it accepts
the transport, performs or delegates authentication, obtains a UserSession,
asks the broker for a remote-client bundle, and hosts a per-session RPC vat.
That vat exports a RemoteSession object and remote proxy objects for
capabilities in the broker-issued bundle. During the temporary dual-stack
period, the guest side still accepts DTO frames and the Linux host backend
hosts the first local proxy facade over those DTO calls.
For the first implementation the per-session worker may be an ordinary capOS service process. That shape matches the session-bound invariant: one workload process has one immutable session context. A single long-lived gateway may handle pre-auth connection state, but post-auth capability invocation should run inside a worker whose session context is the authenticated remote session, or through an equivalently reviewable dispatch path that cannot mix unrelated user sessions as ambient authority.
Bootstrap Interfaces
The DTO surface below is now pinned in schema/capos.capnp:
RemoteAuthStart, RemoteAuthStep, RemoteServiceGrantRequirement,
RemoteServiceExport, RemoteServiceProfile, plus the
RemoteSessionGateway, RemoteAuthFlow, RemoteSession,
RemoteCapSet, RemoteServiceCatalog, and RemoteServiceRunner
interfaces. Round-trip coverage for the new structs lives in
capos-config/tests/remote_capnp_rpc_dto_roundtrip.rs. The transport
that consumes them is still gated on the userspace async-runtime
decision (capnp-rpc v0.25 is std-only and needs a futures
executor). The first proxy slice is host-backend-only and dual-stack:
it uses capnp-rpc locally in the trusted Linux backend for Chat while
translating to the legacy RemoteGatewayRequest/RemoteGatewayResponse DTO
union on the gateway wire. The schema and generated bindings do not change for
that slice, and browser JavaScript still receives only view models, typed
results, typed denials, and redacted transcript rows.
enum RemoteAuthKind {
password @0;
publicKey @1;
oidcDeviceCode @2;
oidcAuthorizationCodePkce @3;
passkey @4;
mtlsClientCert @5;
guest @6;
anonymous @7;
serviceCredential @8;
}
struct RemoteAuthMethod {
kind @0 :RemoteAuthKind;
label @1 :Text;
profileHints @2 :List(Text);
interactive @3 :Bool;
enabled @4 :Bool;
}
struct RemoteAuthStart {
kind @0 :RemoteAuthKind;
selector @1 :LoginSelector;
requestedProfile @2 :Text;
clientNonce @3 :Data;
# Source metadata is intentionally not a client-supplied field.
# The gateway derives LoginSourceMetadata from the accepted socket
# and its own connection event id before calling
# SessionManager.login. A client-supplied source field would let
# remote callers forge audit metadata downstream services depend on.
}
struct RemoteAuthStep {
prompt @0 :Text;
redaction @1 :Bool;
url @2 :Text;
userCode @3 :Text;
challenge @4 :Data;
expiresAtMs @5 :UInt64;
}
interface RemoteSessionGateway {
authMethods @0 () -> (methods :List(RemoteAuthMethod));
start @1 (request :RemoteAuthStart) -> (flow :RemoteAuthFlow);
guest @2 (requestedProfile :Text) -> (session :RemoteSession);
anonymous @3 (requestedProfile :Text) -> (session :RemoteSession);
}
interface RemoteAuthFlow {
next @0 (response :Data) -> (step :RemoteAuthStep, done :Bool,
session :RemoteSession);
cancel @1 () -> ();
}
struct RemoteCapEntry {
name @0 :Text;
interfaceId @1 :UInt64;
transferPolicy @2 :Text;
leaseExpiresAtMs @3 :UInt64;
}
interface RemoteSession {
info @0 () -> (info :SessionInfo);
capSet @1 () -> (caps :RemoteCapSet);
renew @2 (proof :Data, requestedDurationMs :UInt64)
-> (session :RemoteSession);
logout @3 () -> ();
}
interface RemoteCapSet {
list @0 () -> (entries :List(RemoteCapEntry));
get @1 (name :Text, expectedInterfaceId :UInt64) -> (cap :AnyPointer);
}
struct RemoteServiceGrantRequirement {
name @0 :Text;
interfaceId @1 :UInt64;
transferMode @2 :Text;
holder @3 :Text; # backendHeld, serviceOwned, or clientFacet
}
struct RemoteServiceExport {
name @0 :Text;
interfaceId @1 :UInt64;
transferPolicy @2 :Text;
}
struct RemoteServiceProfile {
id @0 :Text;
label @1 :Text;
processGraph @2 :List(Text);
requirements @3 :List(RemoteServiceGrantRequirement);
exports @4 :List(RemoteServiceExport);
state @5 :Text; # unavailable, attachable, startable, running
}
struct RemoteServiceLaunchRequest {
profileId @0 :Text;
grantNames @1 :List(Text);
}
struct RemoteServiceCatalogEntry {
id @0 :Text;
label @1 :Text;
summary @2 :Text;
capName @3 :Text;
transportInterfaceId @4 :UInt64;
schemaInterface @5 :Text;
proxyStatus @6 :Text;
methods @7 :List(Text);
notes @8 :List(Text);
}
struct RemoteServiceLaunchStatus {
profileId @0 :Text;
status @1 :Text; # notLaunched, unsupported, denied, ready, running
launchSupported @2 :Bool;
message @3 :Text;
acceptedGrantNames @4 :List(Text);
exportedServices @5 :List(RemoteServiceCatalogEntry);
}
interface RemoteServiceCatalog {
list @0 () -> (profiles :List(RemoteServiceProfile));
}
interface RemoteServiceRunner {
probe @0 (request :RemoteServiceLaunchRequest)
-> (status :RemoteServiceLaunchStatus);
start @1 (request :RemoteServiceLaunchRequest)
-> (status :RemoteServiceLaunchStatus);
attach @2 (profileId :Text) -> (status :RemoteServiceLaunchStatus);
}
The AnyPointer result is proposal shorthand for an ordinary Cap’n Proto
capability pointer whose expected interface ID was already checked by the
gateway. Generated client helpers should immediately cast it to the requested
typed client. The remote client does not receive a numeric local capId,
endpoint selector, result-cap index, or session identifier it can replay
somewhere else.
The catalog and runner sketches are also proposal-level. They describe the
remote-facing contract, not the internal implementation. The completed launch
DTO/probe slice uses a serviceLaunch request/response arm for the
side-effect-free probe: RemoteServiceLaunchRequest carries only a profile id
plus explicit grant names, and RemoteServiceLaunchStatus reports status such
as notLaunched, unsupported, denied, ready, or running, launch
support, accepted grant names, and exported or planned service descriptors.
The current Adventure slice makes that serviceLaunch path a real restricted
backend launch for the default make run manifest, so Adventure may report
running and launchSupported=true after the approved server graph starts.
Paperclips remains a future launch profile. A capOS service runner may use
local spawn authority, BootPackage data, or broker-held service caps inside
capOS, but the remote client and browser/webview code receive only service
descriptors, launch requests, status results, denials, and remote capability
descriptors. Raw ProcessSpawner, process handles, endpoint owner caps, local
cap ids, and result-cap slots are not exposed.
Service Catalog And Game Server Launch
The default make run story and the focused game proofs are intentionally
different:
system.cueimportscue/defaults/defaults.cue, boot-launches standaloneinit, and lets init startchat-server,remote-session-capset-gateway,remote-session-web-ui, and the foreground shell. The default binary set includes Adventure server/NPC/client binaries and the terminal Paperclips binary.make runforwards guest port 8080 to a loopback host port and printsremote self-served UI: tcp 127.0.0.1 <port> -> guest :8080; themake run-default-web-uitarget proves the capOS-served endpoint with browser automation. Adventure is not boot-started automatically; the current remote-sessionserviceLaunchslice startsadventure-serverplus simple NPC companions through a restricted backend runner when requested. Paperclips landed in Path A + Path B as described below; the defaultmake runmanifest reportslaunchSupported=false / status=missingBinaryfor thepaperclipslauncher until Path C (the kernel-side AuthorityBroker allowlist extension and the on-wire DTO arm) lands.- The default remote-session gateway is narrow. It has console, scoped
TCP-listen authority for guest port
2327,SessionManager, andAuthorityBroker, plus narrowly approved backend launch authority for the Adventure profile; it does not expose rawProcessSpawner, raw network manager/socket authority, endpoint owner handles, process handles, local cap ids, result-cap slots, or game service endpoint owner caps. Theremote-session-web-uiservice separately receives scoped TCP-listen authority for guest port8080,SessionManager,AuthorityBroker, andconsole. make run-adventureusessystem-adventure.cueto startchat-server,adventure-server, Adventure NPC companion processes, anadventure-scenario-test, and the shell. The Adventure server exports anAdventureendpoint, consumes a client facet ofChat, and owns room/player state keyed by live caller-session references.make run-paperclipsusessystem-paperclips.cueto startpaperclips-serverandpaperclips-proof-serverservices exportingPaperclipsGameendpoints. The terminal client is then launched with explicitStdIO, game endpoint, timer, and optionalproof_acceleratorgrants. The server owns generated content, game state, timer cadence, command specs, status snapshots, project entries, unlock checks, and game-rule mutation.
The remote UI should not treat those terminal transcripts as the product boundary. The staged path is:
- The broker advertises a remote service catalog for the authenticated session. The catalog is derived from manifest/default profiles and policy, and includes only services the remote profile may inspect, attach to, or start.
- The launch DTO/probe slice is complete. It defines a remote-safe launch request, status, and probe contract for cataloged profiles. It can report unsupported launch state, accepted grant names, a message, planned exported service descriptors, and denial status without side effects: no process starts and no new capabilities are attached.
- The current Adventure slice implements the restricted service-runner path for the default manifest. It starts the Adventure server plus simple NPC companion processes with explicit named grants, then returns launch status and remote descriptors for exported or broker-held caps. Process handles stay backend-local.
- The trusted Rust backend attaches those descriptors to the backend-held
RemoteCapSetand drives typed calls. Browser JavaScript or a Tauri webview receives view models, launch/status forms, service descriptors, denials, and results, not raw capOS handles. The implemented DTO worker slices coverChat.sendplus Adventurestatus,look,inventory, and first mutable boundedgo(direction). - The first UI panels can be generic: service list, start/attach, status, read-only Adventure controls, a bounded movement control, transcript, and denial details. Purpose-built Adventure and Paperclips clients can layer richer rendering and broader mutable game actions over the same service-runner and remote CapSet backend later; Paperclips does not have default-manifest remote launch support yet.
Operator commands should stay explicit:
make run
cargo run --manifest-path tools/remote-session-client/Cargo.toml \
--target x86_64-unknown-linux-gnu \
--bin remote-session-client -- --host 127.0.0.1 --port <printed-port>
CAPOS_REMOTE_SESSION_PORT=<printed-port> make remote-session-ui
Add --launch-adventure to the CLI command to start the default-manifest
Adventure graph through the restricted serviceLaunch path and require a
running status.
Add --adventure-status after --launch-adventure to require read-only
Adventure status, look, and inventory responses through the
session-bound worker path.
Add --adventure-go east after --launch-adventure to require the first
bounded mutable Adventure go(direction) response through that same worker
path.
The Tauri wrapper runs from this repository and reuses the same backend
boundary by loading the loopback remote-session-ui surface in a desktop
webview:
CAPOS_REMOTE_SESSION_PORT=<printed-port> make remote-session-tauri
That target checks Tauri CLI and Linux build prerequisites, reports
dependency/scaffold status, and runs a deterministic wrapper check by default.
Set CAPOS_REMOTE_SESSION_TAURI_MODE=dev to launch cargo tauri dev.
Missing host Tauri packages fail with explicit diagnostics and point operators
back to make remote-session-ui. The webview receives the same browser-safe
view models, events, denials, typed results, and redacted transcript rows as
the trusted local web bridge; the backend keeps the remote session and caps.
Bidirectional UI Composition
A conventional GUI program opens a window and owns the controls inside it. A remote capOS session does not need to be that limited. The host app can expose a session-scoped UI host capability to capOS, and capOS-side services or agents can use that capability to propose a better interface for the current task:
- Paperclips can ask for counters, project controls, and status charts instead of printing lines.
- Chat can ask for a channel list, unread badges, and a message pane.
- Adventure can ask for a map pane, inventory slots, command buttons, and room transcript.
- A diagnostics agent can open log, metric, and trace panes side by side, highlight the relevant capability calls, and change density for a debugging session.
- A teaching or accessibility agent can request larger type, simplified controls, or a guided task layout for a particular session.
The authority is explicit and separate from service authority. Holding Chat
does not let a service rewrite the user’s UI. Holding RemoteUiHost or a
narrow UiSurface facet lets the service propose bounded UI changes for the
current remote session. The host app remains the compositor and policy
enforcer.
Conceptual shape:
enum UiPatchKind {
openSurface @0;
closeSurface @1;
updateModel @2;
setLayoutHint @3;
setThemeHint @4;
addCommand @5;
removeCommand @6;
}
struct UiSurfaceSpec {
surfaceId @0 :Data;
title @1 :Text;
kind @2 :Text;
safetyClass @3 :Text;
modelSchema @4 :UInt64;
}
struct UiPatch {
kind @0 :UiPatchKind;
surfaceId @1 :Data;
payload @2 :Data;
expiresAtMs @3 :UInt64;
}
struct UiEvent {
surfaceId @0 :Data;
command @1 :Text;
payload @2 :Data;
userInitiated @3 :Bool;
}
interface RemoteUiHost {
open @0 (spec :UiSurfaceSpec) -> (surface :RemoteUiSurface);
theme @1 (scope :Text, hints :Data) -> ();
}
interface RemoteUiSurface {
apply @0 (patch :UiPatch) -> ();
poll @1 (maxEvents :UInt16) -> (events :List(UiEvent));
close @2 () -> ();
}
The payloads above should become typed structs before implementation. They are
shown as Data only to keep the sketch short. The important boundary is that
UI updates are declarative patches and typed view models, not arbitrary host
code. The host validates the requested surface kind, model schema, command
set, theme tokens, data size, update rate, and safety class before rendering
anything.
This is still a remote CapSet client model:
host UI holds RemoteSession + RemoteCapSet
host UI grants a narrow RemoteUiHost/RemoteUiSurface cap to a trusted worker
capOS service or agent sends declarative UI patches through that cap
host UI renders and sends typed user events back
service effects still require ordinary service caps
The direction is therefore bidirectional but not symmetric. The host app can call capOS service caps. capOS can shape the session UI only through UI caps the host granted. Neither side gains ambient authority over the other.
Safety rules:
- Host chrome, login prompts, origin indicators, permission prompts, and emergency reset controls are reserved. capOS-rendered surfaces cannot spoof them.
- UI patches are session-scoped. Persistent layout/theme changes require an explicit profile/settings cap or user confirmation.
- Theme and look/feel changes use bounded tokens or validated design-system variables, not raw CSS injection.
- UI command descriptors are data; executing a command still calls a typed capability under the current session policy.
- The user can close, reset, or pin surfaces against agent rearrangement.
- UI updates are quota-bound and auditable when they materially affect workflow, consent, disclosure, or action execution.
- Browser front ends keep raw capOS caps server-side or in a Tauri/native Rust
backend. Browser JavaScript receives rendered state and sends user events; it
does not hold
RemoteCapSetentries.
This is the broader version of the WebShell idea. A web shell can be more than a terminal emulator: it can be a session workspace whose composition is negotiated by the capabilities present in the session. The terminal remains one surface in that workspace, not the only surface.
Authentication And Admission
Authentication adapters all produce the same output: a UserSession plus
profile inputs for the broker. They differ only in how the proof is obtained
and verified.
- Password: maps to the existing
SessionManager.login(method, selector, proof, source)path when remote password login is enabled by policy. It must use the existing credential failure/backoff/audit rules and must not be the only supported remote method. - Public key: maps to
SessionManager.sshPublicKeyor a generalized signature-auth method. SSH userauth and raw remote RPC public-key auth can share account/key records, but the transcript bytes must be domain-separated by protocol and channel binding. - OIDC/OAuth: device-code flow fits headless or CLI clients; authorization
code + PKCE fits browser-assisted clients. The OAuth/OIDC service verifies
ID tokens and maps external subjects through the user-identity admission
model before
SessionManagermints a session. - Passkey/WebAuthn: belongs behind the web-authenticator path. A remote native client may open a browser or use a platform authenticator, but raw authenticator secrets never become capOS app data.
- mTLS client certificate: TLS client-auth can identify a principal or pseudonymous subject through certificate policy. Certificate identity is an admission input; the resulting CapSet still comes from the broker.
- Guest and anonymous: explicit policy profiles. They are not fallbacks for
missing credentials and should receive short leases and narrow bundles. Guest
admission is currently surfaced through the bridge as an explicit
AuthMode::Guestoption (/api/login/guest, CLI--guest); the gateway enforces therequestedProfile == "guest"andprincipal.kind == Guestinvariants before broker dispatch via thevalidate_guest_admissionhelper, and refuses withRemoteErrorCode::AuthenticationDeniedand the redacted"guest login denied"message regardless of which policy branch fired. When the manifest has no guest seed account the gateway returnsRemoteErrorCode::DisabledAuthMethodso the bridge can distinguish a manifest-disabled method from a credential failure. Guest sessions surface only the configured display name ("Remote Guest") andprincipal_kindenum label to the bridge; the seeded principal id bytes are never disclosed through the bridge transcript or API envelope. - Service/workload credentials: future non-human clients can authenticate with OAuth client credentials, token exchange, mTLS, or signed workload assertions. They receive service-profile bundles, not human shell bundles.
Every method must record source metadata and protocol/channel binding appropriate to its transport. A successful proof selects a principal and session; it does not directly grant service authority.
Remote CapSet Semantics
A local process starts with a read-only CapSet page plus local cap-table
entries. A remote client instead receives a live RemoteCapSet object:
listreturns names, interface IDs, display metadata, and lease summaries.getreturns a typed RPC capability pointer only if the name exists and the expected interface ID matches.- The returned object is a proxy owned by the remote-session worker.
- Dropping the remote object releases the worker’s hold edge when no other remote references remain.
- Logout, expiry, revocation, disconnect, or worker shutdown breaks all session-bound proxy objects. The current DTO gateway implements kernel-backed explicit logout and owned-session connection teardown; full live proxy object-drop/revocation behavior remains future work.
This is still an actual session bundle. It is not a copy of the kernel’s local CapSet ABI. The remote representation exists because a Linux process has no capOS ring page, no capOS CapSet mapping, and no local cap table.
Invocation Context
Remote capability calls should look like ordinary calls to the target service:
remote client call
-> capnp-rpc message
-> per-session worker proxy
-> local capOS capability call
-> target service sees the worker's live session context
The remote client cannot choose service-visible subject identity. Request fields are ordinary data. If a service needs subject details, it uses the existing subject-disclosure policy: explicit request plus a matching service-scoped disclosure grant. By default it receives only the opaque service-scoped caller-session reference used by the session-bound invocation model.
Error And Lifetime Model
The remote path keeps the existing error split:
- Cap’n Proto RPC transport errors and broken connections become RPC exceptions or disconnected promises.
- Proxy/worker infrastructure failures become
CapException-like capability exceptions. - Domain outcomes remain schema result fields or unions.
- A missing cap name, interface mismatch, denied profile, stale session, or revoked lease is an observable denial, not a silent fallback to a broader service.
Open promises must fail when the remote session logs out or the connection is closed. The worker must release local caps on every close path.
Relationship To Shells And Gateways
Remote session CapSet clients are a peer of shell transports:
- Native shell: a local capOS process that uses its local CapSet and ring. It can later expose a schema-aware REPL over the same capabilities a remote client sees, but the remote client does not need to spawn a shell.
- SSH shell: a production CLI terminal transport. It authenticates and
launches
capos-shellwith aTerminalSession. It should not become the only way for external programs to call typed services. - WebShellGateway: browser terminal, webapp, and agent UI transport. Browser JavaScript must not receive raw capOS caps; the gateway can use the remote session CapSet model server-side and expose terminal frames, view models, command descriptors, or bounded tool requests to the browser. This is close to the same mental model as a “web shell”, except the shell is not the required protocol. The web UI can present service-specific controls over the same session CapSet, and capOS-side services can adjust the session workspace through UI composition caps. A remote CapSet web UI can be built before the full WebShellGateway by omitting terminal delegation, shell-runner policy, and agent execution; it is just another host client of the remote session bundle.
- Tauri or desktop GUI: the Rust/native backend may hold the remote
RemoteSessionand typed capability clients, while the UI layer receives rendered state, command descriptors, and user-intent events. The UI layer should not receive replayable capOS authority as data. The backend may grant narrow UI-surface caps back to capOS services so they can propose adaptive layouts without gaining arbitrary desktop control. - Agent shell: the agent runner holds session caps server-side and presents tool descriptors to the model. A hosted agent can use the same remote session bundle shape as long as actual capOS invocations remain in the trusted worker.
- Interactive command surfaces: command metadata can be one of the granted capabilities. A remote client can render command specs directly instead of scripting text through a shell.
Authority Rules
- The gateway receives scoped listener/TLS/auth/session/broker/audit authority, not raw broad network or spawn authority.
- Post-auth workers receive only the broker-issued remote-client bundle plus proxy lifecycle authority.
- Default remote bundles should be narrower than operator shell bundles.
- Raw
ProcessSpawner, unrestrictedNetworkManager, key-vault, credential store, broad account store, broad storage root, and host debug caps require explicit elevated policy. - Remote proxyable caps must declare transfer/lifetime policy. Local-only caps
may appear in a local shell CapSet without being exportable through
RemoteCapSet. - Capability names are lookup conveniences. Interface ID and broker policy define whether a returned object is usable for the requested type.
- Replayable handles are forbidden. Session IDs, grant IDs, endpoint metadata, object epochs, and proxy table positions are not bearer tokens.
Design Grounding
- Session-Bound Invocation Context defines the one-session-per-process invariant and privacy-preserving endpoint caller-session metadata.
- User Identity and Policy defines principals, sessions, profiles, admission sources, renewal, and brokered CapSet minting.
- Boot to Shell defines the existing
CredentialStore/SessionManager/AuthorityBrokerpath and non-password login directions. - SSH Shell Gateway, Certificates and TLS, and OIDC and OAuth2 define public-key, TLS/mTLS, and federated admission inputs.
- capos-service defines the service lifecycle shape needed for listener loops, per-session context, shutdown, drain, and metrics.
- Capability-Based Service Architecture
defines the broader service taxonomy, capability layering, and
init/spawn boundary the gateway, per-session workers, and restricted
service runner reuse. The default
make rungateway, the Adventure service-runner path, and the Paperclips Path B worker plumbing inherit the process-startup, attenuation, and HTTP-capability rules described there; Path C will extend the broker allowlist surface in the same authority frame. - Remote Session UI Security
defines the web-security posture for the loopback
remote-session-uibridge and its Tauri desktop wrapper – per-browserBrowserSessioncookies, CSRF/CSP/cookie discipline, first-wins ownership, local HTTP parser bounds, and Tauri capability minimization – that the trusted Rust backend in this proposal exposes to the browser. Both proposals reference each other; this proposal owns the upstream remote-session CapSet wire and host-client shape, while that proposal owns the browser-facing authority boundary. - R17 – Remote-session UI bridge and Tauri wrapper are research-only routes long-horizon residual risk (distributable packaging, desktop automation, non-loopback exposure) back to this proposal and the remote-session UI security proposal. Non-loopback remote-session UI exposure must remain blocked until that production posture is accepted by the corresponding review-finding task.
- Interactive Command Surfaces defines typed command sessions that can be rendered by remote clients.
- Browser Capability and Agent Web Sessions defines browser-side authority boundaries and gateway mediation for web UI sessions.
- Language Models and Agent Runtime defines agent runners, tool proxies, and browser-agent UI orchestration boundaries.
- Cloudflare, Cap’n Proto, Workers RPC, and Cap’n Web grounds production object-capability RPC, live object bindings, and remote resource-exhaustion discipline.
- Spritely, OCapN, and CapTP grounds distributed object-capability lifetime, promise, reference, and handoff questions while staying non-binding for capOS wire compatibility.
- Cap’n Proto Error Handling grounds the exception-versus-domain-result split that the host-backend facade and eventual gateway RPC transport must preserve.
Implementation Shape
The first implementation is deliberately small:
- Keep the existing
capnp-chat-interopservice and harness as the transport starting point, but rename the target outcome in planning docs to remote session CapSet interop. Done. - Add generated Linux Rust bindings for the relevant schema subset. Done.
- Add a host client library that connects through QEMU user TCP. Done with a
schema-framed DTO transport; replacing it with standard
capnp-rpcframing and live proxy objects remains the next transport step. - Add a capOS gateway that supports one policy-enabled auth method plus
explicit guest/anonymous behavior. Done for password, anonymous, and
guest, with disabled public-key, OIDC, and passkey/WebAuthn method
entries advertised. Guest admission ships with a dedicated
RemoteGatewayRequest.guestLoginarm, thevalidate_guest_admissionbroker-side enforcement helper that pins therequestedProfile == "guest"plusprincipal.kind == Guestinvariants, and aRemoteErrorCode::DisabledAuthMethodpath so the bridge can distinguish a manifest-disabled method from a credential failure. - Return remote session summary, CapSet list, and typed
getmetadata. Done as DTOs. - Call at least two capabilities from the bundle. Done for
session,system_info, the worker-backedChat.sendpath, and Adventurestatus/look/inventory/go(direction)afterserviceLaunch. The focused chat proof also shows a service-domain denial remains a schemachatSent(false)result and thatchat-serversees bounded session-bound caller metadata through disclosure policy. Broader Adventure methods, Paperclips methods, live proxy objects, and object-level release/drop lifecycle remain future work. - Prove a missing cap, wrong interface ID, wrong profile, stale session, and
logout path fail closed. Done for the focused proof, including a
kernel-backed
UserSession.logoutcall and owned-session disconnect propagation in the DTO gateway; full release, live proxy object-drop, renewal, and revocation propagation remains future work. - Add a first host UI client over the current UI-neutral Rust client. Done for
a trusted local web bridge with a loopback browser UI and Rust backend that
holds the remote session state. It covers endpoint configuration, auth
methods, login, session summary, CapSet inspection,
sessionInfo,systemMotd, denial probes, logout, stale-call proof, redacted transcript export, and a focused browser automation proof. The repo-local Tauri wrapper now checks or launches the same loopback backend/webview boundary; distributable packaging remains later. The UI remains separate from WebShell and does not include a terminal emulator, shell-runner policy, or agent execution. - Define the launch DTO/probe shape after the read-only remote service
catalog. Done: this slice defines a remote-safe launch request, launch
status, and side-effect-free probe so the CLI/web backend can render forms
and denials for Adventure/Paperclips profiles. It deliberately does not
start processes, create endpoint owners, attach caps, or expose raw
ProcessSpawner, process handles, endpoint owner handles, local cap ids, result-cap slots, or browser-held capOS capabilities. - Implement the actual restricted Adventure service-runner path. Done: the
default-manifest Adventure profile starts
adventure-serverplus simple NPC companion processes and attaches or retains the resulting Adventure/chat descriptors/caps in the backend-held remote CapSet. Paperclips landed in two halves: Path A added the read-sideRemotePaperclips*DTO schema (RemotePaperclipsCommandResult,RemotePaperclipsCommandList,RemotePaperclipsProjectList,RemotePaperclipsStatusSnapshot,RemotePaperclipsEvent,RemotePaperclipsProjectStatus,RemotePaperclipsEventKind, and the single-commandRemotePaperclipsCommandinput DTO) inschema/capos.capnp, with bounded wire-roundtrip coverage incapos-config/tests/remote_paperclips_dto_roundtrip.rs; Path B added the dedicateddemos/remote-session-paperclips-worker/crate mirroring the Adventure worker shape, the gatewaySessionWorkerKind::Paperclipsenum variant with matchingSessionWorkerSetarms andspawn_paperclips_graph/build_paperclips_service_launch/fill_paperclips_launcher/paperclips_catalog_statushelpers, a manifest-staticgameendpoint slot on the gateway capset, bridgeRequestKind::PaperclipsInitial/Command/Status/Projectssynthesis from cachedserviceLaunchstate (the on-wire control plane lands in Path C), UI launch slot plus status chip with paired smoke automation (paperclipsLaunchVisible/paperclipsStatus/paperclipsStatusObserved), thesystem-remote-session-paperclips.cuefocused manifest, and themake run-remote-session-paperclips-vm/make run-remote-session-paperclips-uigates. RawProcessSpawner, process owner handles, endpoint owner caps, local cap ids, result-cap slots, and browser-held capOS capabilities stay out of the remote contract. Process handles stay backend-local. Adventurestatus/look/inventorycontrols and first mutable boundedgo(direction)use the session-bound worker pattern; Paperclips Path B uses the same worker shape with bridge-side response synthesis until Path C lands the wire-level DTO arm and the broker allowlist grants for the default manifest. Broader Adventure controls, Path C wire/broker extension, and rich Paperclips client implementations remain later. - Replace the bounded
make remote-session-tauripreflight with the actual repo-local Tauri wrapper over the same Rust backend. Done for check/dev mode:CAPOS_REMOTE_SESSION_PORT=<printed-port> make remote-session-taurivalidates the wrapper scaffold and host prerequisites, andCAPOS_REMOTE_SESSION_TAURI_MODE=devlaunches the wrapper throughcargo tauri dev. Distributable packaging remains gated on reviewed sidecar/backend lifecycle handling. - Add the first typed proxy layer as a host-backend-only temporary
dual-stack. Done for
Chat:tools/remote-session-client/hosts a localcapnp-rpcfacade that translates backend-held proxy calls to the existing DTO gateway protocol while keeping schema/generated bindings, gateway wire shape, and browser authority unchanged. The later gateway rewrite must provide standardcapnp-rpcframing, typed remote proxy objects, exception mapping, release/drop handling, and resource bounds before the bespoke DTO service path can be retired. - Layer richer service clients on top of the same backend boundary. The
first richer client is a session-summary diff: a pure Rust helper in
tools/remote-session-client/src/session_diff.rscompares two snapshots of the session view (CapSet entries plusSessionInfoSummary) and returns typedCapSetDiff/SessionSummaryFieldDiffrecords keyed on(name, interface_id)and on visible session fields. Renewals or policy rebinding surface aspolicy_changedrather thanremoved+added. The trusted web bridge stores the raw snapshots backend-side and exposes/api/call/session-diff-refresh, which returns a redactedSessionSummaryDiffVm. Browser JavaScript receives only that view model: added/removed cap entries by(name, interfaceIdHex, transferPolicy, leaseExpiresAtMs), policy/lease changes, redacted session-id changes, and a summary string. The first call after login captures a baseline (hasBaseline=false); subsequent calls return the diff against the previous snapshot. The browser renders the diff in a dedicated “Last refresh diff” pane on the Session view; rawsession_id_hex, replayable cap handles, and kernel session ids stay backend-side. The focused UI smoke clicks “Refresh & Show Diff” twice and asserts both the no-baseline and post-baseline shapes. Two backend host tests cover the baseline + no-change path and the added-cap + expiry-change path. - Add a separate UI-composition proof only after the basic session proof:
grant a narrow test
RemoteUiSurface, accept one declarative patch, send one typed user event back, and prove the service cannot spoof trusted chrome or persist layout state without the relevant cap.
Later slices can add more auth adapters, TLS, renewal, browser-assisted auth, service credentials, UI composition surfaces, promise pipelining, and distributed GC.
Visual Design Handoff
The host UI visual language is anchored on two Claude Design handoffs:
- The original
capOS Loginbundle (delivered 2026-05-02 13:26 UTC). Only the CSS tokens and design intent were ported into the production UI; the prototype is not kept in-tree. - The
capOS Workspacebundle (delivered 2026-05-02, seetools/remote-session-client/ui/design-bundle/). Covers the post-login workspace shell, chat list, active group chat with embedded approval cards, active DM with E2E lock + fingerprint card, active call (collapsed banner + full-pane), stage room, and a “start sheet” with the four ocap-clean entry flows (open DM from contact card, redeem invite, browse directory, start ephemeral chat). This bundle IS kept in-tree as reference attools/remote-session-client/ui/design-bundle/and includes conversation transcripts, HTML prototypes, JSX components, and the unique theme assets. See itsCAPOS-INTEGRATION.mdfor the bundle-to-live-UI mapping and the iteration-7 prerequisite (CSP refactor + per-browser BrowserSession cookie before any inline scripts/styles from the prototype reach production).
Both bundles ship four themes (Space, Mountain, Light, Operator)
and a consistent token system (themes.jsx in either bundle is
authoritative for palette / typography / radii / blur). The
branding assets actually shipped under branding/ were copied
into tools/remote-session-client/ui/assets/ for the bridge to
serve; the prototype’s reference imagery is kept only in the
in-tree design-bundle directory.
What landed in tools/remote-session-client/ui/:
- Vanilla CSS rewrite of
styles.cssaround the design’s theme tokens. No React, no Babel, no third-party CDN script tags. Trust boundary stays intact: the loopback bridge serves only static assets. index.htmlrestructured to the design’s hero + auth-card + footer layout with mobile responsiveness, an Operator dashed inner frame (capos://authlabel), and the originaldata-testsurface fully preserved somake run-remote-session-capset-uistill passes.- A trusted-static feature flag block (
window.CAPOS_UI_FEATURES, overridable via?features=) gates surfaces that are scaffolded but not yet backed by the Rust gateway. Default flags match what the current backend honours.
Surfaces scaffolded but flag-gated off by default (no functional support in capOS yet; future tracks will wire them):
- Passkey sign-in (
?features=passkey). Tracksdocs/proposals/boot-to-shell-proposal.md(passkey/WebAuthn, credential setup) anddocs/proposals/cryptography-and-key-management-proposal.md. - OIDC / SSO providers (
?features=ssofor Google/GitHub/Okta). Tracksdocs/proposals/oidc-and-oauth2-proposal.md. The trusted Rust backend must own the provider integration; browser JavaScript must continue to receive only view models, results, and denials. - MFA second-factor step (
?features=mfa). Tracksdocs/proposals/boot-to-shell-proposal.md. The 6-digit input animates end-to-end as a UI demo today; production wiring is a future slice. - Success step (
?features=successStep). The current Rust backend transitions straight to the workspace on session start; the success card is design-parity scaffolding for a future mid-step surface. - Capability-grant consent strip. Removed from the design itself during iteration (the user concluded it demonstrated the wrong thing for capOS); kept in the deferred list because a future consent-on-grant flow for OAuth-style external identities would re-use the same visual language.
Surfaces flag-gated on by default but UI-only today (decorative state without a backend round-trip):
- System status pill, Region pill, Language pill, Footer, Hero panel, Remember-device checkbox, Forgot-password link, Password show/hide toggle.
Constraints the visual layer must keep across future slices:
- Login is a dedicated OS-like screen with a visible username field and
no full persistent technical header. Resource profile names such as
operatorare not user-typed system details. - Browser login sends username/password only. The username field is empty
by default: the browser UI does not pre-fill from
CAPOS_REMOTE_SESSION_USER, hostUSER, or any other host-local identity hint, because that would disclose account hints before authentication. - Authenticated users land in a compact Services-first workspace where Session, CapSet, Diagnostics, and Transcript are separate views. The UI smoke harness must continue to fail if any visible button is not exercised; new flag-gated buttons must stay hidden by default so the smoke surface does not grow without paired automation coverage.
- No third-party CDN script tags or runtime frameworks are added to the
trusted UI. Theme switching uses the existing
data-themeattribute on<html>/<body>; CSS variables flip the design tokens.
Proposal: SSH Shell Gateway
Production remote shell access for capOS using SSH as a terminal transport while preserving the native shell’s capability boundaries.
Status Split
Implemented:
- SSH-shaped authority prerequisites and fixture authentication proof: development-only sign-only host key, manifest-seeded authorized-key lookup, public-key session minting over fixture authentication bytes, unsupported feature policy/audit classification, restricted shell launcher, and a bounded host-local plain-TCP terminal-host proof.
UserSession.auditContextfails closed after logout (sameensure_session_liveguard asinfo());run-ssh-public-key-sessionproves pre-logout success, post-logout failure, idempotent second logout, and continued closed state.
Not implemented:
- encrypted SSH packet transport;
- OpenSSH-compatible key exchange and channel handling;
- full SSH userauth transcript validation;
- channel binding;
TerminalSessionFromByteStreamterminal-factory wiring;- OpenSSH harness.
Do not infer OpenSSH-compatible remote login from the current “partially implemented” status.
Remote and non-loopback deployment is blocked. The current proof uses development/fixture key material and host-local plaintext wiring for bounded authority checks; it is not a production SSH service. Before exposure beyond loopback, the implementation must have encrypted SSH transport, production host-key storage, durable authorized-key/account storage, full userauth transcript validation, channel binding, audit records for auth and shell launch, and a reviewed pre-auth/post-auth isolation story.
Problem
The Telnet Shell Demo described in
Networking proves that a remote TCP
connection can become a TerminalSession without granting the shell raw
network authority. That is the right capability boundary, but Telnet is
intentionally not a production remote access path. It has no encryption, no
host authentication, no replay protection, no key-based user authentication,
and no deployable security story beyond “host loopback in QEMU.” This
proposal is the production remote-shell successor to that loopback-only
research demo; the demo’s TerminalSession boundary survives, but its
plaintext transport does not.
capOS needs a production-oriented CLI remote shell that works with normal SSH clients while avoiding the Unix mistake of treating an SSH login as a raw remote root shell, ambient user id, inherited file descriptor set, or global filesystem entry point.
The SSH path should be a terminal host and session authenticator. It should not become a general-purpose privilege broker, TCP proxy, process supervisor, or substitute for the native shell’s capability model.
Relationship To Telnet
SSH reuses the Telnet Shell Demo’s core contract – the same
TerminalSession boundary Shell requires
for any terminal-backed capos-shell, and the same broker-issued shell
bundle Boot to Shell mints for a
fresh session:
- A gateway accepts TCP connections.
- The gateway owns transport framing and terminal-host behavior.
- The spawned
capos-shellreceives a cap namedterminalimplementingTerminalSession. - The shell receives the normal broker-issued shell bundle for the authenticated session.
- The shell does not receive raw
TcpSocket,NetworkManager, listener, broad process-spawn, private-key, authorized-key-store, or host-key authority.
The transport changes. Telnet handles plaintext option negotiation over a host-loopback QEMU forwarding rule. SSH handles version exchange, key exchange, host-key proof, encrypted packet framing, user authentication, session channels, PTY requests, window changes, shell requests, and clean channel teardown.
The security boundary does not change. The shell still sees only a terminal session and a scoped capability bundle.
SSH is not the only remote client model. It is the production terminal/CLI transport for operators who want an interactive shell. Programmatic clients should use the remote session CapSet path instead: authenticate through a session/admission method, receive a broker-issued remote CapSet view, and call provided capabilities over Cap’n Proto RPC without creating a shell. Public-key account records may feed both paths, but the authentication transcript bytes must be domain-separated by protocol and channel binding. See Remote Session CapSet Clients.
The first SSH implementation milestone is still host-local development. It should not silently inherit the Telnet demo’s trusted gateway compromise. Before implementation, the SSH path must either close the gateway authority gap with scoped listener and shell-only launcher grants, or explicitly preserve that gap in a task record as a host-local-only compromise while still proving that the spawned shell has no raw network, spawn, key, or SSH transport authority.
Pre-auth and post-auth shell flows must not share broad process/address-space authority for production exposure. Either split the authentication gateway and post-auth shell launcher into separate processes with narrow handoff caps, or produce a reviewable proof that the shared process cannot use pre-auth network, key, listener, or parser state as post-auth shell authority.
Scope
Initial SSH support is deliberately narrow:
- SSH-2 only, following the RFC 4251-4254 family at the protocol level.
- One interactive
sessionchannel per connection for the first proof. pty-req,window-change,shell, EOF, close, and disconnect handling.- Public-key user authentication first.
- Fresh random material for key exchange, rekey, padding, session identifiers,
and authentication challenges comes from
EntropySourceor a narrowed SSH transport-crypto service that ownsEntropySource; it is never ambient process state. - Password authentication only if it is wired to the existing
CredentialStorefailure/backoff path and policy explicitly enables it. - No port forwarding, agent forwarding, X11 forwarding, SFTP, SCP, subsystem requests, exec requests, direct TCP forwarding, or arbitrary environment import in the first milestone.
Those excluded SSH features are not harmless defaults. In capOS they require their own capabilities, policy, accounting, and audit records before exposure.
Components
flowchart TD
Client[SSH client] -->|TCP 22| Gateway[SshGateway]
Gateway --> HostKey[SshHostKey cap]
Gateway --> Keys[AuthorizedKeyStore]
Gateway --> Sessions[SessionManager]
Gateway --> Broker[AuthorityBroker]
Gateway --> Launcher[RestrictedShellLauncher]
Gateway --> Listen[TcpListenAuthority]
Gateway --> Audit[AuditLog]
Keys --> Sessions
Sessions --> Broker
Broker --> Bundle[Scoped shell bundle]
Gateway --> Terminal[SSH-backed TerminalSession]
Launcher --> Shell[capos-shell]
Terminal --> Shell
Bundle --> Shell
SshGateway is the only component exposed to the network. It is an ordinary
userspace service once the socket capability path can support it. During an
early implementation it may wrap the same in-kernel TCP capabilities used by
Telnet; a later decomposed-network stack should not change the shell contract.
The schema-level gateway contract is intentionally small: status and shutdown
methods identify the service surface without granting child shell authority.
SshHostKey is a sign-only private-key capability. It should be backed by
the PrivateKey/KeyVault model from
Cryptography and Key Management:
the gateway can sign the SSH exchange hash but cannot export private key
material, enumerate unrelated keys, or administer the vault.
AuthorizedKeyStore maps an SSH public key to a principal and authentication
policy. It stores public key material and policy metadata, not shell
authority. OpenSSH-format public keys are bytes imported into a verifier path,
matching the crypto proposal’s PublicKeyFormat.opensshWire escape hatch for
public material. The initial schema returns an SshAuthorizedKeyDecision with
principal/profile metadata and an audit reason; actual shell authority still
comes from SessionManager and AuthorityBroker.
TerminalSession is backed by the SSH channel. The gateway translates channel
data, EOF, close, PTY mode, and window-size events into the terminal host
contract. The schema names this construction surface SshTerminalFactory;
it returns a result-cap index for the SSH-backed TerminalSession. Password
prompts, hidden echo, cancellation, and teardown stay at that boundary.
TcpListenAuthority is the scoped listener grant shape for this milestone. It
can mint only the configured TcpListener rather than exposing raw
NetworkManager.createTcpListener for arbitrary ports.
RestrictedShellLauncher is narrower than the transitional
RestrictedLauncher: it launches only the native shell against a supplied
terminal/session context instead of accepting an arbitrary binary name. The
current kernel source is manifest-declared as restricted_shell_launcher; it
adds the child terminal, session, and stdio grants itself and accepts
only named capability-sourced pass-through grants for the reviewed shell
startup bundle (creds, sessions, audit, broker, and optional
system_info). Before spawn it verifies the supplied UserSession profile
matches the requested profile, and the focused proof shows the spawned native
shell running under that supplied session.
Authority Model
The gateway receives only the capabilities required for its job:
- TCP listen authority for the configured SSH port, preferably as a
manifest-declared
TcpListenerhandoff or scoped listener factory rather than rawNetworkManager. - Sign-only
SshHostKeyauthority for configured host-key algorithms. - Narrow
EntropySourceauthority, or anSshTransportCryptocap that owns entropy and exposes only SSH key-exchange, rekey, cipher/MAC, and random padding operations. - Read or verify authority over
AuthorizedKeyStore. SessionManagerauthority to mint a session after successful SSH authentication.AuthorityBrokerauthority to request the normal remote shell profile.- Restricted shell launch authority scoped to
capos-shell. - Pass-through grants required by the current shell startup path, such as
creds,sessions,audit, andbroker, where policy permits them. AuditLogappend authority for connection, authentication, launch, and teardown records.
In the production-shaped authority model, it does not receive:
- Broad
ProcessSpawnerauthority. - Raw
NetworkManager, outboundconnectTcp, or an arbitrary listener factory. - Key export or
KeyVaultadministrative authority. - Storage namespace authority except the narrow public-key records required
by
AuthorizedKeyStore. - SSH agent, port-forward, or subsystem authority unless later proposals add explicit caps for those surfaces.
A host-local development checkpoint may temporarily preserve raw
NetworkManager, arbitrary listener factory, or broad ProcessSpawner
authority in the gateway only if a task record captures the compromise and the
harness proves it does not cross the shell boundary. The spawned shell must
never receive raw NetworkManager, TcpListener, TcpSocket,
ProcessSpawner, SSH transport, host-key, authorized-key-store, key-vault, or
general-purpose entropy authority.
Identity metadata is not authority. A login name, SSH username, key fingerprint, source IP, principal id, or profile label only becomes useful after a trusted service returns a capability bundle.
Authentication
Host authentication
The host key should be a narrow wrapper around a PrivateKey cap, constrained
to SSH host-key signing. Host keys are generated or imported through
KeyVault, opened through an explicit SealPolicy, and rotated through a
versioned host identity record. The gateway can sign the key exchange hash but
cannot export private material.
SSH transport keys are separate from the host key. Key exchange must use fresh entropy and the algorithm policy selected for the deployment. The baseline standards are RFC 4251-4254; extension negotiation and modern algorithm recommendations come from later SSH RFCs such as RFC 8308, RFC 8709, RFC 9142, and other updates recorded by the RFC Editor for the 4251-4254 family. The first implementation should pin a small reviewed algorithm set rather than accepting every algorithm a library exposes.
For development, a manifest-seeded host key may be acceptable only when the
manifest field, docs, and harness mark it as non-production. The current
development path uses kernelParams.sshDevelopmentHostKey with the required
label capos-development-only-ssh-host-key and the kernel source
ssh_development_host_key; the resulting cap exposes only public metadata and
signs bounded ssh-ed25519 exchange hashes with the manifest seed for QEMU
proof. make run-ssh-host-key verifies the signature against the configured
public key, proves wrong-algorithm denial, and checks that the development seed
and raw signature are not printed to proof logs. For deployment, host keys need
persistent storage, rotation policy, key-management-backed signing, and audit.
User public keys
Public-key login maps an accepted SSH public key to a principal record and authentication strength. The key record should include:
AuthorizedSshKey {
keyId
principalId
publicKey
algorithm
fingerprint
allowedProfiles
sourcePolicy
createdAtMs
disabledAtMs
comment
}
The current manifest-seeded prerequisites implement public key record loading,
generic authorization decisions, and a bounded session-mint bridge. The
AuthorizedKeyStore accepts ssh-ed25519 records with 32-byte public keys and
SHA-256 fingerprints, rejects duplicate ids and fingerprints, maps principals
to existing seed accounts, and denies disabled records. SessionManager
accepts bounded fixture authentication bytes/signatures for configured keys and
mints UserSession metadata with publicKey authentication strength; the
focused make run-ssh-public-key-auth proof also shows AuthorityBroker
denying a mismatched shell profile.
SessionManager.sshPublicKey consults the bootstrap RamAccountStore after
signature verification using lookup_by_principal. Non-Active account
statuses (Disabled, Locked, RecoveryOnly) and missing principals fail closed
before a session is minted, so a runtime account-store mutation cannot be
ignored by the SSH path even though authorized-key records carry their own
disabledAtMs flag. The bootstrap fallback (no account store wired) keeps the
seed-account validation contract: manifest validation guarantees every
authorized-key principal binds to an active seed account. The
run-ssh-public-key-session smoke also proves UserSession.auditContext
returns principal metadata before logout and fails closed with
ensure_session_live after explicit logout(), matching the same
fail-closed contract as info().
Each denial path emits a stable auth= audit code (no schema variant change).
The codes form the SSH gateway’s operator-visible audit contract:
ssh-public-key for success, ssh-key-unknown, ssh-key-disabled,
ssh-key-profile-not-allowed, ssh-bad-signature, ssh-account-missing,
ssh-account-disabled, ssh-account-locked, ssh-account-recovery-only,
ssh-account-lookup-failed, ssh-profile-kind-invalid,
ssh-profile-not-interactive, ssh-auth-bytes-invalid. Failed records keep
principal and profile blank by policy: the auth= code is the only
discriminator, so failed-auth lines cannot be used as a side channel to probe
for valid principal IDs.
This is still not a complete SSH public-key authentication exchange: no SSH
transport transcript, channel binding, or terminal factory is wired
end-to-end. A bounded plain-TCP terminal-host proof now reuses the configured
key fixture to mint a public-key session and launch capos-shell through
RestrictedShellLauncher, but that proof is not an encrypted SSH transport or
OpenSSH userauth exchange. End-to-end QEMU proof of the
ssh-account-disabled/ssh-account-locked paths requires an
AccountStoreManagerCap kernel cap source so a demo can mutate account state
at runtime; that is tracked in the local-users management backlog and is not
required by the bounded host-local SSH gateway proofs.
Cloud metadata may seed initial authorized keys through the cloud-bootstrap
path, but those keys are input to AuthorizedKeyStore, not ambient login
authority. A metadata-provided key still needs an account/profile mapping and
should be auditable as cloud-seeded material.
Passwords and step-up
Password authentication over SSH is optional and should be disabled unless
CredentialStore can enforce the same generic failure text, bounded backoff,
rate limits, and audit behavior as the local shell. Keyboard-interactive can
later drive step-up prompts, but it should not be the first implementation
unless a concrete policy needs it.
SSH Channel Policy
The first gateway accepts only session channels that request an interactive
shell. It rejects:
execrequests.subsystemrequests such as SFTP.- agent forwarding.
- TCP forwarding and reverse forwarding.
- X11 forwarding.
- environment variables except a small reviewed allow-list, if any.
- more than one active shell channel per connection.
Each rejected request should produce an SSH protocol failure plus an audit record with a reason code. The audit record should not include command lines, environment dumps, key material, or terminal content.
The current bounded policy surface is capos-config::ssh_policy. It allows
public-key auth, one session channel, PTY, window-change, and a first shell
request. It denies disabled password auth, exec, subsystem/SFTP, direct TCP/IP,
TCP/IP forwarding and cancellation, agent forwarding, X11 forwarding,
environment import, second session-channel opens, and second shell channels.
Password auth has no policy allow path in this proof; it stays denied until a
real CredentialStore verifier, backoff, and audit path is wired into the
gateway. Denials return only a protocol failure class and a stable audit
reason code; request payloads such as command text and environment values are
not part of the decision data.
Implementation Slices
The final OpenSSH proof should not land as one opaque SSH server commit. Keep the implementation reviewable by landing these slices in order:
- Version exchange. A bootable
ssh-gatewayservice accepts one host-local OpenSSH TCP connection, exchanges RFC 4253 identification strings, records only sanitized client software/version metadata, and disconnects before key exchange without launching a shell. The compatibility harness uses/usr/bin/ssh; malformed and overlong client identification strings are covered by a separate low-level hostile TCP/banner fixture. - KEXINIT and algorithm selection. Parse KEXINIT, select exactly one reviewed development algorithm set, and disconnect on unsupported algorithms. Algorithm names are transport policy inputs, not authority.
- Development key exchange. Complete the host-local encrypted transport by
deriving traffic keys from the negotiated KEX shared secret, exchange hash,
and session id per RFC 4253. Entropy supplies ephemeral KEX material,
padding, and challenges, not direct session-key bytes. Call
SshHostKey.signExchangeHashand prove no private host-key or raw entropy material reaches logs or child shell grants. - Public-key userauth. Bind the OpenSSH public-key userauth transcript to
SessionManager.sshPublicKey, accept the configured key, deny unknown keys generically, and keep password auth disabled until a real verifier/backoff path is wired. - Channel policy. Route session open, PTY, window-change, shell, exec,
subsystem, forwarding, agent, X11, environment, and second-channel requests
through
capos-config::ssh_policy, producing protocol-visible failures and sanitized audit reason codes for denied features. - SSH-backed terminal launch. Replace the plain-TCP terminal-host proof
with an SSH channel-backed
TerminalSession, launchcapos-shellthroughRestrictedShellLauncher, runsession,caps, andexitvia OpenSSH, and prove cleanup for both client disconnect and shell exit.
Resource And Teardown Rules
SSH exposes several resource boundaries before the shell even starts: handshake CPU, pending connections, packet buffers, channels, PTY state, terminal buffers, authentication attempts, and live shell processes.
The gateway must have fixed per-connection bounds and fail closed when they are exceeded. Disconnect, TCP close, SSH channel close, failed authentication, session expiration, shell exit, and gateway teardown must all release the same resources:
- accepted socket,
- SSH connection state,
- terminal session object,
- spawned shell handle,
- broker-issued grants,
- authentication challenge state,
- audit correlation record.
Shell exit should close the SSH channel. Client disconnect should close the
terminal and let the shell observe the normal TerminalSession close path.
Exit Criteria
The first SSH milestone is complete when:
SshGateway, host-key, authorized-key, and SSH-backed terminal contracts are documented in schema/design form.- The development host-key path is available only through an explicitly
non-production manifest field and a narrow
SshHostKeycap; production signing remains blocked on key management and persistent storage. - A manifest can start an SSH gateway with only scoped TCP listen, host-key, authorized-key, session, broker, audit, and restricted shell-launch grants, or the remaining host-local demo compromise is explicitly preserved in a task record.
- The gateway accepts a normal OpenSSH client on a host-local QEMU forwarded
port, authenticates one public key, spawns
capos-shellwith aTerminalSession, runs one command, and disconnects cleanly. - The harness proves denied password login when disabled, denied port forwarding, denied subsystem requests, rejected unknown keys, and cleanup after client disconnect.
- The harness proves unavailable entropy or disabled KEX algorithms fail closed before authentication or shell launch.
- Documentation states which parts are development-only and which are acceptable for production deployment.
Dependencies
- Telnet Shell Demo from Networking for
the socket-backed
TerminalSessionproof this gateway succeeds. TerminalSessionFromByteStreamas a shared prerequisite for SSH channel and TLS/mTLS-backed remote terminals. SSH channel data is not a connectedTcpSocket; it must enter the same terminal factory used by Telnet-over-TLS – whose certificate, trust store, ACME, and pinning model lives in Certificates and TLS – so line discipline, echo policy, IAC handling where relevant, close semantics, and hidden password behavior do not fork by transport.- Cryptography and key-management primitives for sign-only host keys.
EntropySourceor a narrowed SSH transport-crypto service for key exchange, rekey, packet padding, and challenge freshness.- User identity, account, and session policy records for
AuthorizedKeyStoreprincipal/profile mapping. - System-monitoring audit records for remote authentication, denied SSH features, launch decisions, and teardown.
- Resource accounting for connection, channel, and shell-process limits.
- Persistent storage before production host keys and authorized keys can survive reboot safely.
Remote-shell ingress should land in this order:
TerminalSessionFromByteStreamand shared terminal line/echo/hidden-input discipline.- A transport-neutral byte-stream terminal factory used by both SSH channel data and TLS/mTLS cleartext byte streams.
- Either Telnet-over-TLS or SSH may land first, but neither should fork terminal semantics.
- Production deployment profile chooses SSH for familiar operator CLI access and TLS/mTLS, configured through Certificates and TLS, for PKI-integrated service/operator environments.
No more SSH terminal transport work should land until the shared prerequisite exists and has proof coverage for byte-identical hidden password behavior, line/IAC factoring, and repeated close/reconnect behavior.
Grounding
This proposal relies on these in-tree design documents and research notes:
- Networking for the Telnet Shell Demo this gateway succeeds and the TCP capability path the SSH listener reuses.
- Shell for the native
capos-shelland theTerminalSessionboundary every remote-shell transport must preserve. - Boot to Shell for
CredentialStore,SessionManager,AuthorityBroker,RestrictedShellLauncher, andEntropySource, including the bounded SSH terminal-host proof that already lands inside that flow. - Certificates, TLS, and Certificate Transparency for the TLS/mTLS counterpart transport profile and the shared certificate, trust-store, and pinning model the Telnet-over-TLS factory consumes.
- Cryptography and Key Management
for
PrivateKey,PublicKeyFormat.opensshWire,KeyVault, andSealPolicy. - User Identity and Policy for principal/account/session/profile semantics.
- Resource Accounting and Quotas for listener, socket, channel, packet-buffer, and shell-process bounds.
- System Monitoring for audit record shape and retention boundaries.
- Storage and Naming for the capability-native storage model needed before production host keys and authorized keys become durable.
- Trust Boundaries for remote-shell ingress review criteria.
- Local Users Management Backlog for account, role, and RAM-store sequencing that feeds authorized-key principal mapping.
- Genode Research for the session-factory precedent: clients request narrowed sessions from authority-bearing components instead of receiving broad factories directly.
- Pingora Research for the listener/service/runtime split that informs keeping TCP listener setup separate from application shell authority.
External standards grounding starts from RFC 4251, RFC 4252, RFC 4253, and RFC 4254. Later SSH algorithm and extension updates, including RFC 8308, RFC 8709, and RFC 9142, must be checked when choosing the implementation’s accepted algorithm set.
Non-Goals
- Replacing the native shell with a POSIX shell.
- Treating SSH username or Unix UID as authority.
- Ambient home directories, inherited file descriptors, or global paths.
- SSH agent forwarding as a shortcut to key authority.
- SFTP/SCP as a storage API before scoped file/storage capabilities exist.
- Port forwarding before explicit network-proxy capabilities and policy exist.
Proposal: Telnet over TLS Shell
An optional remote-shell path that wraps the existing Telnet
TerminalSession handoff in TLS 1.3. It is not the default production access
interface and should not be prioritized ahead of
SSH Shell Gateway for operator CLI access or
WebShellGateway for browser/agent workflows. Its value is narrower: service
terminals, compatibility environments that already standardize on TLS client
certificates, and deployments that want a small terminal protocol over the
project’s certificate/TLS stack.
This proposal sits on three load-bearing siblings and should be read with
them: Networking for the socket capability
surface (now served by the userspace network stack; the kernel
SocketTerminalSession shim and the kernel socket owner behind it are
retired), the host-loopback exposure rule, and the trust-boundary-debt
paragraph this proposal must not extend;
Certificates and TLS for
TlsServerConfig, CertVerifier, TrustStore, CertificateStore.watch,
Issuer/ACME, and rotation-without-restart; and
SSH Shell Gateway for the shared
TerminalSession-factory contract, the
RestrictedShellLauncher/SessionManager/AuthorityBroker boundary, and
the staged “transport verifies, SessionManager authorizes” split this
proposal’s mTLS path mirrors. None of those siblings depends on this
proposal; this proposal depends on all three.
Why Both, And Why Not Just SSH
The networking proposal and docs/status.md correctly call out
plaintext Telnet on 127.0.0.1:2323 as a loopback-only research demo
and name the SSH Shell Gateway as the production remote-shell successor
to that demo. That comparison is between SSH and plaintext Telnet, not
between SSH and TLS-wrapped Telnet. Once TLS is in the picture the
operational tradeoffs change, and capOS has good reasons to expose both
paths in production.
| Aspect | SSH Shell Gateway | Telnet over TLS |
|---|---|---|
| Transport | Bespoke SSH-2 binary packet protocol; the gateway parses KEXINIT, channel-open, requests, etc. | TLS 1.3 record layer plus a Telnet IAC stream consumed by the cleartext-pair TerminalSession factory through a shared state-machine module (the retired kernel SocketTerminalSession previously played this role for plaintext TCP). |
| Protocol surface inside capOS | Whole SSH message set must be parsed and reviewed even when only one channel is allowed; algorithm-policy parser; channel/forwarding/agent/X11/subsystem reject paths. | TLS handshake (rustls or equivalent) + the shared Telnet IAC state machine already implemented for the plaintext path. The gateway itself is mostly handshake plumbing and does not parse IAC. |
| User auth | Public-key built into the protocol (SshHostKey, AuthorizedKeyStore, SessionManager.sshPublicKey). Password optional and gated. | Two paths: passwords through the local CredentialStore flow, or mTLS client certificates verified against a TrustStore and mapped through SessionManager.tlsClientCert. |
| Identity model | SSH key fingerprints, principal records keyed off public-key bytes, custom rotation/audit story. | X.509 with subjects/SANs and the project’s existing certs/TLS proposal: ACME issuance, CT, OCSP, CertificateStore.watch, mTLS, pinning, name constraints. Rotation and revocation share infrastructure with everything else TLS. |
| Ecosystem leverage | New SSH-only operational track: authorized_keys, host-key custody, fingerprint pinning, key rotation tooling. | Reuses the PKI/ACME track that capOS already needs for cloud KMS HTTPS, the web-shell gateway, mTLS between services, monitoring egress, OIDC, and audit. |
| Client population | OpenSSH and friends; familiar to operators. | Standard TLS-capable telnet clients (telnets:// on port 992) and openssl s_client piping into telnet’s IAC discipline; clients are scarcer than for SSH but tooling exists. |
| Composability | One protocol; one auth model. | Transport identity (server cert, optional client cert) is orthogonal to user auth; deployments can layer mTLS-issued client identity and a passkey/OIDC step-up if they want, without inventing protocol extensions. |
| Best fit | Operator CLI access where SSH client tooling and public-key login are the right defaults. | Workload-to-workload terminal access between services that already speak TLS, deployments that prefer corporate-CA client certs for identity, browser/web bridges that already terminate TLS, and any environment where minimising new protocol surface is a security goal. |
The paths are complementary, but not equal priorities. SSH remains the main operator CLI target; WebShell remains the main browser/agent target. A production capOS deployment can expose Telnet over TLS when its operational stack wants certificate-based terminal access, but the roadmap should treat it as an optional follow-up after certificate/TLS, durable identity, session lifecycle, audit, and listener-authority work are already credible.
Scope
The first milestone is deliberately narrow but production-shaped from day one — there is no separate “demo-only” gate after which the proposal pivots to a different cert custody story:
- TLS 1.3 only. Implicit-TLS variant: TLS handshake first, then a normal Telnet byte stream over the established TLS record layer. IANA-registered port 992 (“telnets”) is the default; deployments pick their own port via the manifest.
- One interactive
TerminalSessionper connection. - Server-side TLS always; mTLS client certificates as the recommended
user-auth path, with passwords through
CredentialStoreas the fallback for deployments that have not provisioned client certs yet. - Algorithm policy is a single reviewed set: one ECDHE group
(
x25519), one signature algorithm pair (ed25519leaf,ed25519orecdsa_p256root acceptable), one AEAD cipher suite (TLS_CHACHA20_POLY1305_SHA256orTLS_AES_128_GCM_SHA256). No downgrade negotiation surface, no TLS 1.2 fallback. - Cert and key custody routes through the cert proposal’s
KeyVault/CertificateStore/CertVerifier/TrustStorecaps from the start. The QEMU smoke uses a manifest-seeded development leaf exactly the same way ACME issuance would: throughKeyVaultimport andCertificateStore.put. There is no separate dev-only signing surface that production has to retire. - Telnet IAC handling is owned by the
TerminalSessionimplementation that receives the byte stream — the cleartext- byte-pairTerminalSessionfactory. IAC handling is a userspace concern in every path: the kernelSocketTerminalSessionthat once terminated the plaintext-TCP path is retired. The TLS gateway terminates TLS, hands a cleartext byte pair to the factory, and itself parses no IAC. See The Userspace-CleartextTerminalSessionFactory for the ownership rule.
Out of scope (with reasons recorded in Considered Alternatives where the question is liable to come up again):
- STARTTLS-via-Telnet-options upgrade.
- TLS 1.2, RSA key exchange, non-AEAD ciphers, compression, externally provided session-ticket keys, multi-cipher policy.
- SSH-style channel multiplexing, port forwarding, agent forwarding, X11, subsystem requests.
- In-kernel TLS termination.
Components
flowchart TD
Client[telnets:// client / openssl s_client] -->|TCP + TLS 1.3| Gateway[telnet-tls-gateway]
Gateway --> Listen[TcpListenAuthority badge 992]
Gateway --> TlsCfg[TlsServerConfig cap]
TlsCfg --> Key[PrivateKey sign-only]
TlsCfg --> ChainSrc[CertificateStore.watch]
TlsCfg --> Verifier[Optional CertVerifier + TrustStore for mTLS]
Verifier --> ClientTrust[TrustStore for client CA]
Gateway --> Sessions[SessionManager]
Gateway --> Broker[AuthorityBroker]
Gateway --> Launcher[RestrictedShellLauncher]
Gateway --> Audit[AuditLog]
Gateway --> Factory[TerminalSessionFromByteStream]
Factory --> Term[Cleartext-backed TerminalSession]
Launcher --> Shell[capos-shell]
Term --> Shell
Sessions --> Broker
Broker --> Bundle[Scoped shell bundle]
Bundle --> Shell
The shape mirrors the SSH proposal one-for-one. Only the transport authority changes.
telnet-tls-gateway is the only network-facing component. It owns:
- The TCP listener, acquired through a manifest-declared
TcpListenAuthoritywhose badge is the configured TLS port (992 in the default manifest). - TLS 1.3 server-side handshake and record layer, composed from a
TlsServerConfigcap. The gateway never sees the rawPrivateKey; the TLS config encapsulates sign-only key authority, the certificate-chain source, the optional client-cert verifier, and the algorithm policy. - The cleartext byte pair (read half + write half) produced by the
TLS layer, immediately handed to
TerminalSessionFromByteStreamafter the handshake completes. The gateway implements no Telnet, echo, line-discipline, or terminal logic.
TlsServerConfig, TrustStore, CertVerifier,
CertificateStore.watch, KeyVault, SealPolicy, EntropySource are
not defined here. They are the caps the
Certificates and TLS and
Cryptography and Key Management
proposals already specify.
RestrictedShellLauncher and the broker/session/credential plumbing
are unchanged from the plaintext demo and the SSH proposal. The
spawned capos-shell receives only terminal, child-local stdio,
and the broker-issued shell bundle (session, creds, sessions,
audit, broker, optional shell_config). It does not see TLS,
certificate, trust-store, key, listener, raw socket, or
gateway-protocol authority.
The Userspace-Cleartext TerminalSession Factory
The retired plaintext demo terminated the byte stream inside the kernel:
TcpSocket.intoTerminalSession consumed a connected accepted socket
and returned a move-only TerminalSession backed by the kernel
SocketTerminalSession cooked-mode shim (line discipline, password echo
policy, CRLF normalization). That shim is removed: the kernel socket
owner behind it was retired by the userspace network-stack migration,
and TcpSocket.intoTerminalSession now fails closed in every dispatch
path. There is no kernel-terminated terminal byte stream any more; a
network-backed TerminalSession must be constructed in userspace.
The kernel model never extended to TLS anyway. TLS termination must be userspace — adding rustls to the kernel would be a substantial expansion of the in-kernel networking surface, exactly the expansion the networking proposal’s “trust-boundary debt” paragraph forbids. The kernel TCP path was acceptable because the bytes already crossed that boundary; TLS records do not.
This means TLS-backed remote shells need a different TerminalSession
construction surface that consumes a userspace-owned bidirectional
cleartext byte pair. Sketch:
interface ByteStreamPair {
inbound @0 () -> (rx :ByteStream);
outbound @1 () -> (tx :ByteStream);
closeHint @2 () -> (hint :CloseHint);
}
interface TerminalSessionFromByteStream {
wrap @0 (pair :ByteStreamPair, options :TerminalLineOptions)
-> (term :TerminalSession);
}
Line discipline (cooked vs raw, password echo policy, paste handling,
CRLF state) belongs inside the implementation of wrap, not in the
gateway. The implementation must:
- Preserve the same
LineEcho::Hiddensemantics the retired kernelSocketTerminalSessionenforced (the cooked-mode line discipline survives as host-testedcapos_lib::line_discipline), including the fix history captured in the Telnet IAC handoff commits. - Keep the spawned shell’s view of
TerminalSessionbyte-identical to the UART terminal path. The shell must not need to care about the transport. - Treat partial reads, partial writes, peer close, and TLS
close_notifyas ordinaryTerminalSessionclose events, not transport-specific errors leaking to the shell. - Own Telnet IAC handling for the cleartext byte pair. IAC ownership is wholly a userspace concern: no kernel component terminates a network byte stream any more, and the cleartext bytes reach the terminal only through this factory. The IAC state machine (option negotiation, suppress-go-ahead, echo policy, the NUL-prefixed-password and staircase-output fix history) belongs in a shared userspace module the gateways and the cleartext-pair factory call into, so neither path forks the byte rules and no IAC parsing returns to the kernel.
This factory is also what the SSH Shell Gateway needs: SSH
channel-backed terminals are not connected TcpSockets either. This
proposal therefore defines the surface in a transport-neutral way.
Whichever of SSH or Telnet-over-TLS lands first will deliver
TerminalSessionFromByteStream, and the other reuses it.
Authority Model
The gateway receives only the capabilities required for its job:
TcpListenAuthoritywhose badge is the configured TLS port. Mints exactly oneTcpListenerfor that port and nothing else; rawNetworkManager.createTcpListeneris not granted.TlsServerConfigfor TLS server-side handshake. Not the underlyingPrivateKey,KeyVault,CertificateStoreadministrative surface, orTrustStoremutation.EntropySource, or a narrowedTlsTransportCryptocap that owns entropy and exposes only TLS handshake, record-layer, rekey, and random-padding operations. Random material for handshake nonces, key derivation, and record nonces never comes from ambient process state.TerminalSessionFromByteStreamfor the cleartext-backed terminal.SessionManagerto mint a session at handoff: anonymous in the password path,tlsClientCertin the mTLS path.AuthorityBrokerto request the normal shell bundle profile.RestrictedShellLauncherto spawncapos-shellwith the supplied session and the reviewed pass-through grants only.AuditLogappend authority for connection, handshake outcome, authentication outcome, shell launch, and teardown records. Audit records carry stable reason codes; they do not carry private key material, certificate private parts, raw entropy, decrypted password bytes, or terminal content.
It explicitly does not receive:
- Raw
NetworkManager, rawTcpListenerfactories beyond the configured port, outboundconnectTcp, or any UDP/ICMP authority. - Raw
PrivateKeyaccess,KeyVaultadministration, key generation, key export, or certificate issuance. CertificateStoremutation or trust-store administration. The gateway consumes aTrustStorefor client-cert verification; it cannot add or remove anchors.- Broad
ProcessSpawnerauthority. Shell launch goes throughRestrictedShellLauncheronly. CredentialStoreauthority, and no parsing, logging, audit, or storage authority for credential bytes. The gateway necessarily has plaintext password bytes in its TLS-record and cleartext-pair buffers while a record is being consumed (see the password fallback section’s TCB note); it does not runCredentialStoreverification, does not interpret those bytes as credentials, and does not retain them.capos-shellhandlesloginexactly as on the local console.- Any kernel-internal or system-wide
TerminalSessionfactory beyond the cleartext-byte-pair construction surface.
The spawned shell does not gain TLS, certificate, trust-store, key,
network, listener, raw socket, or gateway-protocol authority. The
boundary the plaintext demo proves with caps is preserved verbatim.
Authentication
Server identity
Server identity is asserted through the leaf certificate carried by
TlsServerConfig. Custody routes through the cert/key proposals from
the start:
- The leaf private key is a
KeyVault-backed sign-onlyPrivateKeycap under an explicitSealPolicyallowing only TLS server-side signing. - The leaf chain is produced through whichever issuance path the
deployment uses: ACME for internet-facing endpoints, manifest-issued
for development and air-gapped, internal-CA-issued for corporate
fleets. The cert proposal’s
Issuer/Acmeinterfaces are the source of truth. - Rotation lands through
CertificateStore.watch.TlsServerConfigre-derives its leaf for the next handshake; existing TLS sessions finish on the old chain. No gateway restart, no SIGHUP, no filesystem signaling.
The QEMU development manifest seeds a leaf and key through the same
cap surface — the cert is imported into KeyVault and
CertificateStore, not exposed through a parallel “dev-only” signing
cap. Smoke harnesses pin the development leaf by SHA-256 SPKI; deploys
pin or trust through their normal trust path.
User authentication: mTLS path (recommended)
The recommended production user-auth path is mTLS:
TlsServerConfig.clientVerifierreturns aCertVerifierplus aTrustStoreof acceptable client CAs, scoped to the deployment.- The TLS handshake requires a client certificate. The gateway
verifies it through
CertVerifier.verifyChainagainst the client-CATrustStore, with name constraints, EKU (clientAuth), and revocation status enforced by the verifier policy. - On success, the gateway hands the verified leaf to
SessionManager.tlsClientCert(a new mint path mirroringsshPublicKey). The session manager maps subject/SAN/fingerprint to a principal record and allowed shell profiles, and mints aUserSessionwithtlsClientCertauthentication strength. AuthorityBrokerissues the shell bundle for the matched profile;RestrictedShellLauncherspawnscapos-shellwith that bundle and the cleartext-backed terminal.
The session manager’s mapping is intentionally explicit. A verified
client cert proves “this private key signed this handshake,” not
“this is user X.” Mapping subject/SAN to a principal is a separate
authorization step that lives in SessionManager, exactly as
AuthorizedKeyStore does for SSH public keys. Anonymous holders of a
trusted cert do not silently become privileged accounts.
mTLS user auth fails closed without ever reaching the shell. The
failure path is staged so transport verification and authorization
stay distinct, mirroring how SSH AuthorizedKeyStore and
SessionManager.sshPublicKey separate “key signature is valid”
from “key maps to a principal”:
- A client cert that fails the TLS trust path — untrusted issuer,
expired, revoked, signature invalid, name constraint violation,
missing
clientAuthEKU — ends with a TLS handshake alert. No authorization step runs andSessionManager.tlsClientCertis never called. - A client cert that successfully verifies through the configured
CertVerifierbut maps to no principal record causes aSessionManager.tlsClientCertdeny with a sanitized audit reason code, before any shell launch. Verified-but-unmapped certs are an authorization failure, not a transport failure, and must not be collapsed into the TLS alert above. - A profile mismatch between the requested shell bundle and the
mapped principal’s allowed profiles causes an
AuthorityBrokerdeny, again before launch.
User authentication: password fallback
Deployments that have not yet provisioned client certs use the existing local-shell path:
-
The TLS handshake completes with no client certificate (or with a client cert that the deployment has explicitly marked “transport-only”), and the gateway mints an anonymous session.
-
RestrictedShellLauncherspawnscapos-shell, which printslogin:and runslogin/setupagainstCredentialStorewith the same generic-failure / bounded-backoff / audit policy used on the local UART console. -
Password bytes are
LineEcho::Hiddeninput through the terminal session. The gateway implements no Telnet, line-discipline, or credential parsing of plaintext beyond moving bytes between the TLS record layer and the cleartext byte pair, and never logs password bytes or includes them in audit records or proof transcripts.Plaintext password bytes do exist in gateway-mapped TLS record-layer buffers and in the cleartext byte pair while the record is being consumed; that is unavoidable for any in-process TLS terminator and must be acknowledged honestly. The gateway is therefore part of the password-fallback TCB, comparable to the way the retired kernel
SocketTerminalSessionwas part of the plaintext demo’s TCB. The mTLS path is preferred precisely because it does not put password bytes on the wire or through the gateway in the first place.
This is weaker than mTLS but the trust boundary is no larger than
the local console: the kernel TCB plus one terminator-shaped
component (the gateway here, the kernel UART TerminalSession for
the local console). It exists so deployments can ship
Telnet-over-TLS before completing client-cert provisioning, not as
a recommended end state.
Step-up paths (future)
Deployments may want to combine transport-level identity (mTLS) with
an additional human factor (passkey, OIDC, TOTP). Step-up is the
shell’s responsibility, not the gateway’s: capos-shell gains a
stepUp command in a separate proposal, the gateway does not
short-circuit it. Treating mTLS plus passkey as orthogonal layers is
one of the reasons this path exists alongside SSH at all.
Considered Alternatives
STARTTLS via Telnet options
Rejected. Three reasons, in decreasing order of weight:
- No mainline client support. Generic Telnet+STARTTLS has no
IETF-standardised binding. RFC 2941 (Telnet Authentication
Option) and RFC 2946 (Telnet Data Encryption Option) are generic
frameworks; the only concrete TLS binding lives in TN3270E
(mainframe terminal emulators such as x3270, IBM Personal
Communications, and Vista TN3270). BSD/netkit telnet — the
standard Linux client and the one capOS already harnesses — does
not speak it. GNU inetutils telnet, the Microsoft Windows telnet
client, and PuTTY do not speak it. Targeting STARTTLS would
commit capOS to a TN3270E-shaped client population it has no
reason to address, while excluding the implicit-TLS clients that
do exist (
telnets://,openssl s_client, modern TLS-capable telnet implementations). - Pre-handshake plaintext window. STARTTLS requires plaintext IAC option exchange before TLS. That window leaks client identity, supports active downgrade attacks (server claims STARTTLS support is unavailable, expecting cleartext fallback), and complicates audit (where does “I refused to start TLS” log, and how does the server distinguish a legitimate non-TLS client from a downgrade attempt?).
- Forces pre-handshake protocol parsing into the gateway. STARTTLS requires the gateway to parse Telnet IAC before any TLS protection exists, complete with its own state machine to detect the STARTTLS option and decide whether to invoke TLS — protocol surface on unauthenticated cleartext bytes that the implicit-TLS-from-byte-zero design never exposes.
If a future deployment specifically needs TN3270E-style STARTTLS for mainframe interoperation, it is a separate proposal with its own authority model — not a generalisation of this one.
In-kernel TLS termination
Rejected. The networking proposal’s “trust-boundary debt” paragraph
explicitly forbids expanding kernel-side networking surface for its
own sake. TLS termination is large, well-served by rustls in
userspace, and gains nothing by living in the kernel.
A single “remote shell” proposal covering both SSH and TLS
Rejected. The two paths share a TerminalSession factory and the
broker/session/launcher plumbing, but their transport, key custody,
client population, and user-auth ergonomics differ enough that
collapsing them produces a worse design document. They are described
separately, sized separately, and can be implemented and audited
independently.
Implementation Slices
Slices land in this order. None is a single opaque commit. Slice 1 is shared with the SSH gateway and may be delivered by either project.
-
Userspace cleartext-byte-pair
TerminalSessionfactory. DefineByteStreamPair,TerminalSessionFromByteStream.wrap, andTerminalLineOptions. Implement against a plaintext userspace byte pair first, with no TLS in the loop. Build the line discipline on the shared host-testedcapos_lib::line_disciplinemodule (the cooked-mode core the retired kernelSocketTerminalSessionused) plus a userspace IAC state-machine module, producing byte-identical output to the UART terminal for echo policy, hidden password, CRLF state, and peer close. Either project (this proposal or the SSH gateway) may deliver this slice; both projects depend on it.No SSH or TLS terminal transport slice should proceed past fixture work until this factory exists, IAC/line discipline is factored, hidden password behavior is byte-identical to the raw TCP terminal, and repeated close/reconnect proofs pass.
-
TlsServerConfigconsumption with development leaf. Wire a manifest-seeded leaf intoKeyVaultandCertificateStore, composeTlsServerConfigwith the reviewed algorithm policy, and addmake run-telnet-tls-configproving the cap signs handshake transcripts, refuses non-allow-listed algorithms, and never exposes private key bytes in proof logs. The dev path uses the same caps as production; only the issuance source differs (manifest import vs ACME / internal CA). -
telnet-tls-gatewayservice, password path. Boot the userspace gateway against a scopedTcpListenAuthorityfor port 992, terminate one host-loopback TLS 1.3 handshake withopenssl s_client, write the cleartext byte pair into the factory, and run alogin→caps→exittranscript through the existingCredentialStoreflow. Prove the “Service Liveness” rule with repeated connections. -
mTLS user auth. Add
SessionManager.tlsClientCert, defineAuthorizedTlsClientrecords (subject/SAN/fingerprint → principal/profile mapping), wireTlsServerConfig.clientVerifier, and prove the four staged states with the trust-path/authorize distinction the mTLS auth section already requires: trust-path failure such as untrusted issuer, expired, revoked, signature invalid, name-constraint violation, or missingclientAuthEKU (TLS handshake alert, noSessionManagercall); verified-but- unmapped cert (SessionManager.tlsClientCertdeny pre-launch with sanitized audit reason); verified+mapped cert with profile mismatch (AuthorityBrokerdeny); accepted cert (UserSessionwithtlsClientCertstrength reaches the shell,capsconfirms boundary). -
Production custody path. Replace the manifest-seeded leaf with an ACME-issued or internal-CA-issued chain through the cert proposal’s
Issuerinterface. Prove rotation throughCertificateStore.watchlands without restart and without breaking in-flight sessions. -
system-telnet-tls.cue,make run-telnet-tls, and the host harness. Default the manifest to mTLS-required with a fallbackpasswordOnlyknob, add cleanup proofs for client disconnect, serverclose_notify, and shell exit, and update the topic indexes, sidebar, anddocs/tasks/README.mdwhen the slice lands.
Each slice keeps the kernel networking surface untouched. New TLS
state lives in the userspace gateway; new line-discipline state, if
any, stays inside the TerminalSession factory’s implementation.
Resource And Teardown Rules
The gateway must enforce fixed per-connection bounds and fail closed
when they are exceeded. Disconnect, TCP close, TLS close_notify,
failed handshake, terminal-factory error, shell exit, and gateway
shutdown must all release the same resources:
- accepted socket,
- TLS connection state (handshake buffers, key schedule, record-layer buffers),
- cleartext byte pair,
TerminalSessionobject,- spawned shell handle,
- broker-issued grants,
- audit correlation record.
Shell exit closes the cleartext byte pair, which closes the TLS
layer, which closes the TCP socket. Client disconnect or TLS
close_notify closes the TLS layer, which closes the byte pair,
which the shell observes as a normal TerminalSession close. There
is no privileged “tear down everything” path that bypasses the
byte-pair lifecycle.
The accept loop applies the same shape as the post-7a155f4
plaintext gateway: per-connection failures (handshake error,
factory error, launch error, shell wait error) are log-and-continue
events; setup-time failures (listener creation, broker bootstrap,
TLS config acquisition) and accept itself remain fail-closed. The
“Service Liveness” review rule applies verbatim.
Threat Model And Honest Limits
What Telnet-over-TLS gives, with TLS 1.3 + AEAD + ECDHE + deployment-issued or pinned-development leaf:
- Confidentiality and integrity against passive and active network observers.
- Forward secrecy of session bytes after the connection ends.
- Per-session randomness (replay protection) from the TLS handshake.
- Server identity assertion as good as the deployment’s trust path: ACME-issued public chain, corporate-CA chain, or SPKI pinning in the QEMU smoke.
- With mTLS: cryptographic client identity tied to PKI, with rotation and revocation on the same operational track as the rest of the deployment’s TLS estate.
What it does not give:
- SSH-style channel multiplexing, exec, port forwarding, agent forwarding, subsystem requests. These are explicit non-goals; if they are needed, the SSH gateway is the right path.
- Resistance against an attacker who can replace the deployment’s trust path on the client side. SPKI pinning in the harness mitigates this for the QEMU smoke; deployments must use a real trust anchor.
- Stronger user auth than the deployment provisioned. mTLS without
principal mapping is just transport; password fallback without
step-up is just
CredentialStore. The gateway does not synthesise authority it was not given.
This proposal does not claim Telnet-over-TLS is “as secure as SSH” or “more secure than SSH.” It is a different protocol with a different operational profile and a smaller surface to review. Whether that profile suits a given deployment is an operational decision, not a default.
Dependencies
- Networking for the socket capability surface served by the userspace network stack, the host-loopback exposure rule, and the trust-boundary-debt paragraph this proposal must not extend.
- Certificates and TLS for
TlsServerConfig,Certificate,CertificateChain,TrustStore,CertVerifier,CertificateStore.watch,Issuer/ACME, algorithm policy, and CT/OCSP plumbing. - Cryptography and Key Management
for sign-only
PrivateKey,KeyVault,SealPolicy, andEntropySource(or a narrowedTlsTransportCryptocap). - Shell for the
TerminalSessionboundary and the rule that remote text transports do not turn the shell into a raw byte-stream consumer. - Boot to Shell for
CredentialStore,SessionManager,AuthorityBroker, and thelogin/setupflow the password fallback path reuses. - SSH Shell Gateway for the parallel
TerminalSessionfactory requirement and theTcpListenAuthority/RestrictedShellLauncher/SessionManagerconventions to mirror. - User Identity and Policy for principal/account/session/profile semantics shared by password and mTLS paths.
- Resource Accounting and Quotas for listener, socket, handshake-buffer, key-schedule, terminal, and shell-process bounds.
- System Monitoring for audit record shape and retention boundaries.
- Storage and Naming for the capability-native storage path that production leaf certs and client-cert principal records become durable through.
External standards grounding:
- IANA Service Name and Transport Protocol Port Number Registry —
telnetson TCP/992 (the implicit-TLS variant the default manifest binds). - RFC 8446 (TLS 1.3). Older TLS RFCs are listed only to document why they are explicitly out of scope.
- RFC 854/855/856/857/858 (Telnet, option negotiation, binary, suppress-go-ahead, echo) for the upper protocol the kernel IAC filter already implements.
- RFC 5280 (X.509 PKI) and RFC 8555 (ACME) for the certificate chain and issuance paths.
- RFC 2941 / RFC 2946 cited only as the explicitly-rejected STARTTLS-style alternative (see Considered Alternatives).
Grounding
In-tree project docs read or re-read while shaping this proposal:
- Networking
for the Phase A/B/C boundaries, the
TcpListenAuthorityshape, the kernel-side IAC filter, the post-7a155f4IAC handoff fix, and the trust-boundary-debt rule against expanding kernel networking surface. - SSH Shell Gateway
for the
RestrictedShellLauncher/SessionManager/AuthorityBroker/ scoped-listener pattern and the staged transport-verify-then-authorize separation that the mTLS path now mirrors. - Certificates and TLS
for
TlsServerConfig,CertVerifier,TrustStore,CertificateStore.watch,Issuer/ACME, and the rotation-without- restart rule the production-custody slice depends on. - Cryptography and Key Management
for
KeyVault,SealPolicy, sign-onlyPrivateKey, andEntropySourceshape. - Shell for the
TerminalSessionboundary and the rule that remote text transports do not become rawByteStream/StdIOsubstitutes. - Boot to Shell
for the
login/setupflow the password fallback reuses and theCredentialStorefailure/backoff/audit policy. REVIEW.mdfor the Service Liveness rule applied to the gateway accept loop, the design-grounding requirement that produced this section, and the proposal-doc shape (status header, last-reviewed timestamp with timezone, relative links).
docs/research/ files read for prior-art grounding:
- Genode for the
session-factory precedent: clients receive narrowed sessions from
authority-bearing components rather than holding a broad factory
themselves. The
TerminalSessionFromByteStream/ gateway split follows that pattern, exactly as the SSH proposal does. - Pingora for the
listener / TLS-termination / service split that informs keeping
the
TcpListener, the TLS terminator, and the application-shaped shell-launch authority on separate caps. The TCB-acknowledgement paragraph in the password-fallback section is grounded in this separation: TLS termination puts plaintext in the terminator’s memory by construction, and the right answer is to size and bound the terminator, not to claim it never sees the bytes. - Plan 9 and Inferno
for the Plan 9
cpuremote-shell precedent: a CPU server is reached over a connection-oriented transport (originally TCP, with TLS/SSL added later), the client authenticates through 9P’s pluggableTauth/Rauthauth-fid mechanism, and only after authentication does the clientTattachand run an interactive shell. Inferno’s certificate-based authentication model is the same shape with X.509 instead of Kerberos. The relevance here is structural: remote-CLI access can be built around connection- oriented authenticated transports with verification and authorization as separate stages, exactly the split this proposal uses for mTLS plusSessionManager.tlsClientCert. capOS does not adopt Plan 9’s namespace-as-authority model — that is the wrong primitive for a Cap’n Proto-typed system — but the staged authenticate-then-attach pattern validates the design.
No other docs/research/ file is directly applicable: the seL4,
Zircon, EROS/CapROS/Coyotos, LLVM, capnp/OS error handling,
IX-on-capOS hosting, and out-of-kernel scheduling reports do not
address remote-shell transport choice or PKI integration in ways
that change this proposal.
Non-Goals
- Replacing or subordinating the SSH Shell Gateway. The two are peer production paths.
- Telnet-over-TLS as a research-only or demo-only path. Production
custody (
KeyVault+CertificateStore.watch+ ACME / internal CA) is the target shape from slice 5; the manifest-seeded development leaf is a stepping stone, not a parallel architecture. - STARTTLS via Telnet options.
- TLS 1.2 or any cipher-policy negotiation surface that allows downgrade.
- Adding rustls, in-kernel TLS, or any in-kernel networking parser.
The kernel
SocketTerminalSessionis retired; no terminal or protocol byte handling returns to the kernel. - SSH-style channel multiplexing, exec, port forwarding, agent forwarding, X11, subsystems.
- Treating a verified client cert as authority. Authority comes
from the principal mapping in
SessionManagerand the bundle issued byAuthorityBroker, exactly as for SSH public keys.
Proposal: Boot to Shell
How capOS should move from “boot runs smokes and halts” to an authenticated, text-only interactive shell without weakening the capability model.
Problem
The old boot path was a systems bring-up path that started fixed services,
proved kernel and userspace invariants, and exited cleanly. The completed local
console milestone added interactive login/setup and shell behavior; the later
init-owned default manifest moved that shell behind standalone init. The
remaining problem space is remote/web login, stronger credential policy, and
richer shell/session behavior without reintroducing ambient authority.
The first interactive milestone was deliberately modest:
- Boot QEMU or a local machine to a text console login/setup prompt.
- Start a native capability shell after local authentication or first-boot setup.
- Keep browser-hosted text terminal, WebAuthn/passkeys, and remote enrollment as later work in the same proposal family after the local console path works.
- Keep graphical shells, desktop UI, window systems, and app launchers as a later tier.
The risk is that “make it interactive” tends to smuggle ambient authority back
into the system. A login prompt must not become a kernel uid, a web terminal
must not become an unaudited remote root shell, and first-boot setup must not
be a first-remote-client-wins race.
Scope
The completed local-console milestone covered:
- Serial/local text console login and first-boot credential setup.
- Native text shell as the post-login workload.
- Minimal
SessionManager,CredentialStore,AuthorityBroker, andAuditLogpieces needed to launch that shell with an explicit CapSet. - Password verifier records stored with a memory-hard password hash.
- Local recovery/setup policy for machines with no credential records.
Later in the same proposal family:
- Passkey registration and authentication for a web text shell.
- A passkey-only account path that does not require creating a password first.
- Federated login via OpenID Connect (OIDC) identity providers — device code on the local/serial console, authorization code + PKCE on the web text shell. See OIDC and OAuth2.
Out of scope:
- Graphical shell, desktop session, compositor, GUI app launcher, clipboard, or remote desktop.
- POSIX
/bin/login, PAM,sudo,su, or Unixuid/gidsemantics. - Password reset by policy fiat. Recovery is a separate authenticated setup or operator action.
- Making authentication proofs visible to the shell, agent, logs, or ordinary application processes.
Design Principles
- Authentication creates a
UserSession; capabilities remain the authority. - The shell is an ordinary process launched with a broker-issued CapSet.
- Console authentication, web authentication, and federated OIDC login feed the same session model.
- Passwords are verified against versioned password-verifier records; raw passwords are never stored, logged, or passed to the shell.
- Passkeys store public credential material only; private keys stay in the authenticator.
- OIDC ID tokens are verified against a pinned
OidcIdentityProvider; the raw token never reaches the shell or audit stream as bytes. - First-boot setup requires local setup authority or an explicitly configured bootstrap credential. Remote first-come setup is not acceptable.
- A missing credential store does not imply an unlocked system.
- Guest and anonymous sessions are explicit policy profiles, not fallbacks for missing credentials.
- Development images may have an explicit insecure profile, but that must be visible in the manifest and serial output.
Architecture
The original local console boot-to-shell proof collapsed the authentication
service and interactive shell into a single userspace process. Focused
shell-led smokes still boot capos-shell directly as initConfig.init with a
narrow bootstrap CapSet (see
Service Architecture).
The default system.cue path now runs capos-shell as an init-started service
through standalone init
(Service Architecture),
but the shell-side authority model is the same: it mints its own anonymous
UserSession and only upgrades after a password login:
flowchart TD
Kernel[kernel starts one init]
Init[standalone init or focused shell init]
Shell[capos-shell]
Cred[CredentialStore]
Session[SessionManager]
Broker[AuthorityBroker]
Audit[AuditLog]
Term[TerminalSession]
Web[WebShellGateway]
Launcher[RestrictedShellLauncher]
Kernel --> Init
Init --> Shell
Shell --> Term
Shell --> Cred
Shell --> Session
Shell --> Broker
Shell --> Audit
Session --> Broker
Broker --> Launcher
Cred --> Session
Audit --> Session
Audit --> Broker
Web -. "future" .-> Session
The shell keeps the authority-holding caps needed for its session boundary
(terminal, creds, sessions, audit, broker) because the current interactive
substrate has not split login, shell, and approval into separate services. It
does not hand those caps to any child it spawns; spawn grants go through the
broker-issued RestrictedLauncher whose allowlist depends on the current
session’s profile (empty for anonymous, full interactive shell set for
operator, and empty or narrowly policy-selected for guest). The launcher
itself is the
Service Architecture
ProcessSpawner cap wrapped behind broker-enforced policy, so a shell child
cannot widen its CapSet at spawn time.
The broker returns a narrow shell bundle such as:
terminal TerminalSession
self UserSession metadata
status read-only SystemStatus
logs scoped LogReader
home scoped Namespace or temporary Namespace
launcher RestrictedLauncher
approval ApprovalClient
Early builds can omit storage-backed home and use a temporary namespace. They
still should not hand the shell broad BootPackage, ProcessSpawner,
FrameAllocator, raw device, or global service-supervisor authority by default.
First Terminal Boundary
The first interactive console boundary should be a session-scoped
TerminalSession, not a widened boot Console cap and not a raw byte-stream
cap handed directly to login or shell processes.
Console stays the early-boot and panic-path output surface. The component
that owns the underlying local console transport, line discipline, edit buffer,
and later web-terminal framing can be called ConsoleTerminal or
TerminalMux; the external authority boundary is the same either way:
- only the terminal service owns raw console transport state and line buffers,
- the shell process receives the foreground
TerminalSessioncap and drives pre-auth password/setup input through it with per-callecho = hidden, - shell children do not inherit the terminal unless the shell names it in a spawn plan.
A later web-shell or federated-login service that needs a separate
authentication front-end will still get its own TerminalSession and its
own broker-issued bundle; it does not widen authority on the local console
shell. The shell-side framing of this split — terminal-host process versus
shell process, with the terminal owning raw console state and the shell
owning the post-auth command loop — lives in
Shell.
The first interface should stay line-oriented:
enum LineEcho {
visible @0;
hidden @1;
}
enum LineStatus {
submitted @0;
cancelled @1;
closed @2;
}
struct LineRequest {
prompt @0 :Text;
maxBytes @1 :UInt32;
echo @2 :LineEcho;
allowEmpty @3 :Bool;
}
interface TerminalSession {
write @0 (data :Data) -> ();
writeLine @1 (text :Text) -> ();
readLine @2 (request :LineRequest) -> (status :LineStatus, line :Data);
}
That shape fixes the first boot-to-shell boundary:
readLinereturns one bounded line or a structuredcancelled/closedresult. The service owns the temporary edit buffer and scrubs it after completion or cancellation.- Echo policy is per call. Password entry uses
echo = hidden; the shell never toggles a terminal-global echo bit that could leak into later prompts. - The terminal service enforces a hard implementation ceiling even if a caller
asks for a larger
maxBytes.ConsoleLoginand setup flows should request smaller bounds than the shell’s ordinary command reader. - Cancellation is line-scoped. Operator abort input returns
cancelledand the caller receives no partial secret buffer. - The first milestone does not need raw byte reads, terminal history replay,
multi-reader fan-out, or shell-visible secret-state. Paste framing, resize,
and richer terminal controls can extend
TerminalSessionlater.
This keeps password/setup entry inside ConsoleLogin and terminal services.
The broker, audit log, shell, and shell children only see the outcomes they
need: session metadata, policy results, and a terminal handle for post-auth
interactive work.
Console Login
The local console path now runs entirely inside capos-shell, so “login”
is a shell command rather than a separate pre-shell process. The shell always
boots with an anonymous session; authentication is an explicit user action.
The three states below describe what the login and setup commands see,
not a boot-time mode selector. The in-shell command surface and the
login / setup / caps / inspect command behavior live in
Shell
and Shell;
this proposal describes only the session/credential/broker authority side of
the same flow. The make run-login smoke covers the password path, and
make run-shell covers the anonymous-only path.
Password Configured
If CredentialStore has an enabled console password verifier for the selected
principal or profile, login prompts for the password, verifies it through
CredentialStore, mints an operator UserSession via SessionManager.login,
asks the broker for the operator shell bundle, and swaps the in-shell
session and launcher in place.
The verifier record should be versioned:
PasswordVerifier {
algorithm: "argon2id"
params: { memoryKiB, iterations, parallelism, outputLen }
salt: random bytes
hash: verifier bytes
createdAtMs
credentialId
principalId
}
Argon2id is the default target because it is memory-hard and widely reviewed. The record must include parameters so stronger settings can be introduced without invalidating older records. A deployment may add a TPM- or secret-store-backed pepper later, but the design must not depend on a pepper being present.
On failed attempts, the shell records an audit event and applies bounded
backoff before re-prompting within the same login invocation. The backoff
state is not a security boundary by itself, because local attackers may
reboot; the password hash strength still matters.
No Console Password
If no console password verifier exists, login reports that setup is
required. The user must run setup to create the first verifier. The
make run-login-setup smoke drives the first-boot path: no verifier exists,
login refuses, setup mints the first volatile verifier through the
manifest operator seed principal, and the shell then upgrades to an operator
session.
Setup mode can:
- create the first console password verifier,
- enroll a first passkey for the web text shell (future),
- create both credentials (future).
Until a credential is created, the shell stays in the anonymous session: it
can exercise caps, inspect, session, and help, but the broker-issued
anonymous launcher has an empty allowlist, so the shell cannot spawn children
or escalate authority. This matches the operator expectation: no configured
password means “setup required”, not “open console”.
Passkey-Only Deployment
Passkey-only should be possible without creating a password. It still needs a bootstrap authority path.
Acceptable first-passkey bootstrap paths:
- local console setup enrolls the first passkey and then never creates a password verifier,
- the manifest or cloud metadata includes a predeclared passkey public credential for an operator principal,
- the console prints a short-lived setup challenge that a web enrollment flow must redeem before registering the first passkey.
Unacceptable path:
- the first remote browser to reach the web endpoint becomes administrator because no password exists.
If a machine is passkey-only, the local console can still expose setup, recovery, guest, or diagnostic profiles according to policy. It should not silently become an unauthenticated administrator shell.
Guest and Anonymous Profiles
The user-identity proposal distinguishes authenticated, guest, anonymous, and pseudonymous sessions (see User Identity and Policy for the full taxonomy and User Identity and Policy for the underlying session structure). Boot-to-shell should consume that model directly.
Authenticated password login creates a human or operator UserSession with
auth strength password. Authenticated passkey login normally creates a human,
operator, or pseudonymous UserSession with auth strength hardwareKey.
Neither proof is authority by itself; both feed the broker.
Default password-authenticated local operator sessions do not expire by fixed
wall-clock timestamp; their normal lifecycle is explicit logout,
terminal/connection/process-tree close, or administrator revocation. A
manifest can still opt into a hard operator lifetime for focused proofs or
deployment policy.
Guest is the only unauthenticated profile that belongs on the local interactive
console by default. It is a deliberate SessionManager.guest() path with a
local interactive affordance, weak or no authentication, short expiry, tight
quotas, no durable home unless policy grants one, and a bundle such as:
terminal TerminalSession
self guest UserSession metadata
tmp temporary Namespace
launcher RestrictedLauncher(allowed = ["help", "settings"])
logs scoped LogReader for this guest session
Guest should not receive ApprovalClient for administrative actions unless a
named policy grants it. If no console password exists, setup may offer a guest
session only when the manifest explicitly enables a guest profile. Otherwise
the operator must create a credential or leave the ordinary shell unavailable.
Anonymous is different. It is usually remote or programmatic, has a random ephemeral principal ID, receives a smaller cap bundle than guest, and has no elevation path except “authenticate” or “create account”. It is not the console fallback for missing credentials, and it should not be counted as “booted to shell” unless the product goal is an explicitly anonymous demo.
If the web gateway later supports anonymous access, it should be a purpose-scoped workload or very restricted text terminal with no durable home, strict quotas, short expiry, and audit keyed by network context plus ephemeral session ID. It must not share the passkey setup path, because passkey-only bootstrap is a credential-enrollment flow, not anonymous access.
An empty CapSet remains the “Unprivileged Stranger” case. It is useful for attack-surface demonstration, but it is not a session profile and not a shell login mode.
Web Text Shell and Passkeys
This is later work in the same proposal family, not part of the current local-console acceptance gate. The web shell is a browser-hosted terminal transport, not a graphical shell. It should display the same native text shell protocol through a terminal UI and should launch the same kind of session bundle as the local console path.
Required pieces:
- network stack and HTTP/WebSocket or equivalent streaming transport,
- TLS or a deployment mode acceptable to browsers for WebAuthn,
- stable relying-party ID and origin policy,
- random challenge generation,
- passkey credential storage,
- user-verification policy,
- audit and rate limiting.
Passkey credential records should store public material:
PasskeyCredential {
credentialId
principalId
publicKey
relyingPartyId
userHandle
signCount
transports
userVerificationRequired
createdAtMs
}
The authentication flow is:
- Browser requests a login challenge.
WebShellGatewayasksSessionManagerorCredentialStorefor a bounded, random challenge tied to the relying-party ID and intended principal.- Browser calls the platform authenticator.
- Gateway verifies the WebAuthn assertion, origin, challenge, credential ID, public-key signature, user-presence/user-verification flags, and sign-count behavior.
SessionManagermints aUserSessionwith auth strengthhardwareKey.AuthorityBrokerreturns the shell bundle for that session/profile.RestrictedShellLauncherstarts the native text shell connected to the web terminal stream.
Registration requires an existing authenticated session, local setup authority, or an explicit bootstrap path. Passwordless registration is allowed; unauthenticated remote registration is not.
Remote Session Clients
The same authentication and broker model also serves non-shell remote clients.
A host app – CLI, native GUI, Tauri backend, webapp gateway, or service
client – should not have to start a terminal shell just to call typed
services. After password, public-key, OIDC, passkey, mTLS, guest, anonymous,
or service/workload admission succeeds under policy,
SessionManager mints a UserSession and AuthorityBroker returns a
remote-client bundle. The client then sees a remote CapSet view whose entries
are Cap’n Proto RPC object references, not local capOS cap slots.
This keeps boot/login policy unified:
- authentication proofs are consumed by trusted session/admission services;
- the broker chooses the CapSet for the selected profile;
- shells, web terminals, agents, and non-shell remote clients are different consumers of session bundles;
- password auth is one adapter, not the remote protocol shape.
The detailed remote-client design lives in Remote Session CapSet Clients.
Federated Login (OIDC)
OIDC is the third authentication path alongside password and passkey. It lets capOS accept identity from a corporate IdP (Azure AD, Google Workspace, Okta, Keycloak, Dex, GitHub) without capOS storing or managing primary user credentials. The schemas, grant types, JWKS handling, and token lifecycle live in OIDC and OAuth2; this section describes only the integration surface.
Console (device code)
Serial consoles have no browser. The login path is RFC 8628 device authorization:
ConsoleLogincallsOAuthClient.startDeviceCodeon an IdP that the manifest has configured as acceptable for console login.TerminalSession.writeLineprints the verification URL and user code; the user completes the flow on a separate device.ConsoleLoginpollspollDeviceCodeat the advertisedinterval, honoringslow_down. Expiry is a hard fail.- On
granted,ConsoleLoginpasses the resultingIdTokencap toSessionManager.login(method = "oidc", proof = idTokenRef). SessionManagercallsOidcIdentityProvider.verifyIdTokenwith the client’sIdTokenPolicy, receivesIdTokenClaims, derivesPrincipalInfo.id = hash(iss, sub), derivesauthStrengthfromacr/amr, and mints aUserSession.- The broker returns the same shell bundle as for other login methods; no OIDC-specific authority flows into the shell.
Failed verification uses the same generic failure text and bounded backoff as password login. The manifest controls which IdPs the console accepts and which subject patterns are allowed to log in; unlike password/passkey paths, OIDC login does not implicitly treat “any valid token from any configured IdP” as authority — a permitted- subject allow-list is required.
Web text shell (authorization code + PKCE)
WebShellGateway offers OIDC alongside WebAuthn. The gateway drives
OAuthClient.startAuthCode, redirects the browser to the IdP, and
consumes the returned code through completeAuthCode. PKCE is
mandatory; state and nonce are generated from EntropySource.
The gateway validates redirect URI exactly, requires TLS, and enforces
IdTokenPolicy.nonceMustMatch.
Identity provider trust
CredentialStore gains IdP trust records alongside password
verifiers and passkey public credentials:
IdpTrustRecord {
recordId
issuer # canonical URL
clientRegistrations # allowed OAuthClient records for this IdP
jwks # snapshot or discovery URL + pinned TLS roots
allowedAlgorithms
allowedAcr / allowedAmr
subjectAllowList # e.g. principals matching sub/email/groups
clockSkewSeconds
authStrengthMap # acr/amr -> AuthStrength (X.1254 LoA)
createdAtMs
}
Records are public material (IdP URLs, JWKS, policy). Like passkey
records, they can be bootstrapped from the manifest or cloud
metadata, with a bounded RAM overlay for admin-managed records until
durable storage exists. CredentialStore.verify stays a secret-
preserving boundary; OIDC verification that rejects a token returns
only denied with a generic failure class.
Federated principal bootstrap
For a fresh image with no local password, OIDC login can create the
first UserSession when the manifest explicitly predeclares:
- one or more trusted issuers,
- a subject allow-list or group/claim predicate,
- the principal identities those subjects map to.
This is the OIDC analog of the manifest-declared passkey bootstrap path: the authority comes from the manifest trust root, not from “the first caller who presents a token wins.” Without predeclared trust, OIDC login cannot be the only path to an administrative session on a fresh image — setup mode applies.
Scope of tokens
Access tokens issued alongside the ID token belong to the OAuth
service. Neither the shell nor the broker ever receives raw token
bytes. If the broker needs to delegate outbound authority to the
session (e.g. “read from our corporate storage API”), it returns a
wrapper cap holding an AccessToken cap, not a bearer string.
Refresh and session duration
SessionManager holds the RefreshToken cap associated with a
federated session when the IdP issues one and the scope includes
offline_access (or the IdP’s equivalent). Token refresh is a
privileged operation scoped to SessionManager and audited; the
shell cannot refresh its own session token. On logout or session
expiry, SessionManager releases the refresh token and optionally
calls the IdP’s revocation endpoint.
Required Interfaces
These are ordinary capabilities, not kernel modes.
EntropySource
Owns the only approved path for fresh auth/session secrets in the first implementation.
Responsibilities:
- provide unpredictable bytes for password salts, session IDs, setup tokens, and later WebAuthn challenges,
- fail closed when secure randomness is unavailable instead of returning predictable bytes,
- keep raw entropy authority out of shells and ordinary workloads.
Only CredentialStore, SessionManager, later WebShellGateway, and a
future SshGateway or narrower SSH transport-crypto service should hold it.
ConsoleLogin, the shell, and spawned workloads should never mint their own
session IDs, salts, setup tokens, SSH key-exchange material, or challenges.
CredentialStore
Owns credential verifier records and challenge state.
Responsibilities:
- list whether setup is required without exposing hashes,
- create password verifier records from setup authority,
- verify password attempts without returning the password or verifier bytes,
- register passkey public credentials,
- store trusted OIDC identity-provider records (issuer, JWKS or
pinned discovery URL, allowed audiences, subject allow-list,
acr/amr→AuthStrengthmapping) soSessionManagercan consumeOidcIdentityProvidercaps bound to deployment policy, - issue and consume bounded WebAuthn challenges,
- rotate or disable credentials through an authenticated admin path.
- load bootstrap verifier/public-credential and IdP-trust records from manifest or cloud bootstrap config and maintain a bounded RAM overlay until durable storage exists.
SessionManager
Creates UserSession metadata after successful authentication, explicit local
guest policy, purpose-scoped anonymous policy, or setup policy. It should
record auth method, auth strength, freshness, expiry, profile, and audit
context. It should not hand out broad system caps directly. Boot-to-shell uses
authenticated sessions and optional local guest sessions for ordinary
interactive shells; anonymous sessions are narrower remote/programmatic
contexts unless a manifest explicitly defines an anonymous demo terminal.
Session IDs come from EntropySource; if fresh randomness is unavailable,
authenticated login and token-bearing setup flows fail closed instead of
reusing predictable IDs. The end-to-end mint/promote sequence and the
account-store boundary it consumes are
User Identity and Policy;
the shell-side immutable-per-process invocation context that consumes the
minted session lives in
Shell and is
proven by make run-session-context. The make run-local-users smoke covers
the manifest-seeded local operator path that backs the password-login flow.
AuthorityBroker
Maps a session/profile to a narrow CapSet. Early policy can be static and manifest-backed. The important constraint is that the broker returns capabilities, not roles or strings that downstream services treat as authority.
ConsoleLogin
Consumes TerminalSession, CredentialStore, SessionManager, broker access,
and a restricted shell launcher. It never receives broad boot-package or device
authority unless a recovery profile explicitly grants it. It owns pre-auth
password/setup entry and must not forward raw password bytes, setup tokens, or
partial secret input into the shell, broker, or audit service.
On the current local-console substrate ConsoleLogin is not a separate
process. Its responsibilities are folded into capos-shell, which owns the
pre-auth TerminalSession, drives password/setup prompts, invokes
CredentialStore/SessionManager/broker, and promotes its own session
in place. The authority rules above still apply: the same process must not
leak password bytes, setup tokens, or broker secrets into spawned children.
A future web-shell or federated-login front-end can re-introduce a separate
ConsoleLogin-shaped service that mints sessions for a distinct shell
process.
WebShellGateway
Terminates the browser terminal session, handles passkey challenge/response, drives the OAuth authorization code + PKCE flow for federated login, and connects the authenticated session to the shell process. It should not own general administrative caps. It should ask the broker for the same narrow shell bundle as any other session.
SshGateway
Terminates SSH transport for CLI remote shell access, verifies host/user key
protocol state, maps accepted SSH public keys to sessions, and connects the
authenticated session to the shell process through an SSH-backed
TerminalSession. It should not own general administrative caps, raw
KeyVault administration, port-forward authority, or broad process-spawn
authority. It should ask the broker for the same narrow shell bundle as any
other session. The detailed transport and key-custody model is in
SSH Shell Gateway. The initial schema names the
supporting authority surfaces TcpListenAuthority, SshHostKey,
AuthorizedKeyStore, SshTerminalFactory, and RestrictedShellLauncher; the
development host-key path now exists only as an explicitly labeled
non-production QEMU proof. Bounded QEMU proofs now cover configured
authorized-key lookup, fixture public-key session minting, restricted shell
launch, and a plain-TCP terminal-host handoff, while real SSH signing,
encrypted transport, packet/channel handling, and the final OpenSSH harness
remain later gates.
OAuthClient and OidcIdentityProvider
Supplied by the OAuth service
(OIDC and OAuth2).
ConsoleLogin holds an OAuthClient cap configured for device-code
grants against the manifest-declared IdPs, and an
OidcIdentityProvider cap for ID-token verification. WebShellGateway
holds analogous caps configured for authorization code + PKCE.
Neither service retains access tokens in long-lived session state —
refresh tokens live inside SessionManager, bound to the
UserSession lifecycle.
AuditLog
Records setup entry, credential creation, failed attempts, successful session creation, broker decisions, shell launch, credential disablement, and logout. Audit entries must not include passwords, password hashes, passkey private material, bearer tokens, complete environment dumps, or full terminal lines. Correlate auth/session events with opaque record IDs and policy/result codes, not with secret-bearing payloads.
First Security Substrate
Before local setup/login code lands, the first implementation should fix these rules:
- Entropy source:
CredentialStoreandSessionManagerreceive anEntropySourcecap. Password salts, session IDs, setup tokens, and later passkey challenges come only from it. If secure randomness is unavailable, credential creation, authenticated session creation, setup-token issuance, and passkey enrollment fail closed. The only remaining boot path is an explicit manifest-gated guest or development profile. - Credential backing:
CredentialStoreis initialized from manifest or cloud-bootstrap verifier/public-credential records plus a bounded RAM overlay for setup-created credentials and disable/rotate state. Until a real storage service exists, any setup-created credential and any disable/rotate action recorded only in that overlay is volatile and both the console UX and audit records must say so. The manifest may carry verifier or public-credential material, not raw passwords or reusable setup tokens. - Bounded setup-token/challenge state:
CredentialStoreowns one bounded table for setup tokens and later WebAuthn challenges. Each record is bound to a purpose, principal/profile, opaque record ID, secret bytes, created/expiry times, and consumed bit. The first redemption attempt consumes the record whether the attempt succeeds or fails, so replay always fails closed and retry requires a newly minted token or challenge. Records are scrubbed on consume or expiry. - Auth failure policy:
CredentialStore.verifyreturns onlysuccess,denied, orunavailable.ConsoleLoginprints generic failure text and enforces bounded backoff without revealing whether a principal exists, which field mismatched, or whether a verifier came from bootstrap config or the RAM overlay. Permanent lockout is out of scope for the first milestone; bounded delay plus audit is required. - Audit and redaction:
AuditLogrecords structured auth/session events with result codes, profile, auth method, reason classes, and opaque credential/token record IDs. Principal/session IDs appear only after successful authentication or when referring to an already minted session; a failed pre-auth attempt logs only a terminal-local event ID plus generic failure class. It must never log raw passwords, verifier bytes, salts, setup-token/challenge secrets, passkey private material, or full terminal lines. When setup creates a volatile credential or RAM-only disable state, the audit event recordsvolatile = truerather than any secret-bearing payload.
Prerequisites
Boot-to-shell should not be selected before these pieces are credible:
- Default boot uses init-owned manifest execution; the kernel starts only
initwith fixed bootstrap authority. initcan start long-lived services and not just short smoke binaries.ProcessSpawnercan launch the shell and login services with exact grants.- A
TerminalSessionpath exists. CurrentConsolestays output-oriented; login and shell work should use bounded line input with per-call echo mode and structured cancellation instead of raw console reads. - The native text shell exists as a
capos-rtbinary withcaps,inspect,call,spawn,wait,release, and basic error display. EntropySourceexists for salts, session IDs, setup tokens, and later WebAuthn challenges, and auth/setup flows fail closed if it is unavailable.- There is at least bootstrap verifier/public-credential backing plus a bounded RAM overlay. Durable credential storage can come later, but the first implementation must be honest about whether created credentials survive reboot.
- Minimal
SessionManager,AuthorityBroker, andAuditLogservices exist. - A restricted launcher or broker wrapper prevents the shell from receiving broad init authority.
- Web text shell requires networking, HTTP/WebSocket or equivalent, TLS/origin
handling, and WebAuthn verification. It can lag local console boot-to-shell.
TLS configuration, server certificates, ACME issuance, OCSP stapling, and
CT policy are defined in
Certificates and TLS;
WebAuthn attestation certificate verification uses the
CertVerifierfrom that proposal against a FIDO MDS trust store. - Federated OIDC login requires outbound TLS to the IdP discovery and JWKS endpoints, an OAuth client service, and manifest-declared IdP trust records. It depends on networking and the interfaces in OIDC and OAuth2. Device code can land with the local console path once networking exists; authorization code + PKCE lands with the web text shell.
Completed Local Milestone Definition
The local-console boot-to-shell milestone completed when:
make run-shellor the default boot path reaches a text login/setup prompt. The focused proofs aremake run-terminalfor the bounded line-disciplineTerminalSessionsurface,make run-credentialfor the password-verifier store,make run-loginfor the password-login path,make run-login-setupfor the first-boot setup path,make run-local-usersfor the manifest-seeded local-operator path, andmake run-shellfor the anonymous-only path.- With a configured password verifier, the console refuses the shell on a bad password and launches it on the correct password.
- With no console password verifier, the console enters setup mode and requires creating a credential or selecting an explicitly configured local guest or development policy before launching a normal shell.
- If secure randomness is unavailable, setup and authenticated login fail closed; only explicitly enabled guest or development profiles may continue.
- Guest console sessions, when enabled, are created through
SessionManager.guest()and receive only terminal/tmp/restricted-launcher style caps with no administrative approval path by default. - Anonymous sessions are not used as the missing-password console fallback and are not accepted as proof that the ordinary boot-to-shell milestone works.
- The shell starts with a broker-issued CapSet and can prove at least one typed capability call plus one exact-grant child spawn through a granted launcher or other explicitly scoped spawn authority.
ConsoleLogindrops itsTerminalSessiononce the shell starts, and a shell-spawned child without an explicit terminal grant cannot use the terminal.- Audit output records setup/auth/session/broker/shell-launch events without leaking secrets.
- Web text shell, passkey-only enrollment, and remote setup remain later work in this proposal family after the local console path exists.
- Graphical shell work is not part of the acceptance criteria.
Implementation Plan
-
Text console substrate.
TerminalSessionis the first interactive console boundary. KeepConsoleoutput-only; terminal services own bounded line buffers, per-call echo mode, and cancellation behavior. -
Native shell binary. The shell proposal’s minimal REPL over
capos-rtlists CapSet entries, inspect metadata, call granted capabilities includingTerminalSession, use a granted restricted launcher or other scoped spawn authority for exact-grant child launch, wait, release, and print typed errors. The ordinary shell profile must not depend onBootPackageor broadProcessSpawnerauthority. -
Credential store prototype. Manifest/cloud-bootstrap-backed verifier and public-credential records, a bounded RAM overlay for setup-created credentials,
EntropySourceintegration for salts/session IDs/tokens, and Argon2id verification anchor the local path. Host-generated verifier inputs are bootstrap configuration, not acceptance evidence for future credential work. -
Console setup/login. The configured-password path and no-password setup path are implemented. Setup creates verifier state through
CredentialStore, not ad hoc shell process config. The local password path now prompts forusername>before hiddenpassword>, routesSessionManager.loginthrough an account/principal selector plus proof/source metadata, verifies only selected accounts that ownconsole-password, and migrates the existing seeded console password to an explicit defaultoperatoraccount without creating username-enumeration terminal differences. Durable account-local verifier records remain future storage-backed work. -
Minimal session and broker.
UserSessionmetadata and the policy broker return a narrow shell bundle. Anonymous bundles stay separate from ordinary shell login, and QEMU proofs show the shell cannot obtain broad boot authority by default. -
Audit and failure policy. Generic auth failure handling, bounded attempt backoff, hidden password entry, and redacted audit records are part of the completed local path. Future passkey/setup-token challenge state must preserve the same no-secret logging rule.
-
Web text shell gateway. After networking and a terminal transport exist, add WebAuthn registration and authentication for the browser-hosted terminal. Support passkey-only enrollment through local setup or explicit bootstrap authority.
-
Federated OIDC login. Add
OAuthClient/OidcIdentityProviderintegration toConsoleLogin(device code) andWebShellGateway(auth code + PKCE). ExtendCredentialStorewith IdP trust records. Mapacr/amrclaims toAuthStrength. Require a manifest-declared subject allow-list for administrative sessions. -
Durability and recovery. Move credential and IdP-trust records from boot config or RAM into a storage-backed service once storage exists. Define recovery as a credential-admin operation, not an implicit bypass.
Security Notes
- Password hashing belongs in userspace auth services, not the kernel fast path.
- WebAuthn challenge state must be single-use and bounded by expiry.
- The web gateway must validate origin and relying-party ID; otherwise passkey authentication is meaningless.
- Setup tokens are credentials. They must be short-lived, single-use, audited, and hidden from ordinary process output.
- Credential records are sensitive even though they are not raw secrets; avoid printing them in debug logs.
- The shell and any agent running inside it must treat logs, terminal input, files, web pages, and service output as untrusted data.
Non-Goals
- No graphical shell in this milestone.
- No passwordless remote first-use takeover.
- No kernel
uid,gid,root, or login mode. - No default shell access to broad
BootPackage, rawProcessSpawner,DeviceManager, raw storage, or global supervisor caps. - No authentication proof passed through command-line arguments, environment variables, shell variables, audit records, or agent prompts.
Open Questions
- Which Argon2id parameters fit the early userspace memory budget while still resisting offline guessing?
- How should durable storage merge bootstrap verifier records with the first RAM overlay once a storage-backed credential service exists?
- How should local console setup prove physical presence on cloud VMs where serial console access may itself be remote?
- What is the first acceptable TLS/origin story for QEMU and local development WebAuthn testing?
- Should passkey-only machines keep a disabled console password slot for later recovery, or should recovery be entirely credential-admin/passkey based?
Proposal: SystemInfo Capability
System-wide informational data (banner/MOTD today, hostname, help topics, and on-ISO documentation later) exposed as a single typed capability instead of ad-hoc per-feature kernel parameters.
Status: Phase 1 + Phase 2 implemented. Phase 1 introduced the
SystemInfo capability (renamed from ShellConfig, schema field motd)
and unified the print site so console and Telnet shells both call
SystemInfo.motd() themselves. Phase 2 then moved post-login authority
into AuthorityBroker.shellBundle: the broker mints SystemInfo plus a
profile-scoped serviceEndpoints list (adventure + chat for operator
shells, empty for guest and anonymous shells), so Telnet/SSH-launched
operator shells can run chat-client/adventure-client without per-transport
manifest forwarding. chat is the kernel-singleton chat endpoint
(KernelCapSource::ChatEndpoint) so all operator shells share one
chat-server queue; adventure is a fresh per-session endpoint.
Last reviewed: 2026-04-29 05:59 UTC.
Problem
The pre-existing ShellConfig capability had a single method (motd) and
was distributed via manifest cap grants. That was already a capability
shape, but two things made it brittle:
- The name claimed too little. “Shell config” suggests configuration of
the shell binary, but the data is system-wide and transport-agnostic
(banner text doesn’t belong to any one shell). Anything similar we
wanted to expose later — hostname, help topics, manpages — would either
squat on
ShellConfig(wrong scope) or get its own one-method cap (proliferation). - The print site was asymmetric.
initprinted the banner over COM1 before launching the console foreground shell; the Telnet-spawned shell printed it itself after the gateway forwardedshell_configas a manifest grant. Two code paths, two places to keep consistent. The SSH Shell Gateway successor, and any future transport, would add a third.
The capability model already supports a clean fix: one cap, one print site, room to grow.
Design
Interface
interface SystemInfo {
motd @0 () -> (text :Text);
hostname @1 () -> (name :Text);
# Future:
# helpTopics @2 () -> (topics :List(HelpTopic));
# manPage @3 (name :Text) -> (page :ManPage);
}
Adding methods later is a Cap’n Proto-compatible change. Each future
addition gets its own kernel data source (or a userspace SystemInfo
service backed by storage, when persistence exists). Callers that only
need MOTD do not pay for the others.
Data Source
SystemInfo is currently kernel-backed and reads from manifest
kernelParams.motd (renamed from shellMotd) and kernelParams.hostname
(landed; defaults to capos). A CloudMetadata-derived or storage-backed
mutable hostname remains future work;
help topics and manpages will eventually be served by a userspace
documentation service that holds a SystemInfo cap as one of its
exports. The kernel implementation is intentionally minimal — it owns
text the boot manifest already provided, and nothing else.
Distribution
A process gains SystemInfo by listing it as a manifest cap source:
caps: [{
name: "system_info"
source: {kernel: "system_info"}
}]
Phase 1 granted the cap to:
init— kept.initno longer readsSystemInfoitself, but the manifest spawn loop forwards init-held kernel-source caps to each service. The console foreground shell and any gateway service that receivessystem_infois reached through this forwarding, so init must hold the cap.- The default console foreground shell (new — needed so the console shell can print MOTD itself).
telnet-gateway,restricted-shell-launcher, and the SSH gateway terminal-host (each forwardedsystem_infoto the child shell viaRestrictedShellLauncher, the same mechanism that forwardscreds/sessions/audit/broker; thetelnet-gatewayand SSH terminal-host demos are since removed with the kernel socket owner).
Phase 2 moved normal shell distribution into
AuthorityBroker.shellBundle: the broker mints a fresh SystemInfo
cap per session and returns it alongside the launcher, copied session,
and any profile-scoped service endpoint caps allowed for that profile.
RestrictedShellLauncher no longer requires a system_info pass-through grant.
Print Site
The banner is printed by the shell on startup after it obtains its initial shell bundle, across all transports:
#![allow(unused)]
fn main() {
// shell/src/main.rs
fn write_motd_from_bundle(...) -> Result<(), i64> {
let mut system_info = SystemInfoClient::new(bundle.system_info.capability());
let motd = system_info.motd_wait(ring, WAIT_FOREVER)...;
for line in motd.lines() {
terminal.write_line_wait(ring, line, ...)?;
}
Ok(())
}
}
init is no longer responsible for printing MOTD — its
write_motd_to_terminal helper is removed.
Why Phase 1 Stayed Manifest-Driven
Moving SystemInfo distribution into AuthorityBroker.shellBundle
made architectural sense, but it required the broker to hold or be able
to mint informational caps and changed the shell bundle shape. Phase 1
therefore isolated the rename, the unification of the print site, and
the schema interface as separately reviewable prerequisites.
Phase 2: Broker-Minted SystemInfo and Service Endpoints
AuthorityBroker.shellBundle returns a RestrictedLauncher, a copied
UserSession, SystemInfo, and any allowed profile-scoped service endpoint
caps per call:
interface AuthorityBroker {
shellBundle @0 (sessionCapId :UInt32, profile :Text)
-> (launcherIndex :UInt16,
sessionIndex :UInt16,
systemInfoIndex :UInt16,
serviceEndpoints :List(BundleEndpoint));
}
struct BundleEndpoint {
name @0 :Text; # e.g. "chat", "adventure"
capIndex @1 :UInt16;
}
The broker mints:
SystemInfo(always) — replaces the manifest grant.- Service-endpoint caps the requested profile is allowed to reach
(
chatandadventurefor operator profiles, none for guest or anonymous).
RestrictedShellLauncher’s required shell grants collapsed to
creds, sessions, audit, and broker; system_info and service
endpoint authority now arrive through the broker bundle, keeping the
kernel launcher minimal.
Phase 2 implementation notes
- Phase 2 landed in three sub-tiers (A:
SystemInfo; B:adventure; C:chat). The broker holds a kernel-sideArc<Endpoint>for chat — theKernelCapSource::ChatEndpointlazy singleton constructed byBootCapFactory— andArc::clones it into every operator bundle.adventureis fresh per operator bundle. - The shell prefers manifest-granted (
CapSet) caps over bundle service endpoints when both have the same name. The focused chat manifest now gives init the kernel singletonchat_endpointto forward tochat-serverand relies on the broker-issuedchatendpoint for the normal shell path instead of a shell-local chat-server export, matching the Telnet and default shell bundle model. Normal shell@chat badge 200syntax is now rejected by the parser before it can reach the delegated-client relabel check; lower-level smoke paths retain relabel fixtures for kernel/process-spawn enforcement. RestrictedShellLauncher::REQUIRED_SHELL_GRANTSno longer requiressystem_info; the broker is now the single source for that cap.
Cross-References
- Shell — banner ownership and help-topic discovery were implicit open questions; this proposal resolves “where does the banner live” (Phase 1) and “where does the post-login authority live” (Phase 2).
- Networking — Telnet gateway/shell interaction; SystemInfo is now part of the broker bundle consumed by the shell after launch.
- Boot to Shell — login flow runs after the shell has acquired its initial anonymous bundle and printed MOTD from broker-minted SystemInfo.
- Userspace Authority Broker — Phase 2 makes the broker the single source of post-login authority, including informational caps.
Non-Goals
- This proposal does not introduce persistent storage for system information. MOTD comes from the boot manifest; future fields will come from manifest, CloudMetadata, or a userspace documentation service when those exist.
- This proposal does not add a separate pre-authentication issue/banner channel. MOTD is printed after initial shell-bundle acquisition; a true pre-auth warning banner would need its own reviewed distribution path.
- Hostname is now served by
SystemInfo.hostname @1, sourced fromkernelParams.hostname(defaultcapos) and printed by the shellhostnamecommand. A mutable, CloudMetadata-derived, or storage-backed hostname is still out of scope until a consumer needs it.
Open Questions
- Pre-authentication warning banner: MOTD now comes from the initial
broker-issued shell bundle. If capOS later needs a banner before
SessionManager/AuthorityBrokerinteraction, it should be a distinct issue-style surface rather than a regression to ad-hoc manifest grants. - Hostname source: the manifest-field path landed (
kernelParams.hostname, read-only). A CloudMetadata-derived or storage-backed mutable hostname remains parked until a consumer needs to change it at runtime. - Help-topic discovery: tied to schema reflection and the
SchemaRegistryopen question in shell-proposal.md. Likely lives in a userspace documentation service, not in the kernel cap.
Proposal: System Manual Capability
A built-in, system-served reference manual: capOS should be able to explain
itself from inside itself. The Manual capability serves Unix-style man
pages, schema-derived interface manuals, and a man-shaped reference corpus
through three surfaces – the shell, the self-served web UI, and a typed capnp
API – without any ambient file access.
Status: Phases 1-4 settled. Phase 1 landed
the Manual capnp interface, the boot-packaged ManualCorpus blob compiled by
tools/manualc, the kernel Manual cap (kernel/src/cap/manual.rs), and the
shell man/apropos builtins, proven by make run-system-manual-smoke. Phase 2
landed the self-served web-UI doc viewer. Phase 3 (schema-derived section-2
DESCRIPTION) was satisfied at Phase 1 and its already-landed contract is now
locked by proofs (see Phasing). Phase 4 (programmatic API + agent export) is
settled with deviation: the describe @4/buildInfo @5/topics @2 runtime
support already shipped, so Phase 4 reduces to documenting that contract as
stable and locking the consistency invariants that genuinely hold – byte
identity between the in-system manual navigation and the published-site
llms.txt is infeasible and undesirable (see Phasing). This
proposal promotes the documentation surface that the
SystemInfo proposal sketched as the
# Future: helpTopics @2 / manPage @3 stubs into a dedicated capability, and
gives the self-served web UI an
in-system documentation source instead of relying on the externally hosted
mdBook site.
Last reviewed: 2026-05-26 23:17 UTC.
Phase 1 as-built notes
Two Phase-1 choices refine the original plan and are recorded here so the design matches the code:
- Section-2 DESCRIPTION is sourced from
.capnpdoc comments at build time. The schema already carries per-interface doc comments, andtools/manualcparses the.capnptext directly (not the generated bindings), so section-2 pages take their NAME-line title and DESCRIPTION from the interface doc comment and their SYNOPSIS from the parsed methods. This is a pragmatic improvement over the originally-planned curated-prose-keyed-by-id: it cannot drift from the schema and needs no per-interface prose file. The build check therefore requires every interface to carry a non-empty doc comment. Phase 3 still adds doc-comment preservation in the generated bindings for live reflection (describe), which is a distinct mechanism from build-time text parsing. topics @2returns a section-based index. Pages are indexed by manual section (Commands, Capabilities, …). The Phase 2 web-UI viewer renders this section-based index in its topic sidebar; replacing it with the curated front-mattertopicstaxonomy remains future corpus/build work.describe @4is backed by a build-time interfaceId index.manualccomputes each interface’s capnp type id (validated against the generated*_INTERFACE_IDconstants) and emits an id->page-name index in the blob, sodescriberesolves an in-tree interface id to its section-2 page today.
Problem
Today capOS has rich documentation, but none of it is reachable from a running
capOS instance. The corpus lives in docs/ and is rendered by the host-side
mdBook pipeline (tools/mdbook-doc-metadata/); a booted system, a shell user,
or the in-guest web UI cannot answer “what does this capability do?” or “how do
I use this command?” without leaving the system. For a research OS whose whole
thesis is that the typed interface is the contract, the inability to read
those contracts in-system is a conspicuous gap.
Three concrete pressures motivate a dedicated capability:
- A public explorer demo needs in-system docs. The cloud-deployment and self-served web-UI work point toward a publicly explorable instance. An explorer who has never seen capOS needs to discover capabilities, commands, and concepts from inside the UI – not by alt-tabbing to an external site that may drift from the running build.
- SystemInfo is the wrong home. The SystemInfo proposal already foresaw
this and stubbed
helpTopics/manPagemethods, while noting the tension: documentation would “either squat onShellConfig(wrong scope) or get its own one-method cap (proliferation).” SystemInfo is for small system-wide scalars (MOTD, hostname). Documentation is a queryable corpus with search, sectioning, and cross-links. Bolting a content/query service onto a scalar-info cap is the wrong shape. - Schema-as-manual is a capability-native idea worth capturing. Because every capability is a typed Cap’n Proto interface, a capability’s reference page can be generated from its own schema. “The interface IS the permission” extends naturally to “the interface IS its own reference page.” No other OS doc system gets this for free; capOS should not throw it away.
Design Principle: Ground On man, Modernize Navigation
The design is deliberately conservative at its core and ambitious at its edges.
- The core is
man. The proven Unix model – ordered sections (NAME,SYNOPSIS,DESCRIPTION,OPTIONS,ERRORS,SEE ALSO, …), numbered manual sections by kind, andapropos/man -kkeyword search – is the contract every page honors. The mechanics (man <name>,man <section> <name>,man -k) are immediately familiar to anyone who knowsman. - The navigation is modern. On top of the
mancore we layer the discovery affordances that make documentation pleasant: a topic index (reusing the existing front-mattertopicstaxonomy),tldr-style example-first quick views, hyperlinkedSEE ALSOcross-references, and an agent-readable export. Plan 9’s “follow the documentation pointers on demand” philosophy – navigate by need, not by linear reading – is the model for the cross-link graph.
The two are layered, not in tension: the modern surface is a renderer and
index over man-shaped content, never a replacement for it.
Manual Sections (the capOS analog of man 1-8)
Classic man numbers sections by kind of thing documented. Plan 9 keeps the
same idea but splits “devices / file servers / protocol / formats” – a split
that maps cleanly onto a capability OS, where devices and services are
capabilities. The proposed capOS sections:
| Section | Name | Contents |
|---|---|---|
1 | Commands | Shell builtins and userspace command binaries (spawn, caps, login, …). |
2 | Capabilities | One page per typed capability interface (Console, Timer, ProcessSpawner, DMAPool, …); SYNOPSIS schema-generated. |
3 | Runtime & SDK | capos-rt / libcapos / capos-service APIs available to userspace programs. |
5 | Manifests & Schemas | Boot-manifest fields, CUE config, schema/capos.capnp structures and their wire contracts. |
7 | Concepts | Prose: the capability model, the ring protocol, threading contract, session-bound invocation context. |
8 | Operations | Operator/admin surfaces: boot, run targets, remote-session gateway, cloud deployment. |
The section numbers diverge from Unix deliberately: Unix 2 is syscalls, but
capOS’s whole point is that it has essentially no syscall surface, so 2 is
Capabilities instead. The numbering is capOS-specific and documented in
intro(7); the mechanics are unchanged, so the muscle memory transfers.
Section 2 is the capability-native centerpiece – the part no conventional OS
can auto-generate, because conventional OSes have no machine-readable interface
contract for every resource.
Interface
struct ManPage {
name @0 :Text; # "console", "spawn", "capability-model"
section @1 :UInt8; # 1,2,3,5,7,8 (see section table)
title @2 :Text; # short NAME-line abstract
body @3 :Text; # rendered page text (man-shaped sections)
seeAlso @4 :List(PageRef); # cross-links -> SEE ALSO graph
examples @5 :List(Text); # tldr-style example-first snippets
source @6 :Source; # schemaReflection | prose | runtime
lastReviewed @7 :Text; # provenance from doc front matter
buildId @8 :Text; # build/commit id this page was rendered from
}
struct PageRef { name @0 :Text; section @1 :UInt8; siteOnly @2 :Bool; }
struct Topic { key @0 :Text; title @1 :Text; pages @2 :List(PageRef); }
struct Apropos { query @0 :Text; matches @1 :List(PageRef); }
enum Source { schemaReflection @0; prose @1; runtime @2; }
interface Manual {
# man <name> [section]: fetch a single page.
page @0 (name :Text, section :UInt8) -> (page :ManPage);
# man -k / apropos: keyword search over the prebuilt index.
apropos @1 (query :Text) -> (result :Apropos);
# the modern topic index, reusing the docs front-matter taxonomy.
topics @2 () -> (topics :List(Topic));
# enumerate a section (man -s 2: list all capabilities).
section @3 (section :UInt8) -> (pages :List(PageRef));
# interfaceId -> section-2 page lookup (see note below).
describe @4 (interfaceId :UInt64) -> (page :ManPage);
# the build/commit this manual blob was produced from.
buildInfo @5 () -> (commit :Text, builtAt :Text);
}
Manual is read-only and holds no authority beyond serving text. Adding
methods later is a Cap’n Proto-compatible change, matching the additive
discipline the SystemInfo and DDF work already follow.
describe @4 is an interfaceId -> section-2 page lookup. It does not take
or verify a live capability: the caller passes an interface id it already knows
(capabilities expose interface_id() today, see capos-lib/src/cap_table.rs),
and the Manual returns that interface’s manual page, or not-found for an id it
does not document (it covers only in-tree interfaces). It is the programmatic
complement to section-2 browsing – convenient for an SDK or agent that holds
a cap and wants its reference page – not reflection on the live object.
Content Model: Man-Shaped Pages, Not Raw Markdown
A ManPage is structured text with the conventional ordered sections
(NAME/SYNOPSIS/DESCRIPTION/SEE ALSO/…), not free-form Markdown.
This is a deliberate, load-bearing choice. The published docs/ tree is
long-form Markdown – proposals, architecture prose, mermaid diagrams, mdBook
preprocessor directives – and is not a man-shaped corpus. The Manual therefore
serves a purpose-authored, man-shaped corpus built at make time; it is a
distinct artifact, not a verbatim mirror of docs/*.md. The guest renders the
fixed section set (terminal pager / web pane) and needs no full Markdown engine.
What this corpus does and does not share with the published mdBook site:
- Shared: the taxonomy provenance and the schema-derived section-2
interface membership. The man corpus and the published site are tagged from
the same front-matter/
topicsvocabulary, so neither invents a category the other does not recognize. They do not share top-level navigation keys:topics @2navigates by manual section (Commands, Capabilities, …), while the site’sllms.txtnavigates the fulldocs/tree bydocs/SUMMARY.mdsection. The in-system index is a curated subset and deliberately diverges from the site’s navigation (see the Phase 4 settlement under Phasing). - Not shared: the long-form prose body. The manual is concise reference;
the site is the depth. Concept pages (section
7) are short man-shaped summaries whoseSEE ALSOpoints at the fuller site/docs/page. The manual does not claim to reproduce that long-form content.
This removes the conflation between “reuse the corpus” and “serve man-shaped sections”: the taxonomy is reused; the man pages are curated.
Page provenance (the three sources)
ManPage.source records how each page’s body was produced:
schemaReflection– interface manuals (section 2/5). The page structure is derived from the schema at build time: capnp reflection makes method/field names and signatures recoverable, so theSYNOPSIS(method list with ordinals and parameter/return shapes) is generated directly fromschema/capos.capnpfrom day one and cannot drift from the live interface. TheDESCRIPTIONprose comes from.capnpdoc comments. Two prerequisites are not done today and are tracked as their own tasks (see Sequencing): (a) the schema carries almost no doc comments, and (b) the no_std generated bindings undertools/generated/do not preserve schema doc text. Until (a)/(b) land, the generated SYNOPSIS is real but the DESCRIPTION falls back to curated prose keyed by interface id. So the Phase-1 drift surface is the prose body only – not the method list – and a build check (below) forbids a missing page.prose– authored reference (section 1/3/7/8). Man pages authored in the manual corpus for commands, the runtime/SDK, concepts, and operations. Curated, man-shaped, taxonomy-tagged.runtime– live facts (section 8). A small set of pages interpolate live state (current run target, the caller’s granted capabilities forcaps/inspect), sourced from existing caps such asSystemInfo, and are markedSource.runtimeso they are never cached as static.
Subset rule (what is and isn’t in the manual)
The in-system manual ships sections 1/2/3/5 in full plus curated
section-7/8 summaries. It deliberately does not carry the long-form
research, proposal, and design corpus – that stays on the published site. To
keep the boundary legible to an explorer rather than surprising:
PageRef.siteOnlymarks topics that exist on the published site but have no in-manual page, soaproposresults distinguish them visibly.- A concept page whose depth lives on the site ends with an explicit
SEE ALSOlink to the fuller site page. - The subset rule is itself documented in
intro(7), and a build check classifies everydocs/page as in-manual, summarized, or site-only so the boundary stays explicit as the docs grow.
Delivery and freshness
The man corpus and generated interface manuals are compiled to a compact,
read-only blob and delivered like the boot manifest – a boot-packaged payload
read through an offset/length method, mirroring BootPackageCap
(kernel/src/cap/boot_package.rs). The blob is built at make time, so a
given ISO’s manuals match that build. Every page carries the build/commit id
that produced it (ManPage.buildId plus Manual.buildInfo @5), so an
explorer of a possibly-drifting public build can always tell which capOS
version they are reading about. A build check fails the build if any in-tree
capability interface lacks a section-2 page. A future storage-backed Manual
service (once persistence exists) can serve a mutable corpus without changing
the capability shape.
Three Surfaces
All three surfaces are clients of the same Manual capability. None of them
re-implement documentation; they render ManPage values.
1. Shell man / apropos
A man builtin (and apropos alias for man -k) joins the shell’s existing
command dispatch (shell/src/main.rs, the match command { ... } block that
already handles help, caps, inspect, motd). It calls Manual.page /
Manual.apropos on a Manual cap held in shell state and paginates the body
to the terminal.
help, man, and inspect stay distinct and complementary:
helpremains the terse, cap-free built-in command list for first orientation;help <command>becomes a shortcut toman 1 <command>.man <name>is the full reference served by theManualcap.man <capability>complements the existinginspect <cap>command:inspectshows this instance,manshows the interface.
2. Self-served web-UI doc viewer (Implemented)
The self-served web UI service
(demos/remote-session-web-ui/) already holds capabilities in session state
and exposes them over HTTP routes that return JSON view-models – raw caps
never reach the browser. A Manual cap added to that service backs the routes
/api/man?name=§ion=, /api/apropos?q=, and /api/topics, and a viewer
page renders pages with a topic sidebar, a search box (apropos), and clickable
SEE ALSO links. This is the surface that makes a public explorer demo
self-explanatory: a viewer can browse the capability catalog and
concept pages with no shell and no external site.
As-built notes:
- The
Manualcap is granted to the web-UI service via the manifest ({kernel: "manual"}), looked up from the service CapSet alongsideconsole/sessions/broker, and never crosses to the browser; only the rendered JSON view-models do. The doc routes are read-only and require no login, sinceManualconfers no authority beyond documentation access. - The viewer is served as a separate
/manualpage, distinct from the login-proof page on/, so the session-proof leak assertions on/stay whole-body while the manual page may legitimately display capability interface names as documentation text. SEE ALSOand topic refs are rendered from the structuredseeAlso/pagesPageReflists (not by re-parsing body text).siteOnlyrefs link out to the published site in a new tab; in-manual refs navigate in-system. The Phase 1 corpus builder currently emits only in-manual refs (siteOnlyis always false); the viewer’ssiteOnlypath is proven against the shipped ref-classifier, and wiring realsiteOnlyrefs is follow-on corpus authoring.- The topic sidebar lists the section-based
topics @2index (Commands, Capabilities, …); the front-matter taxonomy-backed index remains future work. - Proof:
make run-remote-session-self-served-web-uidrives a headless browser that logs in, rendersconsole(2)with itssource/buildIdprovenance, searchesapropos timer, and asserts thesiteOnly-vs-in-manual link classifier – without leaking any raw cap or session internals to the browser.
3. Programmatic capnp API
Because Manual is an ordinary capability, any process or agent granted it can
fetch documentation programmatically – an SDK can surface inline help, and an
in-system agent can ground itself on the real interface contracts. describe @4
resolves a held interface id to its section-2 page, topics @2 returns the
navigation taxonomy, and buildInfo @5 (plus each page’s buildId) tells an
agent exactly which capOS build it is reading about. Agents get one
machine-readable index of everything the running system documents: the
man-shaped subset, navigated by manual section. That subset is a curated
projection of the published site rather than a byte-identical copy of its
llms.txt navigation – the in-system manual and the published full-tree index
deliberately differ in scope (see the Phase 4 settlement under Phasing), and the
invariant that holds is shared taxonomy provenance, not identical keys.
Navigation and Discovery (making it great)
The features that lift this above a flat page dump, all riding on the man
core:
- Topic index over
apropos.topics @2exposes the curated front-matter taxonomy;apropos @1does free-text keyword fallback against a keyword index built into the blob atmaketime (NAME lines + tagged keywords), not a linear scan of page bodies. A reader navigates by topic when they know the area and by keyword when they do not. SEE ALSOas a real graph.ManPage.seeAlsois structuredPageRefs, not prose, so every surface renders them as links and an explorer can walk the concept graph – the Plan 9 “pointers on demand” model.siteOnlyrefs link out to the published site.- Example-first quick views.
ManPage.examplescarriestldr-style snippets so the common case (“how do I actually usespawn?”) is answered in five lines before the full DESCRIPTION. - Provenance is visible.
source,lastReviewed, andbuildIdtravel with every page, so a reader can tell an auto-generated interface manual from curated prose and see which build it describes.
Authority and Security Model
- Read-only, no ambient authority.
Manualonly returns text. Holding it grants nothing but the ability to read documentation; it cannot mutate state or widen any other authority. Documentation access is itself a capability, consistent with Principle 1 (no ambient authority). - Scoped distribution. The web-UI service and operator shells receive
Manualvia manifest cap grants or theAuthorityBrokerbundle, exactly asSystemInfois distributed today. A public/anonymous web session can be granted aManualwith no risk, because it confers no authority. - Browser boundary unchanged. As with all web-UI routes, the browser
receives only rendered
ManPageJSON view-models; theManualcap never crosses into browser JavaScript. - No code execution. Pages are structured text rendered by the viewer. The Manual never serves executable content, so an explorable public instance does not gain a new code path from documentation serving.
Phasing
-
Phase 1 – capability + shell
man. (Implemented.) Defined theManualinterface, authored the man-shaped corpus, delivered the boot-packaged blob, implementedpage/apropos/topics/section/describe/buildInfo, stamped every page with the build/commit id, added the build check that forbids a missing section-2page, and added the shellman/aproposbuiltins. Section-2SYNOPSIS is schema-generated; section-2DESCRIPTION is sourced from.capnpdoc comments at build time (see Phase 1 as-built notes); sections1/7ship authored prose. Proof:make run-system-manual-smoke. -
Phase 2 – web-UI viewer. (Implemented.) Added
Manualto the self-served web-UI service and shipped the viewer page (topic sidebar,apropossearch, page provenance, and clickableSEE ALSOlinks withsiteOnlyexternal linking). This is the public-explorer-facing milestone. Proof:make run-remote-session-self-served-web-ui. -
Phase 3 – schema-derived section-2 DESCRIPTION. (Satisfied at Phase 1; proofs hardened.) The Phase-1 corpus builder already auto-generates the section-
2DESCRIPTION (and the per-method SYNOPSIS docs) from.capnpdoc comments and backsdescribe @4with a build-time interfaceId->page index, so the prose-drift window the original Phase-3 plan targeted is already closed. Phase 3 therefore reduces to locking and proving that already-landed contract, with two deliberate deviations from the original plan:- Fail-closed over warn-and-fallback. The build check fails when an
interface lacks a doc comment (
manualcenforce_coverage), rather than falling back to curated prose with a warning. Fail-closed keeps the served pages provably schema-sourced and is the reviewed Phase-1 choice; it is kept, not regressed. - No live runtime reflection. The prerequisite binding-preservation task
emits doc text as Rust
///attributes, which are not runtime-accessible data, and the kernel cannot introspect interface docs at call time. Live reflection indescribeis neither feasible at the currentcapnpversion nor necessary:manualcreparses the live schema each build andmake generated-code-checkenforces schema<->binding parity, so the served signatures and DESCRIPTION cannot drift fromschema/capos.capnp.
Proofs: a
manualchost test (describe_index_resolves_to_schema_derived_section2_pages) asserts that for every shipped interface thedescribe @4descriptor resolves to the same schema-derived section-2pagepage @0serves, with the doc comment in the DESCRIPTION body;make run-system-manual-smokeasserts the served Console DESCRIPTION body and a method-doc line both originate from doc-comment text. Open follow-up:describe @4has no userspace client today (the shellmanbuiltin andManualClientusepage @0), so a runtimedescribeexerciser would need aManualClient::describe_waitplus a caller (out of this slice’stools/+kernel/src/cap/manual.rsscope). - Fail-closed over warn-and-fallback. The build check fails when an
interface lacks a doc comment (
-
Phase 4 – programmatic API + agent export. (Settled with deviation.) The programmatic surface this phase set out to stabilize –
describe @4(interfaceId -> section-2 page),buildInfo @5(build/commit provenance), andtopics @2(navigation taxonomy) – already shipped at Phase 1 with kernel dispatch (kernel/src/cap/manual.rs) andManualClientmethods (capos-rt/src/client.rs). Phase 4 therefore reduces to declaring that contract stable for SDK/agent consumers and locking the consistency invariants that genuinely hold, with one deliberate deviation from the original “unify … so they share one source” framing:describe @4/buildInfo @5are stable programmatic API.describe @4is the id-keyed complement to section-2 browsing (it takes an interface id the caller already holds; it does not reflect on a live object) and resolves through the build-time descriptor indexmanualcemits.buildInfo @5returns the corpus commit, and every page carries that same commit as itsbuildId, so an agent grounding itself on a possibly-drifting public build can always resolve which capOS build any page describes. These are additive, read-only methods; widening them later stays Cap’n Proto-compatible.- Deviation: byte-identical navigation keys / a single source pass are
infeasible and undesirable. The original plan called for the in-system
topics @2list to be identical to the publishedllms.txtnavigation keys, emitted from one source pass. That contradicts the reviewed subset rule: the in-system manual is the man-shaped subset navigated by manual section (topics @2keyscommands/capabilities/runtime-sdk/manifests-schemas/concepts/operations), while the publishedllms.txtindexes the fulldocs/tree navigated bydocs/SUMMARY.mdsite sections (Start Here/Runnable Demos/System Architecture/…). The two key sets are deliberately disjoint and cover different content scopes; forcing identity would either shrink the published agent index to the man subset or replace the manual’s section navigation, regressing a shipped, asserted artifact. The two artifacts are also produced by separate compilers in separate languages with separate inputs (tools/manualc, a Rust capnp-blob compiler readingschema/capos.capnp+docs/manual/; andtools/mdbook-doc-metadata/generate-llms-txt.js, a Node site generator readingdocs/SUMMARY.md+ page front matter), so a literal single pass is a re-architecture, not a documentation slice. - What is delivered instead: the shared invariants that cannot diverge are
locked. The genuinely shared dimension is the front-matter/
topicstaxonomy provenance and the schema-derived section-2 interface membership, not the top-level navigation keys. Amanualchost test (topics_taxonomy_is_canonical_and_stable) pins the servedtopics @2taxonomy to its canonical key/title/section/order projection so the navigation cannot silently drift, andbuild_info_commit_grounds_every_pageproves thebuildInfo @5round-trip stamps every page with the corpus commit.
Proofs: the two
manualchost tests above;make run-system-manual-smokeassertsbuildInfo @5grounds pages of both provenance kinds (the schema-derivedconsole(2)and the authoredspawn(1)) with one consistent, non-placeholder build id. Open follow-up: a true single-source agent export shared with the publishedllms.txtwould require reconciling the subset rule (a documented man-subset projection of the site taxonomy) plus schema, Makefile, corpus, and kernel scope beyond this slice’stools/+ proposal surface; anddescribe @4/topics @2still have no shell exerciser (the shellman/aproposbuiltins usepage @0/apropos @1), carried from the Phase 3 follow-up.
Sequencing and Priority
Recording this design now is cheap and the public-explorer angle is real, but
Phase 1 implementation competes with foundational work. capOS does not yet
have persistence, a userspace network stack, or the DDF Task-5 userspace device
authority gate closed. Phase 1 (capability + boot-packaged corpus + shell
man) is buildable on current infrastructure and does not depend on those
gates; Phase 2 depends only on the already-implemented self-served web UI.
Unless a public demo is imminent, sequence Phase 1 behind the foundational
milestones on the priority ladder rather than ahead of them. The schema
doc-comment authoring and binding-preservation prerequisites (Phase 3) can
proceed independently and are worth doing regardless, because they also improve
review and the published interface docs.
Relationship to Existing Proposals
- SystemInfo proposal: retire the
# Future: helpTopics @2 / manPage @3stubs there in favor of this capability; SystemInfo stays scalar (MOTD/hostname/host metadata). That proposal’s front matter and interface comment should be updated to point here when this lands. - mdBook documentation-site proposal: complementary, not competing. mdBook
remains the host-rendered public site (the long-form depth);
Manualis the in-system concise reference. They share the front-matter/topics taxonomy provenance, so navigation cannot drift on the taxonomy it is drawn from. They do not share top-level navigation keys: the site’sllms.txtindexes the fulldocs/tree bydocs/SUMMARY.mdsection, whiletopics @2indexes the man-shaped subset by manual section. The Phase-4 agent export is the in-system manual’s own machine-readable index of that subset, not a copy of the site’sllms.txt(see the Phase 4 settlement under Phasing). - Remote-session UI security proposal: the web-UI viewer inherits its view-model and browser-boundary rules; this proposal adds no new authority-bearing route.
- Interactive command surfaces proposal: a future typed
CommandSessioncould host a richer in-shell pager forman, but Phase 1 uses the existing line-based terminal write path. - Cloud deployment / public-release boundaries: in-system docs are a prerequisite for a self-explanatory public explorer demo; this proposal is the documentation half of that story.
Open Questions
- Subset boundary precision. The subset rule (sections
1/2/3/5full,7/8summaries, long-form on the site) needs a concrete inclusion list, and the build check that classifies eachdocs/page as in-manual / summarized / site-only must be authored so the boundary stays legible as docs grow. - Schema doc-comment authoring. Section-
2DESCRIPTION quality (Phase 3) depends on writing real doc comments acrossschema/capos.capnp; that authoring is its own tracked work and gates the auto-generated path. A REVIEW.md rule now requires doc comments on new/changed interfaces so the gap does not regrow. - Structured page schema. Whether man sections are a fixed set of typed fields or a single tagged-text body; leaning toward a small fixed set so both renderers stay trivial.
Design Grounding
- Capability dispatch and
interface_id():capos-lib/src/cap_table.rs; boot-packaged read-only blob pattern:kernel/src/cap/boot_package.rs. - Shell command dispatch for the
manbuiltin:shell/src/main.rs. - Web-UI cap-holding + view-model boundary:
demos/remote-session-web-ui/. - Front-matter / topics taxonomy reused for
topics:tools/mdbook-doc-metadata/and mdBook proposal. - Prior interface sketch and the scope tension this resolves: SystemInfo proposal.
Relevant Research
- Unix manual conventions and section ordering –
man-pages(7). - Plan 9’s section split (commands / devices / file servers / protocol /
formats) and “follow the documentation pointers on demand” navigation model,
which motivate the capability-section mapping and the
SEE ALSOgraph (docs/research/Plan 9 report). - Cap’n Proto runtime reflection (
RawStructSchema/ dynamic schema), the basis for schema-derived section-2SYNOPSIS and DESCRIPTION. - Modern discovery affordances –
tldr/apropos/topic indexes – adopted as a navigation layer over, not a replacement for, themancore.
Proposal: Interactive Command Surfaces
Typed command surfaces for native interactive applications without moving
application parsing into StdIO text streams.
Current Target Versus Future Design
The immediate target is deliberately narrower than this proposal:
capos-shellexposes generic process control commands, includingspawnfor asynchronous launch andrunfor launch-and-wait.- Chat and adventure clients are ordinary spawned commands, not shell builtins.
- Interactive child I/O uses an explicit
StdIOendpoint client with stdin/stdout/stderr-shaped semantics while the shell keeps ownership of itsTerminalSession. - Focused QEMU smokes prove the resident-service plus shell-spawned-client path before the native command protocol hardens.
The future native design is the CommandSession/CommandSurface protocol
below. It should replace semantic command parsing inside chat/adventure
clients once the prototype has proved the process, grant, wait, and terminal
bridging mechanics.
The native shell substrate this proposal extends is described in Shell; the agent-mode tool-use loop that will consume the same command surfaces as typed tool descriptors lives in Language Models and Agent Runtime.
Problem
The current chat/adventure worktree moved application commands out of
capos-shell builtins and into ordinary shell-spawned clients. That fixes one
bad boundary, but it leaves another one: the clients read lines from StdIO
and parse command text such as go north, take key, /join #lobby, and
say hello themselves.
That is still too stringly for capOS. The kernel and services already expose
typed capabilities. Native interactive applications should not receive their
primary operation as an unstructured terminal line and then rebuild an ad hoc
parser. StdIO is useful for textual programs, logs, compatibility layers,
and simple smoke harnesses. It is not the right semantic boundary for a native
application command language.
The other design pressure is terminal reuse. The same native shell should work from a local UART, GUI pane, web terminal, or test harness. That argues for a terminal host process that owns terminal transport and rendering separately from the shell process that owns command routing and capability context.
Goals
- Keep application-specific verbs out of
capos-shell. - Keep application command semantics out of unstructured
StdIOtext parsing. - Let a user type familiar command forms such as
go northorchat join #lobbywhile the executable representation is a typed invocation. - Support nested subcommands without hardcoding app grammar into the shell.
- Let terminal hosts provide line editing, completion, history, resize, and GUI/web rendering from the same command metadata.
- Preserve typed service authority: parsing a command never grants access, and every effect still requires the right capability.
Non-Goals
- POSIX shell compatibility.
- A global command namespace.
- Making terminal text a security boundary.
- Removing
StdIO; it remains the byte/text stream adapter for programs whose interface really is textual.
Layering
flowchart TD
Uart[UART TerminalHost] --> Terminal[Terminal entity]
Web[Web TerminalHost] --> Terminal
Gui[GUI TerminalHost] --> Terminal
Terminal --> Shell[Native shell session]
Shell --> Cmd[Interactive CommandSession]
Cmd --> Adventure[Adventure service cap]
Cmd --> Chat[Chat service cap]
Shell --> Launcher[Restricted launcher]
Shell --> Broker[AuthorityBroker]
The terminal host owns raw input/output, line discipline, presentation state,
history, paste handling, resize events, and later GUI/web affordances. The
terminal entity is the session object the host exposes to a foreground shell or
application view. TerminalSession remains the capability boundary for a
foreground text session, but it does not have to be implemented inside the
shell.
The native shell owns command namespace, current capability context, spawn/wait state, and policy-mediated bundle changes. It can run from any terminal host because it talks to the terminal entity, not to a particular UART.
An interactive application owns a CommandSession. It exposes a command
surface and receives structured invocations. The application may be a thin
adapter over service capabilities, as the adventure client should be, or a
resident service may expose the command session directly.
Command Pattern
command <args> is acceptable as user-facing syntax, but it must not become
the application ABI. It is a parseable notation for a declared command surface.
The shell or terminal host parses text into a CommandInvocation; the
application receives typed fields.
Conceptual schema:
struct CommandSurface {
revision @0 :UInt64;
prompt @1 :Text;
commands @2 :List(CommandSpec);
}
struct CommandSpec {
path @0 :List(Text);
summary @1 :Text;
args @2 :List(CommandArg);
flags @3 :List(CommandFlag);
redaction @4 :List(RedactionClass);
}
struct CommandArg {
name @0 :Text;
kind @1 :CommandValueKind;
required @2 :Bool;
variadic @3 :Bool;
restOfLine @4 :Bool;
completions @5 :CompletionSource;
}
struct CommandInvocation {
surfaceRevision @0 :UInt64;
path @1 :List(Text);
args @2 :List(CommandValue);
flags @3 :List(CommandFlagValue);
}
interface CommandSession {
describe @0 () -> (surface :CommandSurface);
invoke @1 (command :CommandInvocation) -> (result :CommandResult);
poll @2 (maxEvents :UInt16) -> (events :List(CommandEvent));
close @3 () -> ();
}
The parser is generic:
- Match the longest declared command path.
- Parse arguments according to the declared shapes.
- Treat ambiguous prefixes as errors with alternatives.
- Treat
restOfLineas one text argument; do not split it again in the app. - Attach redaction metadata before audit or transcript recording.
- Re-read
CommandSurfacewhen a command returns a new revision.
The application can still reject a typed invocation if the command is no longer valid. That is ordinary semantic validation, not text parsing.
Subcommand Nesting
Nested subcommands work if the command path is represented as a token list rather than a single string. Examples:
go north
take brass-key
say hello there
chat join #lobby
chat who
inventory equip lantern
admin npc spawn wanderer room=atrium
Those become:
path=["go"], args={direction:"north"}
path=["take"], args={item:"brass-key"}
path=["say"], args={text:"hello there"}
path=["chat","join"], args={channel:"#lobby"}
path=["chat","who"], args={}
path=["inventory","equip"], args={item:"lantern"}
path=["admin","npc","spawn"], args={kind:"wanderer", room:"atrium"}
The shell does not need adventure-specific code for any of these. It needs a
generic command tree, longest-prefix matching, value parsers, and completion
hooks. The same mechanism can describe shell commands such as spawn, wait,
login, and caps, even if the implementations remain inside the shell for
now.
Subcommand nesting is also a better fit for GUI/web sessions than raw StdIO.
A terminal host can render chat join as a command palette entry, offer room
completions for go, or show buttons for zero-argument commands such as
look, all from the same metadata.
Adventure Shape
The adventure command session should own only the caps it needs:
adventure Adventure or Endpoint client cap
chat Chat or Endpoint client cap
session optional UserSession metadata cap
It should expose a dynamic surface derived from current player state:
lookgo <direction>with room-specific direction completionstake <item>with visible item completionsdrop <item>with inventory completionsinventorysay <text...>withrestOfLine=truechat join <channel>chat whoquit
The shell or terminal host parses those forms. The adventure command session
turns the resulting invocation into typed Adventure and Chat calls. The
adventure service still validates the session-bound caller identity, room,
exits, items, and chat channel authority. Dynamic completions are convenience,
not authority.
This is the balance capOS wants: generic shell integration, app-owned command metadata, typed service calls, and no application-specific shell builtins.
The same describe-returned CommandSurface is the metadata source the agent
runner in Language Models and Agent Runtime projects to
typed tool descriptors with per-tool permission modes (auto / consent /
stepUp / forbidden). A command surface is therefore not only a shell parsing
input – it is the contract surfaced to interactive operators, scripted
harnesses, and model-driven tool-use loops alike.
Role of StdIO
StdIO remains useful, but it should be demoted to a transport and
compatibility interface:
- output streams for simple textual programs,
- test harnesses that script input and check transcript output,
- POSIX personality descriptor emulation,
- applications whose real protocol is text.
For capOS-native interactive applications, StdIO.read() should not be the
primary command interface. A command session can still emit render events that
the shell forwards to a terminal host, and a compatibility adapter can expose
the same session as text when necessary.
Terminal Host Separation
The shell should not permanently own the terminal implementation. A separate terminal host process gives the system one shell that can be reused across different front ends:
- local UART host for QEMU and early hardware,
- web host for browser terminal sessions,
- GUI host for a desktop pane or command palette,
- test host for smoke scripts.
Each host owns a terminal entity and grants a foreground TerminalSession or
equivalent view to the shell. The shell runs command sessions and returns
render/update events. The host decides how to display them.
This also avoids a future false choice between “shell owns the terminal” and “child process receives the terminal.” The terminal entity can support a foreground lease, shell-mediated command sessions, and later split panes or GUI widgets without making every child process a terminal driver.
Migration Plan
- Land the current shell-spawned
StdIOclients as an explicit prototype: no app-specific shell builtins, no terminal-cap delegation to children, andrunavailable for blocking command execution. - Add focused QEMU smokes for chat and adventure against that prototype so the resident service, exact grants, wait path, and terminal bridge have a stable regression target.
- Add a userspace
CommandSessionDTO/protocol in the shared demo/runtime layer, carried over ordinaryEndpointuntil a manifest-visible interface is worth committing. - Teach
capos-shella generic command-surface parser and command-provider registry. Do not addchat,play adventure,go,take, or similar application verbs as hardcoded shell matches. - Move adventure command parsing out of
demos/adventure-client/and into command descriptors plus typedAdventure/Chatinvocations. - Split terminal hosting from the shell when the local UART path needs to
support a second front end or when the web terminal work starts. Until then,
keep the current terminal implementation constrained to the
TerminalSessionboundary so the split is mechanical.
See Shell for the broader native-shell
authority model the command-surface protocol plugs into, and
Language Models and Agent Runtime for the agent-mode
consumer that turns the same CommandSurface metadata into typed tool
descriptors with explicit permission modes.
Proposal: Userspace Authority Broker and Init-Owned Shutdown
Problem
The current shell authentication path uses a kernel AuthorityBroker
capability. The shell starts with anonymous authority, calls the broker for an
anonymous bundle, then calls it again after password login for an operator
bundle. That works, but it places session policy, launcher policy, and shell
bundle construction inside the kernel.
That is the wrong long-term boundary. The kernel should provide primitive mechanisms: process creation, capability transfer, endpoint rendezvous, memory, terminal I/O, and process lifecycle. Login policy, operator profiles, service allowlists, and shell bundles are userspace policy and should be owned by init or an init-managed service.
Shutdown exposes the same issue. A shutdown command should not be a raw kernel poweroff capability passed to the shell. The natural capOS behavior is that the kernel halts when init and all remaining processes are gone. Shutdown policy should therefore be implemented as init-owned lifecycle orchestration: stop services, wait for them, release authorities, and then let init exit.
Current State
Implemented pieces today:
- The kernel starts one init process from the boot manifest.
- Init reads
BootPackage, validates the init-owned service graph, spawns services, records exported capabilities, and waits for children. - The shell receives a terminal and anonymous authority, then upgrades after password login.
AuthorityBrokeris a kernel capability implemented inkernel/src/cap/authority_broker.rs.- Demo launcher policy that used to live as kernel-side binary and worker
allowlist constants is now carried by
kernelParams.authorityBrokerPolicyin the boot manifest.capos-configvalidates that referenced binaries exist, duplicate entries are rejected, worker service grant names are explicit, and unknown worker service origins fail closed. ProcessHandlesupportswait, but not termination.- There is no init-owned lifecycle control capability yet.
The consequence is a mixed trust boundary: init owns service graph execution, but the kernel still owns shell session bundle policy.
Goals
- Move authority-broker policy out of the kernel.
- Make init, or an init-managed broker service, responsible for authenticated shell bundles.
- Keep shell unauthenticated authority minimal.
- Make shutdown an init-owned control operation, not a direct kernel shutdown cap.
- Preserve the kernel rule that the system halts naturally when the last process exits.
- Keep all authority transfer explicit and inspectable through capabilities.
Non-Goals
- Do not add ambient service names or a global service registry.
- Do not give shell raw
ProcessSpawnerbefore authentication. - Do not add a kernel “kill everything” syscall.
- Do not introduce restart policy, persistence, or crash recovery in this proposal.
- Do not solve multi-user policy; this proposal only moves the current local operator/anonymous policy out of the kernel.
Proposed Architecture
Init starts two policy-facing services:
authority-broker: userspace service that owns shell bundle policy.shell: interactive shell, initially anonymous.
Init also keeps a private lifecycle table for services it spawned. That table contains process handles, service names, restart policy state, and shutdown ordering metadata. Init does not expose the raw table. It exposes attenuated control capabilities.
Capability Graph
flowchart TD
Kernel[Kernel primitives] --> Init[init]
Kernel --> Terminal[TerminalSession]
Kernel --> Spawner[ProcessSpawner]
Kernel --> Sessions[SessionManager]
Kernel --> Audit[AuditLog]
Kernel --> Creds[CredentialStore]
Init --> Broker[authority-broker service]
Init --> Shell[shell]
Init --> Services[managed services]
Broker --> ShellAnon[anonymous shell bundle]
Broker --> ShellOp[operator shell bundle after login]
Init --> Shutdown[init-owned ShutdownControl]
Broker --> ShellOp
Shutdown --> ShellOp
The shell talks to the broker over an endpoint. Before login, the broker returns an anonymous bundle with no service-management authority. After login, the broker returns an operator bundle that includes a restricted launcher and, if policy allows, an init-owned shutdown control capability.
Interfaces
The exact schema can evolve, but the minimum shape should separate broker policy from init lifecycle control.
interface AuthorityBroker {
shellBundle @0 (sessionCapId :UInt32, profile :Text)
-> (launcherIndex :UInt16,
sessionIndex :UInt16,
hasShutdownControl :Bool,
shutdownControlIndex :UInt16);
}
interface ShutdownControl {
shutdown @0 () -> ();
}
interface ProcessHandle {
wait @0 () -> (exitCode :Int64);
terminate @1 (reason :Text) -> ();
}
AuthorityBroker can be implemented as a userspace service using endpoint IPC
instead of a kernel cap. ShutdownControl is produced by init, not by the
kernel. ProcessHandle.terminate is a primitive lifecycle operation, but the
kernel only targets one process handle; init owns the policy that decides which
handles to terminate and in what order.
Shutdown Flow
- Shell starts anonymous and does not hold
ShutdownControl. - User runs
login. - Shell obtains an operator bundle from the userspace broker.
- If policy allows, the bundle includes
ShutdownControl. - User runs
shutdown. - Shell invokes
ShutdownControl.shutdown. - Init stops accepting new service operations.
- Init asks managed services to terminate in dependency order.
- Init waits for all service handles to exit.
- Init releases remaining capabilities and exits.
- The kernel observes no remaining runnable user processes and halts through the existing last-process-exited path.
This keeps final machine shutdown in the kernel, but keeps shutdown authority and orchestration in userspace.
Broker Migration Plan
Phase 1: Define Userspace Interfaces
- Add schema for endpoint-served
AuthorityBrokerandShutdownControl. - Keep the kernel broker temporarily for compatibility.
- Keep the manifest-owned
authorityBrokerPolicyshim as the compatibility source for admitted demo binaries until the userspace broker owns equivalent policy directly. - Add runtime clients for both interfaces.
- Add QEMU proof that an anonymous shell cannot call shutdown.
Phase 2: Init-Owned Shutdown
- Extend init with a lifecycle table for spawned services.
- Add a private init service endpoint for shutdown requests.
- Add
ProcessHandle.terminateor equivalent single-process lifecycle primitive. - Make init terminate and wait for managed services before exiting.
- Add QEMU proof that
shutdownafter login exits QEMU cleanly.
Phase 3: Userspace Authority Broker
- Implement
authority-brokeras a userspace service. - Grant it only the policy inputs and capabilities needed to mint shell bundles.
- Have shell obtain anonymous and operator bundles from that service.
- Keep shell without raw
ProcessSpawner; it should receive only restricted launch authority. - Add QEMU proof that pre-login shell cannot spawn privileged services and post-login shell can run the expected demo commands.
Phase 4: Retire Kernel Broker
- Remove
kernel/src/cap/authority_broker.rs. - Remove
KernelCapSource::AuthorityBroker. - Remove kernel-side broker bundle construction and tests.
- Update docs so the kernel boundary is again primitive-only.
Security Properties
- Shell starts without shutdown authority.
- Shutdown authority is granted only after an authenticated session is proven.
- The broker cannot invent kernel powers; it can only delegate capabilities it received from init.
- Init remains the root of service lifecycle policy.
- Kernel process termination remains per-handle, not global.
- Service shutdown is auditable because it flows through init and named process handles.
Open Questions
- Should
ShutdownControl.shutdownbe one-way, or should it return staged progress events before init exits? - Should services receive a graceful
StdIOclose, a typed lifecycle signal, or onlyProcessHandle.terminate? - Should the broker be a separate process, or should init directly expose the broker endpoint until service supervision is stronger?
- How should restart policies interact with shutdown mode?
- Should shutdown require a fresh authentication event, or is the current operator session sufficient?
Verification
Required QEMU proofs:
- Anonymous shell:
shutdownis denied or unavailable. - Operator shell: login returns shutdown authority.
- Shutdown command causes init to terminate managed services and exit.
- QEMU exits through the existing last-process halt path.
- Existing adventure/chat demo still works before shutdown.
Host tests should cover:
- Broker policy decisions for anonymous vs operator profiles.
- Init shutdown ordering over a synthetic lifecycle table.
- Manifest validation rejecting direct shell access to privileged lifecycle primitives before login.
Proposal: Go Language Support via Custom GOOS
Running Go programs natively on capOS by implementing a GOOS=capos target
in the Go runtime.
Current Manual Pages
- Go VirtualMemory Contract freezes the current allocator-facing memory contract for this proposal.
- Programming Languages summarizes the current
language support matrix and the distinction between native runtime adapters,
POSIX compatibility adapters, and WASI host adapters. The Go row points back
here for the native
GOOS=capostrack and to the WASI host adapter’s Phase W.8 TinyGo / upstreamGOOS=wasip1CUE evaluator path. - Userspace Binaries holds the overall
language-runtime track. Its “Future: Go (
GOOS=capos)” section delegates the native plan to this proposal, and its “Phase W.8 (TinyGo / Go-on-WASI CUE evaluator, blocked)” entry tracks the WASI-side interim path. - WASI Host Adapter documents the in-tree
wasmi-backed host. Phase W.8 there is the TinyGo / upstream Go
(
GOOS=wasip1) CUE evaluator slice that runs inside the host adapter and bridges to the native Go track described in this proposal. The detailed plan lives in WASI Host Adapter Task 9. - In-Process Threading freezes the thread/process ownership contract that Phase 2 of this proposal builds on.
- Park Authority freezes the compact
CAP_OP_PARK/CAP_OP_UNPARKABI that the Go runtime’s futex glue must target instead of a Linux-style futex syscall namespace. - Memory Management documents the implemented
kernel memory and baseline
VirtualMemorybehavior. - Userspace Runtime documents the
capos-rtclient surface that a future Go runtime port will call. - LLVM Target is the main research grounding for Go runtime and target-triple work.
Motivation
Go is the implementation language of CUE, the configuration language planned for system manifests. Beyond CUE, Go has a large ecosystem of systems software (container runtimes, network tools, observability agents) that would be valuable to run on capOS without rewriting.
The userspace-binaries proposal keeps Go as a dedicated future runtime track.
This proposal explores the native path: a custom GOOS=capos that lets Go
programs run directly on capOS hardware, without a WASM interpreter in between.
Go through WASI remains a narrower option for CPU-bound tools such as CUE
evaluation before the native runtime port exists.
Why Go is Hard
Go’s runtime is a userspace operating system. It manages its own:
- Goroutine scheduler — M:N threading (M OS threads, N goroutines), work-stealing, preemption via signals or cooperative yield points
- Garbage collector — concurrent, tri-color mark-sweep, requires write barriers, stop-the-world pauses, and memory management syscalls
- Stack management — segmented/copying stacks with guard pages, grow/shrink on demand
- Network poller — epoll/kqueue-based async I/O for
net.Conn - Memory allocator — mmap-based, spans, mcache/mcentral/mheap hierarchy
- Signal handling — goroutine preemption, crash reporting, profiling
Each of these assumes a specific OS interface. The Go runtime calls ~40 distinct syscalls on Linux. capOS currently has 2.
Syscall Surface Required
The Go runtime’s Linux syscall usage, grouped by subsystem:
Memory Management (critical, blocks everything)
| Go runtime needs | Linux syscall | capOS equivalent |
|---|---|---|
| Heap allocation | mmap(MAP_ANON) | VirtualMemory.reserve + commit, or compatibility map |
| Heap deallocation | munmap | VirtualMemory.unmap releases reservations and committed frames |
| Stack guard pages | mmap(PROT_NONE) + mprotect | Reserve uncommitted guard pages; use committed VM_PROT_NONE only when contents must be retained |
| GC needs contiguous arenas | mmap with hints | Contiguous virtual reservations; physical frames are committed sparsely |
| Commit/decommit pages | madvise(DONTNEED) | VirtualMemory.commit / decommit within reserved ranges |
capOS needs: A sys_mmap-like capability or syscall that can:
- Map anonymous pages at arbitrary user addresses
- Set per-page permissions (R, W, X, none)
- Allocate contiguous virtual ranges without requiring contiguous physical frames
- Decommit without unmapping (for GC arena management)
This could be a VirtualMemory capability:
interface VirtualMemory {
# Map anonymous pages at hint address (0 = kernel chooses)
map @0 (hint :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
# Unmap pages
unmap @1 (addr :UInt64, size :UInt64) -> ();
# Change permissions on mapped range
protect @2 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
# Reserve virtual address space without physical frames
reserve @3 (hint :UInt64, size :UInt64) -> (addr :UInt64);
# Commit physical frames inside a reserved range
commit @4 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
# Decommit physical frames while keeping the range reserved
decommit @5 (addr :UInt64, size :UInt64) -> ();
}
The exact Go allocator contract is frozen in
Go VirtualMemory Contract: map
stays a compatibility operation, while reserve, commit, and decommit
separate virtual address reservation from physical frame commitment and make
guard-page behavior explicit.
Threading (critical for goroutines)
| Go runtime needs | Linux syscall | capOS equivalent |
|---|---|---|
| Create OS thread | clone(CLONE_THREAD) | Thread capability / in-process thread lifecycle |
| Thread-local storage | arch_prctl(SET_FS) | ThreadControl.setFsBase; per-ThreadRef TLS ownership for Go integration |
| Block thread | futex(WAIT) | ParkSpace compact CAP_OP_PARK |
| Wake thread | futex(WAKE) | ParkSpace compact CAP_OP_UNPARK |
| Thread exit | exit(thread) | ThreadControl.exitThread capability operation |
capOS baseline: process-local thread lifecycle and private ParkSpace
wait/wake exist as the kernel substrate. The remaining Go work is runtime
integration: capos-rt clients, newosproc glue, per-ThreadRef TLS ownership,
and GC/runtime coordination across those kernel threads.
ThreadControl.setFsBase is a current-ThreadRef operation, not a
process-global mutation. Go integration must allocate a distinct TLS block and
FS base for each runtime M/OS thread, and context switch must preserve FS base
as per-thread state before true multi-threaded Go is treated as supported.
Design alternatives considered:
Option A: Kernel threads. The kernel manages threads (multiple execution contexts sharing one address space). Each thread has its own stack, register state, and FS base, but shares page tables and cap table with the process. This is what Linux does and what Go expects.
Option B: User-level threading. The process manages its own threads (like green threads). The kernel only sees one execution context per process. Go’s scheduler already does M:N threading, so it could work with a single OS thread per process — but the GC’s stop-the-world relies on being able to stop other OS threads, and the network poller blocks an OS thread.
Option A is the selected substrate for Go compatibility. Option B is more capability-aligned (threads are a process-internal concern), but it requires larger Go runtime modifications and does not fit the current kernel-thread checkpoint.
Synchronization
| Go runtime needs | Linux syscall | capOS equivalent |
|---|---|---|
| Park wait | futex(FUTEX_WAIT) | ParkSpace compact CAP_OP_PARK |
| Park wake | futex(FUTEX_WAKE) | ParkSpace compact CAP_OP_UNPARK |
| Atomic compare-and-swap | CPU instructions | Already available (no kernel support needed) |
Linux futexes are a kernel primitive (block/wake on a userspace address). capOS
exposes park authority through a ParkSpace capability from the start. Go
futex glue should target the compact capability-authorized park operations
defined in the ParkSpace architecture rather than introducing a Linux-style
futex syscall namespace or routing failed wait / empty wake through generic
Cap’n Proto method dispatch. Blocked/resume performance still needs measurement
under Go’s runtime workload, but that does not change the authority or key
model.
Time
| Go runtime needs | Linux syscall | capOS equivalent |
|---|---|---|
| Monotonic clock | clock_gettime(MONOTONIC) | Timer cap .now() |
| Wall clock | clock_gettime(REALTIME) | Timer cap or RTC driver |
| Sleep | nanosleep or futex with timeout | Timer cap .sleep() or park timeout |
| Timer events | timer_create / timerfd | Timer cap with callback or poll |
Timer cap now and sleep are implemented for monotonic time and bounded
sleep. Wall-clock time and timerfd-style event sources remain future work.
ThreadControl getFsBase and setFsBase are implemented for current-process
runtime FS-base ownership; making FS base per-thread remains part of kernel
threading.
I/O
| Go runtime needs | Linux syscall | capOS equivalent |
|---|---|---|
| Network I/O | epoll_create, epoll_ctl, epoll_wait | Async cap invocation or poll cap |
| File I/O | read, write, open, close | Directory/File or Namespace/Store caps through Go’s OS adapter |
| Stdout/stderr | write(1, ...), write(2, ...) | Console cap |
| Pipe (runtime internal) | pipe2 | IPC caps or in-process channel |
Go’s network poller (netpoll) is pluggable per-OS — each GOOS provides
its own implementation. For capOS, it would use async capability invocations
or a polling interface over socket caps.
Signals (for preemption)
| Go runtime needs | Linux syscall | capOS equivalent |
|---|---|---|
| Goroutine preemption | tgkill + SIGURG | Thread preemption mechanism |
| Crash handling | sigaction(SIGSEGV) | Page fault notification |
| Profiling | sigaction(SIGPROF) + setitimer | Profiling cap (optional) |
Go 1.14+ uses asynchronous preemption: the runtime sends SIGURG to a
thread to interrupt a long-running goroutine. On capOS, alternatives:
- Cooperative preemption only. Go inserts yield points at function prologues and loop back-edges. This works but means tight loops without function calls won’t yield. Acceptable for initial support.
- Timer interrupt notification. The kernel notifies the process (via a cap invocation or a signal-like mechanism) when a time quantum expires. The notification handler in the Go runtime triggers goroutine preemption.
Implementation Strategy
Phase 1: Minimal GOOS (single-threaded, cooperative)
Fork the Go toolchain, add GOOS=capos GOARCH=amd64. Implement the minimum
runtime changes:
What to implement:
osinit()— read Timer cap from CapSet for monotonic clocksysAlloc/sysFree/sysReserve/sysMap— translate to VirtualMemory capsettls()— translate Go’s FS-base install to ThreadControlnewosproc()— stub (single OS thread, M:N scheduler still works with M=1)futexsleep/futexwake— spin-based fallback (no real futex yet)nanotime/walltime— Timer capwrite()(for runtime debug output) — Console capexit— sys_exit for current-thread termination; the process exits when its last live thread exitsexitThread— terminalThreadControl.exitThreadcapability operationnetpoll— stub returning “nothing ready” (no async I/O)
What to stub/disable:
- Signals (no SIGURG preemption, cooperative only)
- Multi-threaded GC (single-thread STW is fine initially)
- CGo (no C interop)
- Profiling
- Core dumps
Deliverable: GOOS=capos go build ./cmd/hello produces an ELF that
runs on capOS, prints “Hello, World!”, and exits.
Current capOS status: the single-thread-runtime QEMU demo proves the
capability-side checkpoint for this phase without a Go fork yet. It maps,
protects, and frees heap pages through VirtualMemoryClient, uses TimerClient
for monotonic now and sleep, keeps newosproc unsupported, and exercises the
temporary park fallback path locally.
Estimated effort: ~2000-3000 lines of Go runtime code (mostly in
runtime/os_capos.go, runtime/sys_capos_amd64.s,
runtime/mem_capos.go). Reference: runtime/os_js.go (WASM target) is
~400 lines; runtime/os_linux.go is ~700 lines. capOS sits between these.
Phase 2: In-Process Threading + Park
Build on implemented kernel support for:
- multiple threads per process on the single-CPU scheduler first;
- private
ParkSpacecompact wait/wake; - current-thread FS-base updates through
ThreadControl.
Update Go runtime:
newosproc()creates a real kernel threadfutexsleep/futexwakeuse theParkSpacecompact park ABI- thread creation allocates and owns distinct TLS state per
ThreadRef - GC can coordinate across multiple kernel threads in one process
- Enable real blocking instead of the temporary single-thread park fallback
Deliverable: Go programs can create multiple in-process kernel threads and block/wake through futexes on one CPU. Multiple CPU-core execution remains a later SMP milestone after the threading/park contract is settled.
The 7.1.0 thread/process ownership contract is now frozen in
In-Process Threading. It keeps address space,
cap table, CapSet, and the capability ring process-owned; makes saved context,
kernel stack, block state, and FS base thread-owned; charges thread records and
kernel stacks to process-owned ledgers; and preserves a single process ring
waiter until a later ring-sharding design exists.
The 7.1.1 park authority contract is frozen in
Park Authority. It defines process-local
ParkSpace authority for private park keys, a future MemoryObject-derived
SharedParkSpace model for shared park-words, and compact CAP_OP_PARK /
CAP_OP_UNPARK operations as the starting ABI for the Go runtime
synchronization path.
Phase 3: Network Poller
Implement runtime/netpoll_capos.go:
- Register socket caps with the poller
- Use an async notification mechanism (capability-based
poll()or notification cap) net.Dial(),net.Listen(),http.Get()work
This depends on the networking stack being available as capabilities.
Deliverable: Go HTTP client/server runs on capOS.
Phase 4: CUE on capOS
With Go working, CUE runs natively. This enables:
- Runtime manifest evaluation (not just build-time)
- Dynamic service reconfiguration via CUE expressions
- CUE-based policy enforcement in the capability layer
Kernel Prerequisites
| Prerequisite | Roadmap Stage | Why |
|---|---|---|
| Capability syscalls | Stage 4 (sync path done) | Go runtime invokes caps (VirtualMemory, Timer, Console) |
| Scheduling | Stage 5 (core done) | Go needs timer interrupts for goroutine preemption fallback |
| IPC + cap transfer | Stage 6 | Go programs are service processes that export/import caps |
| VirtualMemory capability | Stage 5 | mmap equivalent for Go’s memory allocator and GC |
| ThreadControl capability | Extends Stage 5 | settls equivalent before full in-process threads |
| Thread lifecycle | Extends Stage 5 | Implemented substrate for multiple execution contexts per process; Go integration remains |
ParkSpace capability | Extends Stage 5 | Go runtime synchronization through compact park/unpark |
VirtualMemory Capability
This is the biggest new kernel primitive. Go’s allocator requires:
- Reserve large virtual ranges without committing physical memory (Go reserves 256 TB of virtual space on 64-bit systems)
- Commit pages within reserved ranges (back with physical frames)
- Decommit pages (release frames, keep virtual range reserved)
- Set permissions (RW for data, none for committed inaccessible pages; pure guard pages should stay reserved but uncommitted)
The existing page table code (kernel/src/mem/paging.rs) supports mapping
and unmapping individual pages. It needs to be extended with:
- Virtual range reservation (mark ranges as reserved in some bitmap/tree)
- Lazy commit (map as
PROT_NONEinitially, page fault handler commits on demand — or explicit commit via cap call) - Permission changes on existing mappings
The concrete ABI for the first explicit-commit path is in
Go VirtualMemory Contract. It
chooses explicit commit/decommit before demand paging, permits
VM_PROT_NONE through reservation metadata plus non-present user PTEs, and
requires separate virtual-reservation and physical-commit quota ledgers.
Committed VM_PROT_NONE intentionally retains allocated frames and page
contents for later protection restore. Pure guard pages should use reserved
uncommitted pages so they consume virtual quota but no physical commit budget.
Thread Support
Extending the process model (kernel/src/process.rs) now follows the contract
in In-Process Threading. See the
SMP proposal for the PerCpu struct layout (per-CPU
kernel stack, saved registers, FS base); Thread extends this for
multi-thread-per-process. See also the In-Process Threading section in
Roadmap for the roadmap-level view.
#![allow(unused)]
fn main() {
struct Process {
pid: u64,
address_space: AddressSpace, // shared by all threads
caps: CapTable, // shared by all threads
threads: Vec<Thread>,
}
struct Thread {
tid: u64,
state: ThreadState,
kernel_stack: VirtAddr,
saved_regs: RegisterState, // rsp, rip, etc.
fs_base: u64, // for thread-local storage
}
}
The scheduler (Stage 5) schedules threads, not processes. Each thread gets its own kernel stack and register save area. Context switch saves/restores thread state. Page table switch only happens when switching between threads of different processes.
Alternative: Go via WASI
For comparison, the WASI path from the userspace-binaries proposal:
| Native GOOS | WASI | |
|---|---|---|
| Performance | Native speed | ~2-5x overhead (wasm interpreter/JIT) |
| Go compatibility | Full (after Phase 3) | Limited (WASI Go support is experimental) |
| Goroutines | Real M:N scheduling | Single-threaded (WASI has no threads yet) |
| Net I/O | Native async via poller | Blocking only (WASI sockets are sync) |
| Kernel work | VirtualMemory, threads, park | None (wasm runtime handles it) |
| Go runtime fork | Yes (maintain a fork) | No (upstream GOOS=wasip1) |
| GC | Full concurrent GC | Conservative GC (wasm has no stack scanning) |
| Maintenance burden | High (track Go releases) | Low (upstream supported) |
WASI is easier but limited. Go on WASI (GOOS=wasip1) is officially
supported but experimental — no goroutine parallelism, no async I/O, limited
stdlib. For running CUE (which is CPU-bound evaluation, no I/O, single
goroutine), WASI might be sufficient.
Native GOOS is harder but complete. Full Go with goroutines, concurrent
GC, network I/O, and the entire stdlib. Required for Go network services
or anything using net/http.
Recommendation: Start with WASI for CUE evaluation. The in-tree path is
WASI Host Adapter Phase W.8 (and Task 9 of
WASI Host Adapter): a CUE
evaluator binary built against TinyGo or upstream Go’s GOOS=wasip1, loaded
through the host adapter against a future ScriptPackage cap. Phase W.8 is
blocked on the same std-userspace decision as W.7 today, but it is the
smaller-step bridge to running Go logic on capOS before the native runtime
port exists. If Go network services or full goroutine/GC semantics become a
goal, invest in the native GOOS=capos track described here; the
Userspace Binaries “Phase W.8” entry keeps
both paths sequenced from the language-track view.
Relationship to Other Proposals
- Userspace Binaries — owns the overall
language-runtime track. This proposal adds concrete Go implementation
details to the future “Future: Go (
GOOS=capos)” branch there. The POSIX compatibility adapter is not sufficient for native Go because Go does not use libc on Linux; it makes raw syscalls. The GOOS approach bypasses POSIX entirely. The same userspace-binaries doc tracks Phase W.8 as the Go-on-WASI interim path. - Programming Languages — the matrix entry
for Go points here for the native track and to the WASI host adapter’s
Phase W.8 for the TinyGo /
GOOS=wasip1interim. Any change to the sequencing between native Go and Go-on-WASI must keep that row in sync. - WASI Host Adapter — Phase W.8 of the
WASI host adapter ships a TinyGo or upstream Go
GOOS=wasip1CUE evaluator binary that runs inside the in-tree wasmi-backed host. That slice is blocked on the same std-userspace decision as W.7 today and bridges to the native Go track described here once it lands. The detailed plan lives in WASI Host Adapter Task 9. - Service Architecture — Go services participate in the capability graph like any other process. The Go net poller (Phase 3) uses TcpSocket/UdpSocket caps from the network stack.
- Storage and Naming — Go’s
os.Open()/os.Read()map to Namespace + Store caps via the GOOS file I/O implementation. Go doesn’t use POSIX for this — it has its ownruntime/os_capos.gowith direct cap invocations. - SMP — later multi-core scaling for Go after Phase 2. The first Phase 2 target is single-CPU in-process threads plus parking; per-CPU scheduling belongs to the later SMP milestone.
Open Questions
-
Fork maintenance. A
GOOS=caposfork must track upstream Go releases. How much drift is acceptable? Could the capOS-specific code eventually be upstreamed (like Fuchsia’s was)? -
CGo support. Go’s FFI to C (
cgo) requires a C toolchain and dynamic linking. Should capOS support cgo, or is pure Go sufficient? CUE doesn’t use cgo, but some Go libraries do. -
GOROOT on capOS. Go programs expect
$GOROOT/libat runtime for some stdlib features. Where does this live on capOS? In the Store? Baked into the binary via static compilation? -
Go module proxy.
go getneeds HTTP access. On capOS, this would use aFetchcap. But cross-compilation on the host is more practical than building Go on capOS itself. -
Debugging. Go’s
runtime/debugandpprofexpect signals and/procaccess. What debugging capabilities should capOS expose? -
GC tuning. Go’s GC is tuned for Linux’s mmap semantics (decommit is cheap, virtual space is nearly free). capOS’s VirtualMemory cap needs to match these assumptions or the GC will need retuning. The first matching point is the reserve/commit/decommit contract in Go VirtualMemory Contract.
Estimated Scope
| Phase | New kernel code | Go runtime changes | Dependencies |
|---|---|---|---|
| Phase 1: Minimal GOOS | ~200 (VirtualMemory cap) | ~2000-3000 | Stages 4-5 |
| Phase 2: Threading | ~500 (threads, park) | ~500 | In-process threading/park (7.1/7.2) |
| Phase 3: Net poller | ~100 (async notification) | ~300 | Networking, Stage 6 |
| Phase 4: CUE on capOS | 0 | 0 | Phase 1 (or WASI) |
| Total | ~800 | ~2800-3800 |
Plus ongoing maintenance to track Go upstream releases.
Proposal: Lua Scripting
How capOS should add Lua as a small capability-aware scripting environment without turning scripts into ambiently privileged shell fragments.
Problem
capOS needs a lightweight scripting path for operator workflows, demos, service glue, and eventually interactive shell automation. The native shell already exposes typed capabilities and explicit child grants, but a shell REPL is not a full programming language. Lua is attractive because it is small, embeddable, and designed to let a host provide the domain API.
The risk is predictable: “system scripting” often becomes an escape hatch
around the operating system model. A script runner that receives broad
ProcessSpawner, BootPackage, filesystem, network, or terminal authority
and then exposes io, os, package.loadlib, or raw handle integers would
recreate the ambient authority capOS is trying to avoid.
The target is not “make Lua root.” The target is:
- Lua as ordinary userspace code.
- Capabilities as the only authority.
- Host-provided Lua libraries that map to typed capOS interfaces.
- Exact grants for script processes, with no default filesystem, network, process, terminal, or debug authority.
Scope
In scope:
- A
capos-luauserspace runner for trusted operator and service scripts. - A small Lua host API over
capos-rttyped clients. - A policy for standard Lua libraries on capOS.
- Script packaging and shell launch shape.
- Validation through QEMU scripts that prove granted and ungranted paths.
Out of scope for the first implementation:
- LuaJIT.
- Dynamic native Lua C modules.
- A POSIX-compatible Lua environment.
- Treating in-process Lua sandboxing as the isolation boundary for hostile scripts.
- Kernel awareness of Lua.
Current Manual Pages
- Programming Languages is the language-status
index. The Lua row tracks the in-tree
demos/lua-smoke/runner against the Rust, Python, Go, C/C++, WASI, and POSIX adapter rows and is the page to update whenever the runtime label or phase status changes. - Userspace Runtime documents the
implemented
capos-rtsurface (entry, allocator, syscall, CapSet lookup, typedConsoleClient/TimerClient/VirtualMemoryClient) that the Lua runner consumes today throughhost::Host::register_console,register_timer, andregister_memory. Any new Lua binding starts by identifying the matching typed client on this page, not by reaching into raw ring SQEs or method IDs. - Shell proposal defines the spawn-plan shape that the
shell uses to launch ordinary userspace processes with exact grants. The Lua
runner is a launched workload in that model, not a shell-embedded
interpreter; future
lua scripts/admin/inspect.lua with { ... }sugar must desugar to the same explicit spawn plan rather than inheriting the shell’s current CapSet. - Userspace Binaries proposal owns the
userspace runtime, language-support, and compatibility-adapter plan that the
Lua runner sits inside. Its “Future: Lua” section names this proposal as the
authoritative design for
capos-lua, and the Lua runner must keep matching its rules for unforgeable capability userdata, exact grants, curated standard libraries, no raw CapIds, and the C/libcapos dependency for the upstream PUC Lua port.
Research Grounding
Relevant research:
- Capability research survey: keep typed Cap’n Proto interfaces as the permission boundary and avoid parallel rights flags.
- Genode: route service access structurally; sessions are typed and resource-accounted.
- Plan 9 and Inferno: per-process namespaces are useful precedent, but capOS should not turn scripts into path-global clients.
- EROS, CapROS, and Coyotos: confinement depends on constructing the subject with only the capabilities it may use.
- seL4: keep the privileged kernel surface small and let userspace policy build higher-level systems.
External Lua references:
- The official Lua 5.5 manual describes Lua as an embeddable C library with a host program that registers C functions callable from Lua.
- The official Lua version history says
Lua 5.5.0 was released on 2025-12-22, while Lua 5.4.8 is the current 5.4
bug-fix release from 2025-06-04. It also says different
x.yversions have different APIs and virtual machines, and precompiled chunks are not portable between versions. - The official Lua 5.5 readme
says Lua is distributed as pure ISO C and normally builds into
lua,luac, andliblua.a. That makes Lua a plausible native port once capOS has the C userspace andlibcapossubstrate; it does not make Lua runnable on today’s no-std Rust-only userspace by itself.
Rust implementation candidates checked:
- mlua is a mature Rust binding layer for
PUC Lua, LuaJIT, and Luau. It is not a pure-Rust VM. Its
vendoredpath still builds C/C++ Lua-family sources throughmlua-sys,cc, andlua-src/luajit-src, and the public crate usesstd,libc,parking_lot, panic catching, and host linker/module assumptions. It is a useful API reference, but it does not avoid the native C/libcaposport. - piccolo is the only inspected pure-Rust
implementation that looks like a credible capOS bootstrap candidate. It has
a stackless VM, fuel-based stepping, memory tracking through
gc-arena, safe userdata downcasting, and most core language behavior. The current crate is stillstd-based, depends onanyhow,thiserror,rand,ahash, and a git-pinnedgc-arena, and its built-in I/O path writes to host stdout. Porting it to capOS would require ano_std + allocfork plus host-library replacement, but that is likely less work than bringing up C Lua beforelibcapos. - silt-lua, hematita, and luar were also inspected. They are pure Rust in varying degrees, but their own READMEs/code show early, incomplete, or CLI-oriented implementations. They are not good foundations for capOS runtime work today.
Design Principles
-
Lua is not a kernel feature. The kernel sees a normal process with a CapSet and a capability ring.
-
The runner’s CapSet is the authority. Script text, module names, global variables, and Lua tables are data. They cannot create authority.
-
In-process sandboxing is defense in depth, not confinement. A trusted service may embed Lua for local configuration or small trusted extensions. Untrusted user scripts must run in a separate process with a narrow CapSet, quotas, and no access to the host service’s private caps.
-
The standard libraries are curated. Base, coroutine, table, string, math, and utf8 are reasonable starting points.
io,os,package,debug, dynamic loading, and process execution are absent by default or replaced by capOS-specific libraries backed by explicit caps. -
No raw CapIds in Lua. A Lua capability value is host-owned userdata with a hidden metatable. Scripts can call methods exposed by the wrapper, but they cannot forge a handle by guessing an integer.
-
Lua version is part of the runtime contract. Precompiled chunks, language behavior, and C API details are series-specific. capOS should pin the runner to a declared Lua series and expose that in manifests and smoke output.
-
C module loading waits. Dynamic native modules need loader, linker, symbol, and authority policy. The first runner should statically link the selected Lua implementation and capOS host libraries.
Architecture
flowchart TD
Shell[capos-shell] --> Launcher[RestrictedLauncher]
Launcher --> Runner[capos-lua process]
Runner --> Lua[PUC Lua VM]
Runner --> Rt[capos-rt / libcapos host API]
Rt --> Ring[capability ring]
Ring --> Kernel[kernel CapObject dispatch]
Ring --> Services[userspace services]
ScriptPkg[ScriptPackage or Namespace cap] --> Runner
Terminal[TerminalSession cap] --> Runner
OtherCaps[Exact service caps] --> Runner
capos-lua is just another binary launched by the shell or init-owned
service graph, matching the “language runtime as ordinary process” rule from
Userspace Binaries. The parent chooses the
script source and the exact caps. The runner creates one Lua state, installs
selected libraries, wraps granted caps as userdata, loads the script with a
controlled environment, executes it in protected mode, flushes queued
releases, and exits with a normal process status.
The initial implementation should be a standalone runner, not Lua embedded in
capos-shell. Keeping the runner as a child process prevents script bugs,
Lua VM bugs, and accidental infinite loops from corrupting the interactive
shell state. It also gives QEMU smokes a clear process boundary to inspect.
Version Choice
Use PUC Lua, not LuaJIT, for the first runner.
As of 2026-05-13, Lua 5.5.0 (released 2025-12-22) is still the current upstream series and Lua 5.4.8 (released 2025-06-04) is still the latest 5.4 bug-fix release. Lua 5.5 has features that fit capOS scripting: explicit global declarations, compact arrays, and static fixed binaries. It is the right default target for new capOS-native scripts.
Keep a narrow compatibility option open for Lua 5.4.8 if imported scripts or libraries require it. Do not mix bytecode or native modules between Lua series. A script package should declare:
language = "lua"
series = "5.5"
entry = "main.lua"
Source scripts are preferable to precompiled chunks for reviewability. If precompiled chunks are allowed later, they must be tied to the exact runtime series and treated as trusted build inputs.
There is one practical sequencing exception: a piccolo-based
capos-lua-smoke may be the fastest way to prove the capOS host API before C
userspace support exists. That should be treated as an implementation
bootstrap, not as a promise of exact PUC Lua compatibility. If capOS takes that
route, the smoke should declare the runtime as piccolo rather than lua-5.5.
Host API
The first host API should be explicit and boring:
local capos = require("capos")
local terminal = capos.require_cap("terminal", "TerminalSession")
terminal:write_line("hello from Lua")
local now = capos.require_cap("timer", "Timer"):now()
terminal:write_line("now_ns=" .. tostring(now))
capos.require_cap(name, interface) looks up a bootstrap cap by manifest name
and checks the expected interface metadata before returning userdata. It fails
closed if the cap is absent or has the wrong interface.
Generated or handwritten bindings should expose method names, not method
numbers. The binding owns Cap’n Proto serialization through capos-rt or
libcapos; scripts should not construct raw SQEs, raw method IDs, transfer
descriptors, or cap_enter calls.
Transferred result caps become owned Lua userdata. Release is deterministic when possible:
do
local h <close> = launcher:spawn({
name = "child",
binary = "timer-smoke",
grants = { terminal = terminal },
})
local code = h:wait()
end
Finalizers may queue cleanup, but they are not the primary lifetime contract. The runner must flush owned-handle releases at script return and process exit.
Standard Library Policy
Initial allowed libraries:
| Library | Policy |
|---|---|
base | Load selected safe functions. load is allowed only with text mode and a supplied environment. |
coroutine | Allowed for cooperative script structure. It does not map to OS threads. |
table, string, math, utf8 | Allowed. |
debug | Denied by default. It pierces ordinary Lua abstraction and should require an explicit developer-profile cap. |
io | Denied by default. Replace with capos wrappers over TerminalSession, future File, ByteStream, or Namespace caps. |
os | Denied by default. Replace time, exit, and process operations with cap-backed methods. |
package | Restricted. require searches a script package or namespace cap, not host paths or environment variables. |
| dynamic C modules | Denied until native module loading has a reviewed authority model. |
Lua _ENV is useful for presenting a small global namespace, but it is not a
security boundary by itself. The security boundary is the process plus its
CapSet.
Script Sources
The current ProcessSpawner.spawn shape names a binary and grants caps; it
does not yet pass arbitrary argument vectors or script blobs. That creates an
implementation dependency for useful Lua scripting.
Near-term options, in order:
-
Smoke-only compiled script:
capos-lua-smokestatically embeds one script string in.rodataand proves the host API. This is not the general product, but it verifies the Lua VM, allocator, CapSet lookup, and terminal output without new startup ABI. -
Runner config cap: init or the shell grants a read-only
ScriptPackageorConfigBlobcap tocapos-lua. The runner asks that cap formain.luaand module bytes. This keeps script data out of the kernel and fits the existing capability model. -
Storage-backed scripts: after Store/Namespace exists, scripts live under a granted namespace.
requiresearches only that namespace and only through a read-only script-package view unless the script also receives a writable namespace cap.
Do not add a Lua-specific boot manifest field or kernel cap. Script packaging belongs to init, shell, storage, or a userspace package service.
Shell Integration
The launch shape comes from the Shell proposal; Lua adds no new spawn primitive. The shell should treat Lua as a launched workload:
run "capos-lua" with {
terminal: @terminal
timer: @timer
scripts: @home.sub("scripts/admin")
}
Later, the shell can add sugar such as:
lua scripts/admin/inspect.lua with { terminal: @terminal, timer: @timer }
That sugar must compile to the same explicit spawn plan. There is no implicit inheritance of the shell’s full current CapSet.
Agent mode can also use Lua, but Lua should be a tool target rather than the model itself. The agent runner may advertise “run this approved Lua script” as a consent-gated tool. The model still does not receive session caps.
Adventure Game Use
The adventure game is a good later demonstration target because it needs both strict authority and authorable behavior. The kernel and service capabilities still enforce authority; Lua should only express deterministic scenario logic over the caps granted to the script runner.
Suitable Lua-owned behavior:
- mission beat selection,
- deterministic NPC dialogue state machines,
- quest-board text,
- hint selection,
- debrief variants,
- scripted reactions that call typed game APIs through granted object caps.
Unsuitable Lua-owned behavior:
- deciding whether a player has authority,
- mutating relic custody without a typed service call,
- applying combat damage outside the game service,
- minting or transferring caps,
- holding broad spawn, debug, filesystem, or network authority by default.
The useful proof is language independence: a Rust adventure service and a Lua scenario script should both demonstrate proper capability use, including bounded failures when a script lacks a required cap.
Blocking, Async, and Coroutines
The first runner can use synchronous typed client calls over the existing single-owner ring client. A blocking Lua method blocks the runner process, which is acceptable for the first operator-script use case.
Coroutines provide script-local cooperative structure, not OS scheduling. A future runtime reactor can resume Lua coroutines when capability completions arrive, but that should wait until the capOS runtime has a general demux path for threaded and async clients. Do not design Lua-specific CQ demultiplexing.
Security Model
Threat boundaries:
- Script source is untrusted input until parsed and loaded in protected mode.
- Script packages are trusted build or storage inputs only when their source, digest, author, and runtime series are review-visible.
- The Lua VM is not trusted to confine hostile code inside a privileged host process.
- Capability wrappers must validate method parameters, buffer sizes, transfer counts, and result-cap interface IDs before translating Lua values into ring calls.
- Terminal and audit output must not print secrets. Lua error rendering should use bounded messages and avoid dumping arbitrary cap userdata internals.
Default deny list for untrusted scripts:
- no
debug, - no dynamic module loading,
- no raw
os/io, - no broad
ProcessSpawner, - no broad network manager,
- no boot package,
- no mutable namespace unless that is the explicit script purpose,
- no host environment variables.
Quotas matter. The first useful quota is process memory. CPU budgets, timer budgets, and capability-call quotas should follow the normal capOS scheduling and resource-accounting path rather than special Lua hooks.
Implementation Phases
Phase 0: Contract and Host Surface (in tree)
- Proposal landed and
docs/programming-languages.mdrecords the Phase 0 status. - Initial runtime label is
capos-lua-subset, notlua-5.x. Bytecode portability is explicitly out of scope. - Phase 0 ships a tiny hand-written tree-walking interpreter under
demos/lua-smoke/that exists to validate the long-term capability-aware host API design without committing capOS to a particular Lua dialect. Piccolo was investigated and not adopted: upstream does not compile no_std and the swap surface (anyhow, thiserror, std::io, std::sync, ahash::RandomState entropy) is large enough that the maintenance cost of a fork was judged to outweigh the benefit at this stage. The hand-written interpreter is replaced or kept as a research-grade sandbox once the C/libcapos PUC port lands. - Host surface in tree:
- typed userdata over
capos-rt::ConsoleClientandcapos-rt::TimerClient, obj:method(args)dispatch throughhost::Host::call_method,- errors flow back as Lua runtime errors via
EvalError::Lua, never Rust panics on script-controlled inputs, - bounded execution via a per-run step counter (
MAX_STEPS).
- typed userdata over
- Future Phase 0 items (still open):
- generalised
capos.require_caplookup, capos.interfacesreflection for typed errors,- owned-cap release semantics for granted result handles.
- generalised
Phase 1: Native Runner Smoke (in tree)
demos/lua-smoke/builds ascapos-demo-lua-smoke, gets embedded insystem-lua-smoke.cue, and runs undermake run-lua-smokewith QEMU’sisa-debug-exitto gate cleanly on script success or failure.- The smoke loads no Lua standard library at all (no
io,os,package,debug,string,table,math); the only callable surface is the typed cap bindings registered inhost::Host::register_*. - Iteration L.1 (
2026-05-04 18:42 EEST, merge050ac735) shipped the initialconsole:write_lineandtimer:nowbindings. - Iteration L.2 (
2026-05-05 19:30 UTC) added the third host binding,memory, wrappingcapos-rt::VirtualMemoryClient. The Lua surface ismemory:alloc(size) -> userdata,memory:write(buf, off, byte),memory:read(buf, off) -> int,memory:size(buf) -> int. The host binding owns the kernel-mapped address and the page-aligned size; the Lua side only ever sees an opaque userdata id and the byte values that came back through the typed binding. Eachread/writeis bounds- checked host-side before the single-bytevolatile_*access. Per-call (MAX_MEMORY_ALLOC_BYTES = 64 KiB), aggregate (MAX_MEMORY_TOTAL_BYTES = 256 KiB), and buffer-count (MAX_MEMORY_BUFFERS = 64) ceilings rejected as typed Lua errors keep hostile scripts from exhausting the per-process virtual-memory quota before the kernel does. The smoke proof lines ([lua-smoke] memory:alloc size=4096,[lua-smoke] memory roundtrip 65,66,67,[lua-smoke] memory sum=198) are gated bytools/qemu-lua-smoke.sh. - Iteration L.3 (
2026-05-13 09:28 EEST, commit430ccd0e) added deterministicmemory:release(buf)for the same smoke-only host binding. The host callsVirtualMemory.unmapwith the exact mapped(addr, size)pair stored for the opaque buffer userdata, marks that buffer dead after the unmap succeeds, credits the live byte budget, and rejects laterread,write,size, orreleasecalls on that stale userdata as Lua runtime errors. The proof line[lua-smoke] memory:release size=4096is gated bytools/qemu-lua-smoke.sh. This remains language-support behavior only: Lua receives no broader memory authority, raw address, raw cap id, or new kernel behavior. - Expected QEMU output is asserted by
tools/qemu-lua-smoke.sh: smoke produces[lua-smoke] hello from lua-smoke v0, anelapsed_ns=measurement throughtimer:now, the L.2 memory round-trip lines, the L.3 release line, and a[lua-smoke] script okproof line; init exits viaexitWhenServiceExits. - Future Phase 1 items (still open):
- typed wrong-interface and missing-cap failure modes returned as Lua runtime errors,
- explicit denied-API proof (currently denied by construction because no Lua stdlib is loaded at all),
TerminalSession.writeLineparity in addition to the currentConsole.writeLinebinding,- the next typed cap binding (process spawning or endpoint IPC).
Phase 2: Script Package Input
- Add a userspace-owned script source cap or startup-config path.
- Let shell/init launch
capos-luawith a selected package and exact grants. - Implement restricted
requireover the package. - Add QEMU proof for a granted
TerminalSessioncall and a denied ungranted cap lookup.
Phase 3: Generated Capability Bindings
- Generate Lua binding metadata from
schema/capos.capnpor from the same interface registry used by the native shell. - Expose method names and structured params/results.
- Add transfer-result cap adoption and deterministic release tests.
- Keep raw Cap’n Proto builders out of script code unless a separate developer diagnostic cap grants that power.
Phase 4: Shell and Service Use
- Add shell sugar for script execution after the exact spawn plan exists.
- Permit trusted services to embed Lua only when they can prove the embedded state holds no extra authority beyond what the script should use.
- Add audit records for script launch, script package digest, grants, exit status, and authority-touching cap calls when audit caps are available.
Validation
The first implementation is not complete until it has QEMU evidence:
- A Lua script prints through a granted
TerminalSession. - The same script cannot use
io,os.execute,debug, or an ungranted cap. - A missing or wrong-interface cap lookup returns a bounded Lua error.
- An owned result cap is released deterministically.
- The runner exits cleanly and does not wedge the shell.
Host tests should cover Lua value conversion and binding generation once those pieces are pure enough to test outside QEMU. Do not claim “Lua scripting works” from host tests alone; the useful behavior is authority-shaped process execution in capOS.
Open Questions
- Whether the initial implementation should wait for
libcaposC support or use a temporary Rust Lua VM to prove the host API earlier. - The exact startup-config mechanism for selecting
main.luabefore storage and general process arguments exist. - Whether Lua 5.5 should be the only supported series or whether a 5.4 runner is worth carrying for ecosystem compatibility.
- How much schema reflection the Lua binding should expose before the native shell’s generic call surface lands.
- Which audit fields belong in
AuditLogonce script launch becomes an operator workflow rather than a smoke.
Proposal: WASI Host Adapter
How capOS should host WebAssembly modules through the WebAssembly System Interface, without recreating ambient authority and without committing to a runtime that the userspace baseline cannot support today.
Problem
WASI is the natural sandboxed-execution path for capOS:
- It is already designed to remove ambient authority. Preview 1 requires preopens — every file descriptor a module sees was granted by the host at startup. Preview 2 makes typed handles first-class through the Component Model.
- A single host adapter unlocks every language with a useful WASI target:
Rust, C/C++, Go (
GOOS=wasip1), TinyGo, Python, Zig, AssemblyScript, any interpreter compiled to wasm. - Wasm linear-memory bounds checks plus capability scoping give defence in depth for untrusted plugins and third-party code without weakening the capOS isolation model.
The risk pattern is the same as POSIX: a host adapter that grants ambient authority would erase the property that makes WASI worth doing. Every WASI import must be backed by a typed capability the host process already holds. If the host does not hold the cap, the module cannot reach it.
WASI is not a substitute for native ports of languages that need real OS threads, full asynchronous I/O, signals, or large POSIX surfaces. Those remain the native runtime tracks. WASI is the right tool for sandboxing untrusted plugins, third-party scripts, isolated workloads, CPU-bound portable tools, and language ecosystems whose native capOS port has not yet been built.
Scope
In scope:
- A
capos-wasmuserspace host adapter built oncapos-rt. - A WASI Preview 1 surface whose imports map 1:1 to typed capOS capabilities.
- Per-instance CapSet projection: each module sees only the caps the host grants for that instance.
- Phase decomposition that picks one runtime for v0, lets later phases migrate to the Component Model and richer runtimes, and stays explicitly outside ambient authority.
- Validation through QEMU smokes that prove granted and ungranted paths.
Out of scope for the first implementation:
wasi-threads(requiresshared-memory+atomics+bulk-memory).fork()-shaped semantics. Cannot clone wasm linear memory; same constraint as the browser-wasm proposal.- Synchronous signal delivery inside a wasm module. Fuel exhaustion plus host-driven termination are the only deterministic interruptions.
- File-backed
MAP_SHAREDmmap. - Treating the wasm sandbox as the only isolation boundary for hostile modules — the capOS process boundary remains the primary boundary.
- A custom non-portable WIT dialect with externref-typed cap handles. This proposal explicitly defers richer cap handles to Component Model resources (Phase W.7).
Current Manual Pages
- Programming Languages summarizes WASI’s current status relative to Rust, Python, Go, C/C++, Lua, and POSIX adapter tracks.
- Userspace Binaries Part 5 sketches the WASI host adapter at a higher level. This proposal supersedes that sketch with a full design surface; the userspace-binaries proposal continues to own the broader native-binary, language, and POSIX-adapter roadmap.
- Userspace Runtime documents the
implemented
capos-rtsurface that the host adapter consumes. - Browser/WASM covers the separate browser-hosted wasm experiment. The two proposals share wasm-runtime insight but target different substrates: WASI host adapter runs on capOS hardware; the browser proposal runs capOS concepts in a browser tab.
- Lua Scripting covers a similar capability-scoped script runner shape; the WASI track is the untrusted / portable counterpart to that proposal’s trusted native runner.
- Go Runtime covers the native
GOOS=caposalternative to Go-on-WASI.
Research Grounding
Relevant research and external references:
- WASI Preview 2 launch — Bytecode Alliance, “WASI 0.2 Launched”.
- Component Model status — eunomia, “WASI and the WebAssembly Component Model: Current Status”.
- WIT resources / portable plugins — Medium, “WASI 2.0 Components: Portable, Fast Plugins”.
- Externref design — Bytecode Alliance, “WebAssembly Reference Types in Wasmtime”.
- Rust target stabilization — Rust Blog, “Changes to Rust’s WASI targets” and “wasm32-wasip2 Tier 2”.
- TinyGo WASI — TinyGo WASI guide, wasmCloud, “Compile Go directly to WebAssembly components with TinyGo and WASI P2”.
- Runtime survey — Wasmi v0.32 release notes, arXiv 2404.12621 “Research on WebAssembly Runtimes”, Colin Breck, “Choosing a WebAssembly Run-Time”.
- Runtime repos — wasmi, WAMR, wasmtime, wasm3, wasmer.
In-tree references: this proposal lifts the capability-mapping table from
docs/proposals/userspace-binaries-proposal.md Part 5 and the runtime
survey/phase decomposition shape from comparable language-runtime planning
work; concrete repo evidence appears inline below.
Design Principles
- WASI is not a kernel feature. The kernel sees a normal userspace
process with a CapSet and a capability ring. The host adapter is one
of many
capos-rt-based binaries. - The host adapter’s CapSet is the authority. WASI module bytes are data. They cannot create authority. Every import is satisfied by a cap the host already holds; absent caps are refused, not synthesised.
- Per-instance CapSets are subsets, not supersets. Each loaded module gets only the caps the manifest grants for that instance. The host’s own CapSet may be larger; the module never sees the parent.
- The wasm sandbox is defence in depth, not the isolation boundary.
The capOS process boundary remains primary. Wasm bounds checking and
immutable
Modulevalidation add a second software-enforced boundary inside the host process so an entire untrusted module image can be confined. - Schema-first capability mapping. Each WASI function is backed by a typed capability, not by emulated POSIX semantics. POSIX-shaped integer fds in Preview 1 are a Preview 1 ABI requirement, not a capability model concession.
- Pick portable WASI, skip non-portable extensions. Custom imports
with
externref-typed cap handles would lock capOS into a non-portable WIT dialect that no other host implements. The Component Model’s typed resources are the right answer for first-class typed cap handles in wasm; defer to that path rather than inventing a one-vendor dialect. - Fail closed. Any unimplemented WASI call returns
ERRNO_NOSYS. Any cap lookup that fails returns the appropriate Preview 1 errno (ERRNO_BADF,ERRNO_ACCES,ERRNO_NOSYS). Modules cannot probe absent caps for ambient behavior.
Architecture
flowchart TD
Manifest[boot manifest:<br/>system-wasm-host.cue] --> Host[capos-wasm process]
Host --> Runtime[wasm runtime<br/>wasmi v0]
Host --> Rt[capos-rt typed clients]
Rt --> Ring[capability ring]
Ring --> Kernel[kernel CapObject dispatch]
Ring --> Services[userspace services]
Runtime --> Module[wasm module instance]
Module --> Imports{WASI imports}
Imports --> FdTable[per-instance fd table /<br/>Preview 2 resource handles]
FdTable --> Caps[granted typed caps]
Caps --> Rt
capos-wasm is one userspace process. It hosts one or more wasm module
instances. The runtime engine (wasmi for v0; see Runtime Selection below)
is linked into that process. WASI imports are resolved by the host
adapter’s import-resolver module against typed capOS clients. Each instance
has its own per-instance fd table (Preview 1) or resource bundle
(Preview 2) populated from the manifest grants for that instance.
The runtime exposes only what the host process can fulfil. If the host
does not hold an EntropySource cap, random_get returns ERRNO_NOSYS.
If the manifest did not grant a home namespace, the module’s preopen
table does not contain it and path_open("/home/...") resolves to
nothing.
Runtime Selection
For v0 (Phases W.1 through W.6), use wasmi. For W.7+, evaluate
migration to wasmtime when capOS userspace gains std support and a
futures executor, or to WAMR if minimal footprint becomes the dominant
constraint and the C build path lands.
| Constraint | wasmi | WAMR | wasm3 | wasmtime |
|---|---|---|---|---|
| Pure Rust, drops into capOS workspace | yes | C (needs cc/build glue, no libcapos yet) | C (same problem) | yes |
no_std + alloc | yes, advertised explicitly | partial (embedded, libc-shaped) | yes (bare metal) | no (needs std and a futures executor) |
| License | Apache-2.0 / MIT | Apache-2.0 with LLVM exception | MIT | Apache-2.0 |
| Footprint | small register-based bytecode (v0.32 5x speedup) | ~29 KB AOT, ~58 KB interpreter | ~64 KB code, ~10 KB RAM | large (Cranelift JIT) |
| Sandboxing | wasm spec + execution-engine isolation | wasm spec + AOT validation | wasm spec | wasm spec + Cranelift verifier |
| Fuel/gas metering | yes, built-in | not advertised | yes | yes |
| Capability transfer | externref since 0.24; component model on roadmap | reference types yes; component model partial | partial reference types | full component model (best-in-class) |
| WASI versions | preview1 stable; preview2 on roadmap | preview1 stable; preview2 partial | preview1 partial | preview1 + preview2 + components |
| Host function interface | mirrors wasmtime API | C API; Rust through wamr-rust-sdk | C API | Rust + C |
| Maintenance | wasmi-labs, two security audits (2023, 2024) | Bytecode Alliance, TSC-governed | maintainer in minimal-maintenance phase | Bytecode Alliance flagship |
| Threading | not in current scope | yes (wasi-threads) | no | yes |
Why wasmi for v0:
- Pure Rust drops directly into the capOS workspace. No C build chain
required — the same chain
libcaposdoes not yet provide. - Genuine
no_std + allocsupport means no host-side OS abstraction is required for the runtime itself; it sits cleanly oncapos-rt. - Built-in fuel metering matches capOS’s preference for explicit resource accounting.
externrefsupport is sufficient for any future v1 capability-handle experiment that does not block on the Component Model.- Mirroring the wasmtime API means that migrating to wasmtime in W.7 is rewiring imports, not rewriting host calls.
Not chosen for v0:
- wasmtime needs
stduserspace and a futures executor. capOS userspace isno_std + alloctoday; this is the same blocker that keeps the Rustcapnp-rpccrate (v0.25) offcapos-rtand queues the remote-session-client capnp-rpc rewrite behind an async runtime decision. - wasm3 is in maintainer-declared minimal-maintenance phase; not a good fit for a long-horizon capOS substrate.
- wasmer has similar weight to wasmtime and does not align as cleanly with the Bytecode Alliance Preview 2 trajectory.
- WAMR is a strong candidate when a C toolchain and
libcaposexist and minimal footprint is the goal. It is the migration target for high-density wasm hosting later, but it is not the v0 baseline because the C substrate is not in tree.
WASI Version Stance
- Preview 1 for v0 (Phases W.1 through W.6). POSIX-shaped,
file-descriptor-based, C-friendly. Tier 2 in upstream Rust since 1.78
(May 2024); supported by Go 1.21+ (
GOOS=wasip1 GOARCH=wasm), TinyGo, Clang--target=wasm32-wasi, Zig. This is the immediate unlock. - Preview 2 / Component Model for W.7+. Resources are first-class
typed handles. They are the natural mapping for capOS capabilities —
closer in shape to
OwnedCapability<T>than to integer fds. WIT interfaces let cap-aware Rust crates export typed APIs that a wasm component on capOS or a native capOS service can consume the same way it consumes a capnp interface.
Skipping Preview 1 entirely and starting at Preview 2 is possible with wasmtime today, but harder with wasmi; doing so would push the entire v0 unlock behind the std-userspace decision. The Preview 1 first / Preview 2 later sequencing is the smaller-step path to running C, Rust, Go, Python, TinyGo on capOS.
Capability Mapping Surface
Preview 1: per-import mapping
Each Preview 1 import is backed by a typed capOS capability the host adapter already holds. POSIX inherits ambient authority through global path namespaces, integer fds, and a process credential table; WASI removes that by requiring preopens, and capOS pushes it further by requiring an explicit per-import cap mapping in the host adapter.
| WASI preview1 import | capOS host-adapter implementation |
|---|---|
args_get / args_sizes_get | Read from a future capOS LaunchParameters cap or per-instance arena. Empty by default until that surface lands. |
environ_get / environ_sizes_get | Read from a KeyValueScope / ConfigOverlay cap when one exists; empty by default. Open question §6. |
clock_time_get(MONOTONIC) | Timer.now() over the host’s TimerClient. |
clock_time_get(REALTIME) | Future wall-clock cap; until then return ERRNO_NOSYS or ERRNO_INVAL. |
proc_exit(code) | Map to a host-internal “instance exited with code” status. The host process does not exit; the wasm instance does. |
random_get | The kernel EntropySource cap (the in-tree CSPRNG capability; see schema/capos.capnp interface EntropySource and KernelCapSource::EntropySource). Refuse with ERRNO_NOSYS when the host adapter was not granted entropy authority. |
fd_write(1, ...) / fd_write(2, ...) | Pre-opened fd 1 to host’s Console / TerminalSession write path; fd 2 to same or a separate log cap if granted. |
fd_read(0, ...) | Pre-opened fd 0 to a granted TerminalSession or future StdIO input cap if available; else ERRNO_BADF. No bare in-tree StdinReader cap exists today; non-terminal stdin requires a future input cap. |
path_open(preopened_dir_fd, path, ...) | Resolve path inside the Namespace cap mounted as that preopen, then open through the namespace’s Store / File capability. |
fd_read / fd_write on opened files | Translate to the typed File capability behind the host-side fd table entry. |
fd_close | Drop the typed cap handle (release-on-drop in capos-rt). |
fd_seek / fd_tell / fd_filestat_get | Methods on the File cap. |
fd_prestat_get / fd_prestat_dir_name | Enumerate the host adapter’s preopened-directory table built from manifest grants. |
sock_send / sock_recv / sock_shutdown | Translate to typed TcpSocket / UdpSocket cap calls. |
poll_oneoff | Multiplex over the host’s capability ring; CQEs are the event source. Open question §3. |
fd_advise / fd_allocate / fd_renumber | Stub or ERRNO_NOSYS until needed. |
sched_yield | No-op or single-tick yield through the runtime’s scheduler. |
Preview 2: WIT-resource mapping
When the host adapter migrates to Preview 2 (Phase W.7+), the imports become typed capOS capabilities directly through WIT resources:
| WIT package / interface | capOS host-side cap |
|---|---|
wasi:io/streams (input-stream, output-stream resources) | Wrap one capOS cap per stream (Console / TerminalSession / File / TcpSocket). The resource handle in wasm corresponds 1:1 to a host-side OwnedCapability<T>. |
wasi:filesystem/types (descriptor resource) | One OwnedCapability<File> or OwnedCapability<Directory> per descriptor. Preopened dirs become resource handles passed at instantiation. |
wasi:clocks/{monotonic-clock,wall-clock} | Timer / future wall-clock cap. |
wasi:random/{random,insecure} | EntropySource cap. |
wasi:sockets/tcp (tcp-socket resource) | TcpSocket cap. |
wasi:cli/{stdin,stdout,stderr,environment,exit} | Per-instance CapSet projection. |
wasi:http/incoming-handler / outgoing-handler | Match capOS HttpEndpoint / Fetch (drafted in service-architecture-proposal.md). |
Components in the same store can pass resources to other components; the host mediates the move. This maps directly to capOS capability transfer semantics — the same shape as the kernel’s result-cap insertion for typed cap returns from a CALL.
Capability Handle Path in the Module
How a wasm module receives and refers to a capOS capability is one of the load-bearing design questions. Three options:
- Preview 1 + integer fds, host-side fd table only (recommended for v0).
All caps live in the host process. The module sees integer fds. The
host adapter maps fds to
OwnedCapability<T>slots in its own per- instance table. Works with every existingwasip1binary unchanged. A wasm module cannot pass a typed cap to another wasm module without going through the host. - Custom
externrefimport (alternative; not recommended). Requires thereference-typesproposal (supported by wasmi >=0.24, wasmtime, wasmer; partial in wasm3). The host adapter exports custom imports likecap_call_refthat take anexternreftyped handle. This is non-standard and locks capOS into a one-vendor WIT dialect that no other host implements; it would also delay Preview 2 adoption because the dialect would need its own mapping policy. - Preview 2 / Component Model resources (target for W.7+).
Resources in the Component Model are unforgeable typed handles.
Components that import
wasi:filesystem/types.descriptorreceive a handle that is the host-sideOwnedCapability<File>. Components can pass resources to other components in the same store; the host mediates. Direct match to capOS capability transfer semantics.
Recommendation: ship Preview 1 + integer fds for v0; defer rich
typed-cap-in-module support to Preview 2 in W.7. Skip the externref
custom-import path entirely.
Per-Instance vs Per-Process Model
Two reasonable shapes:
- One wasm instance per
capos-wasmprocess (recommended for v0). Faults are isolated at the capOS process boundary. Fuel and budget enforcement are per-process and use the existing capOS resource accounting. Manifest-grant shape stays simple: each manifest entry names one binary and one cap bundle. - Many instances per
capos-wasmprocess (alternative). Better density. Suits hosting many small modules (plugin systems, embedded scripts). Adds host-side scheduling concerns: a runaway instance can starve siblings; fuel/budget enforcement now has to demultiplex; thepoll_oneoffreactor question becomes load-bearing.
Recommendation: one instance per process for v0. Revisit when instance count actually matters. The capOS process boundary is already a strong isolation primitive; trading it away for density before density is needed adds complexity for no v0 unlock.
Per-Instance CapSet Plumbing
Each loaded module gets a per-instance capability bundle. The host adapter receives manifest grants and projects them onto WASI imports.
The shape needs to land alongside argv/env passing — argv for wasm
modules has the same lifecycle question as argv for native processes.
When a future capOS LaunchParameters surface lands it becomes the
canonical source for both argv and env. Until then, a small bounded
text grant in the host adapter manifest is acceptable for v0
(Open Question §6 / §7).
Sketch of the manifest shape (pre-LaunchParameters):
wasm_host: {
binary: "thing.wasm"
args: ["--input", "data"]
caps: {
console: @console
timer: @timer
random: @random
// preopen 3 → home namespace; preopen 4 → tmp namespace, etc.
preopens: [
{ fd: 3, namespace: @home_namespace, name: "/home" }
{ fd: 4, namespace: @tmp_namespace, name: "/tmp" }
]
}
}
Same authority model the rest of capOS uses: every cap the module sees is named in the manifest and granted by the parent. The wasm sandbox is defence in depth on top of capability scoping, not a replacement.
Trust Boundaries
| Boundary | Native capOS service | WASI host adapter + module |
|---|---|---|
| Authority source | Process CapSet | Host CapSet then per-instance subset |
| Memory isolation | Page tables | Wasm linear-memory bounds-check plus page tables (host process) |
| Code integrity | W^X + NX | Wasm module validation plus immutable WebAssembly.Module |
| Cap forgery | Kernel-owned CapTable | Host-owned per-instance fd table or resource-handle table; module sees opaque ints/handles only |
| Resource limits | Kernel quotas | Wasm fuel + memory cap + host-side per-instance time/byte budgets |
| Side channels | Hardware-level (Spectre etc.) | Same hardware level, plus wasm-specific (e.g. timer resolution) |
Wasm does not weaken capOS isolation; it adds a second software-enforced boundary that contains an entire untrusted module image. This is exactly the property that makes WASI a good fit for plugin and script loading.
What WASI Does Not Solve
fork(): cannot clone wasm linear memory mid-execution. Same reason the browser-wasm proposal documents. POSIX programs that fork-then-exec must useposix_spawn-shaped equivalents, or the host adapter must spawn a new wasm instance.- Synchronous signals: no preemption inside a wasm module without cooperative yield points or interrupted execution. Fuel exhaustion is the only deterministic interruption; gross preemption is “host kills the instance”. Acceptable for plugins.
- Threads without
wasm-threads: requiresshared-memory+atomics+bulk-memoryfeatures and a runtime that supports them. Out of scope for v0. - Live
mmapof files: wasm linear memory is not file-backed. Workable only for small read-or-write cycles.
Phase Decomposition
Smallest reviewable slices ordered by dependency. Each phase is independently demoable and gates the next.
Phase W.0 — Decision and host runtime selection (planning)
- Decide runtime: wasmi vs WAMR (recommendation above).
- Land this proposal and the matching
docs/tasks/task record for the first WASI host-adapter slice. - Resolve cross-cutting open questions §1, §3, §6, §7, and §8 below (the §8 vendoring posture decision gates the W.1 scaffold layout).
Deliverable: agreed proposal plus dispatchable task record. No code.
Phase W.1 — capos-wasm host process scaffold (no WASI yet)
Status: host-runtime scaffold landed 2026-05-05 19:12 UTC. Manifest
and make run-wasm-host smoke moved into Phase W.2 (see Status note
below).
- New crate
capos-wasm/— userspace process built oncapos-rt. - Vendor the chosen runtime (wasmi recommended; one local cargo dep
patched for
no_std + allocif needed). - Host process can
WebAssembly.compile(bytes)theninstantiate(no imports)then run an empty_start. No imports resolved yet. - Manifest: new
system-wasm-host.cueboots one host process with one embedded.wasmblob (the smoke binary). - Smoke:
make run-wasm-hostboots, host loads the empty blob, prints[wasm-host] empty module instantiated and exited, host exits cleanly.
Status note (revised 2026-05-06 20:19 UTC): the v0 W.1 slice landed
only the host-runtime substrate — the capos-wasm/ standalone crate,
the vendored vendor/wasmi-no_std/wasmi-1.0.9/ snapshot, and the make capos-wasm-build target — without a wasm-host binary,
system-wasm-host.cue manifest, or make run-wasm-host smoke. The
binary/manifest/smoke trio was rolled into Phase W.2 and landed there
in W.2 sub-slice 1 (2026-05-06 20:19 UTC) using an inline 8-byte empty
wasm module as the payload. Earlier drafts of this status note worried
about re-cutting the same host binary twice (once empty, once with a
Preview 1 surface) and proposed deferring the empty-module smoke until
“hello, wasi” was ready; the actual outcome went the other way: the
empty-module regression is its own slice that exercises wasmi’s
Module::new + Linker::instantiate_and_start end-to-end on capOS,
and later W.2 sub-slices extend the same binary in place with the
Preview 1 import resolver and language-level smokes.
Deliverable: a wasm runtime crate compiles and links inside the
capOS userspace no_std + alloc build. No imports, no host functions,
no WASI. Validates the runtime crate works in no_std + alloc
userspace and that the vendored wasmi snapshot exposes Engine and
Store<HostState> to a future host binary.
Validation: make capos-wasm-build succeeds against
targets/x86_64-unknown-capos.json with no_std + alloc; make fmt-check and the host test gates remain green; the kernel and other
userspace crates are untouched (no kernel surface, no
schema/capos.capnp change, no init/ change).
Phase W.2 — WASI Preview 1 stdout-only
Inherits from W.1: the wasm-host binary, system-wasm-host.cue
manifest, and make run-wasm-host smoke originally listed under W.1
land here in sub-slice 1, so the same binary that future sub-slices
extend with the Preview 1 import surface also provides the
empty-instantiation smoke.
The phase is landing in four sub-slices, not one big drop, to keep
each diff reviewable. random_get production wiring stays owned by
Phase W.4 (entropy + clocks production-ready); W.2 leaves it stubbed
as ERRNO_NOSYS:
-
W.2 sub-slice 1 (landed): wasm-host binary,
system-wasm-host.cueempty-instantiation manifest,make run-wasm-hostsmoke, and the one-time userspace ABI bump (USER_STACK_BASE etc.) that wasmi’s ~3 MiB BSS forced. -
W.2 sub-slice 2 (landed 2026-05-07 08:03 UTC): Preview 1 stdout-only imports (args/environ as empty,
clock_time_get(MONOTONIC),proc_exit,fd_write(1,…)/fd_write(2,…)); everything else stubs asERRNO_NOSYSincludingrandom_get(Phase W.4 promotes that to production). The wasm-host smoke now drives a 114-byte hand-encoded probe module that callsrandom_get, stores the returned errno in an exported global, and refuses to print thenosys=52proof line unless it equalsERRNO_NOSYS. -
W.2 sub-slice 3 (landed 2026-05-07 09:36 UTC): Rust
hello, wasismoke (demos/wasi-hello-rust/,system-wasi-hello-rust.cue,make run-wasi-hello-rust). The wasm-host binary now optionally reads aBootPackagecap, walks the manifest’sbinaries[]for thewasi-payloadentry, instantiates it through the same Preview 1 linker, and explicitly invokes the_startexport (wasmi’sinstantiate_and_startruns the WebAssemblystartsection, NOT WASI’s_start). The sub-slice 1+2 regression keeps running first; the existingmake run-wasm-hostsmoke continues to pass because it does not grantboot. -
W.2 sub-slice 4 (landed 2026-05-07 10:53 UTC): C
hello, wasismoke (demos/wasi-hello-c/,system-wasi-hello-c.cue,make run-wasi-hello-c). The wasm-host payload-load path landed in sub-slice 3 carries the C.wasmpayload too — sub-slice 4 only added the C toolchain wiring (system clang-18 with--target=wasm32-wasi --sysroot=/usragainst the Ubuntu wasi-libc +libclang-rt-18-dev-wasm32packages), the second manifest, the matching smoke harness, and these closeout stamps. Phase W.2 is done. -
W.2 sub-slice 1 (landed 2026-05-06 20:19 UTC): the wasm-host userspace binary,
system-wasm-host.cueempty-instantiation manifest,tools/qemu-wasm-host-smoke.shassertion harness, and the userspace-image budget bump that wasmi’s ~3 MiB BSS requires. USER_STACK_BASE moved from 0x60_0000 to 0x100_0000 incapos-config/src/process_layout.rs; RING_VADDR (capos-config/src/ring.rs) and CAPSET_VADDR (capos-config/src/capset.rs) shifted in lockstep, and every linker.ld assertion (init/,capos-rt/,demos/,shell/,capos-wasm/) and thesystem-spawn.cuestack-overlap-elf fixture were updated to match. No Preview 1 imports yet — the binary instantiates the inline 8-byte empty wasm module and exits cleanly through the existing capos-rt entrypoint. -
W.2 sub-slice 3 (landed 2026-05-07 09:36 UTC) and W.2 sub-slice 4 (landed 2026-05-07 10:53 UTC): language-level Rust + C
hello, wasismokes plus the manifest-payload load path on the wasm-host binary. Phase W.2 is closed by sub-slice 4.
Sub-slice 1 (landed) delivered:
- The
wasm-hostuserspace binary built on the W.1 scaffold, instantiating an inline 8-byte empty wasm module throughwasmi::Linker::instantiate_and_start. - Manifest
system-wasm-host.cue(empty-instantiation regression). - Smoke
make run-wasm-host(asserted bytools/qemu-wasm-host-smoke.sh).
Sub-slice 2 (landed) delivered:
capos-wasm/src/wasi/preview1.rsPreview 1 import resolver on top of the existing wasm-host binary, registering 46wasi_snapshot_preview1imports against a fixed-aritywasmi::Linker<HostState>.- Implemented surface:
args_get,args_sizes_get,environ_get,environ_sizes_get(all return zero counts / empty buffers);clock_time_get(CLOCKID_MONOTONIC)via the host’sTimerClient(CLOCKID_REALTIMEreturnsERRNO_NOSYSuntil a wall-clock cap exists);proc_exitviacapos_rt::syscall::exit;fd_write(1, …)andfd_write(2, …)via the host’sConsole.writebyte path with a fixed 4 KiB scratch ceiling (oversize total →ERRNO_INVAL); all other Preview 1 imports stubbed asERRNO_NOSYS(includingrandom_get, which Phase W.4 promotes againstEntropySource). - Manifest update (
system-wasm-host.cuenow grants Console + Timer) and smoke harness update (tools/qemu-wasm-host-smoke.shasserts the new[wasm-host] preview1 imports linked: ...; nosys=52proof line in addition to the empty-instantiation regression). - Probe-driven evidence: a 114-byte hand-encoded probe module imports
random_get, calls it once at instantiation, stores the returned errno in an exported global, and the host refuses to print the proof line unless the global reads back asERRNO_NOSYS = 52.
Sub-slice 3 (landed 2026-05-07 09:36 UTC) delivered:
demos/wasi-hello-rust/standalone crate built against the upstreamwasm32-wasip1target. Source is a singleprintln!; the producedhello.wasm(~40 KiB) importsenviron_get,environ_sizes_get,fd_write, andproc_exitfromwasi_snapshot_preview1, all of which the sub-slice 2 resolver already implements.capos_wasm::payloadhelper module: streams the capnp-encodedSystemManifestblob throughBootPackage.readManifestChunk(4 KiB chunks) and walksbinaries[]via raw capnp readers to return the bytes for a named entry. The wasm-host binary calls this only when the manifest grants the optionalboot(BootPackage) cap, so the sub-slice 1+2make run-wasm-hostsmoke – which does not grantboot– keeps passing unchanged.system-wasi-hello-rust.cuemanifest: lists the wasm-host ELF and thewasi-payloadblob, grants Console + Timer + BootPackage to the wasm-host, and reuses the sharedcue/defaultspackage.tools/qemu-wasi-hello-rust-smoke.shsmoke harness: asserts the existing sub-slice 1 + 2 proof lines, the newHello from WASI on capOSpayload stdout (the load-bearing evidence), and the clean process/scheduler exit pair. The wasm-host payload-stage proof line is not asserted because wasi-libc’s_startis allowed to terminate viaproc_exitfrom inside the Preview 1 import handler, in which case the host process exits before the wasm-host can print its post-_start proof line.make wasi-hello-rust-buildcargo wrapper that clearsRUSTFLAGS/CARGO_ENCODED_RUSTFLAGSso the kernel-target rustflags pinned in the repo.cargo/config.tomldo not leak into the wasm build.- capos-rt re-export additions:
capos_capnpanddefault_reader_optionsare now reachable fromcapos_rt::*socapos-wasmkeeps a single direct path-dep on capos-rt and the vendored wasmi tree (addingcapos-configdirectly to capos-wasm triggered an unrelated cargo workspace-inheritance error against the vendored wasmi atvendor/wasmi-no_std/wasmi-1.0.9/).
Sub-slice 4 (landed 2026-05-07 10:53 UTC) delivered:
demos/wasi-hello-c/standalone C smoke (NOT a Cargo crate; built directly with system clang-18 + lld via the Makefilewasi-hello-c-buildtarget). Source is a singleprintf("Hello, wasi from capOS C\n")main()compiled with--target=wasm32-wasi --sysroot=/usragainst the Ubuntu wasi-libc +libclang-rt-18-dev-wasm32apt packages; the producedhello-c.wasm(~46 KiB) imports five functions fromwasi_snapshot_preview1:fd_close,fd_fdstat_get,fd_seek,fd_write, andproc_exit.fd_writeandproc_exitreach the host’s granted Console cap and the clean capos-rt exit path implemented in sub-slice 2;fd_close,fd_fdstat_get, andfd_seekreturnERRNO_NOSYS = 52from the same sub-slice 2 stub surface, which is sufficient for wasi-libc’s stdout-only path.system-wasi-hello-c.cuemanifest: same shape assystem-wasi-hello-rust.cue, lists the wasm-host ELF and thewasi-payloadblob, grants Console + Timer + BootPackage to the wasm-host, and reuses the sharedcue/defaultspackage.tools/qemu-wasi-hello-c-smoke.shsmoke harness: asserts the existing sub-slice 1 + 2 proof lines, the newHello, wasi from capOS Cpayload stdout (the load-bearing evidence), and the clean process/scheduler exit pair.make wasi-hello-c-buildtarget that runs system clang withRUSTFLAGS/CARGO_ENCODED_RUSTFLAGScleared (matching thewasi-hello-rust-buildshape so the two flows stay symmetric).- No host-side change to
capos-wasm/: the manifest-payload load path landed in sub-slice 3 carries the C.wasmpayload through the same wasm-host binary unchanged.
Deliverable: the first WASI-hosted, sandboxed portable-payload
language path lands on capOS. Both Rust (wasm32-wasip1) and C
(wasm32-wasi) hello, wasi payloads run inside the wasmi
interpreter under the wasm-host capOS process and reach the host’s
granted Console cap through Preview 1 fd_write. Native C already
boots through the libcapos C-substrate (make run-c-hello) and the
POSIX adapter (make run-posix-dns-smoke); this phase specifically
adds the WASI-hosted path – in particular, C runs on capOS through
the WASI surface without requiring any libcapos/POSIX work in
tree, because the wasm-host’s host-side imports cover everything
the wasi-libc stdout-only path needs.
Phase W.2 closed 2026-05-07 10:53 UTC. Phase W.3 closed 2026-05-07 18:25 UTC. Phase W.4 closed 2026-05-07 20:09 UTC.
Phase W.3 — Per-instance CapSet plumbing + LaunchParameters
Status: landed 2026-05-07 18:25 UTC. Per-instance CapSet selection
keeps using the existing manifest cap-grant block on
initConfig.init.caps (no new cap needed for the v0 argv path); the
new surface is the bounded-text argv grant on
initConfig.init.wasiArgs. The wasm-host pulls it out of the
manifest blob through its already-granted BootPackage cap, validates
it against the bounds in capos-wasm/src/payload.rs
(WASI_ARGS_MAX_COUNT = 32, WASI_ARGS_MAX_ARG_BYTES = 4096,
WASI_ARGS_MAX_TOTAL_BYTES = 8192), packs it into a per-instance
HostState argv buffer, and reflects it back through Preview 1
args_get / args_sizes_get. A 2026-05-13 successor mirrors the
same bounded-text pattern for environment variables through
initConfig.init.wasiEnv, validated against
WASI_ENV_MAX_COUNT = 32, WASI_ENV_MAX_ENTRY_BYTES = 4096, and
WASI_ENV_MAX_TOTAL_BYTES = 8192, with interior NULs rejected before
the payload instantiates. Open Question §5 / §6 / §7 status is
recorded in the section below; a future capOS LaunchParameters cap
is still the migration path for argv and environment together.
- Per-instance CapSet selection: keeps using the manifest-defined
cap-grant block (
initConfig.init.caps) the W.2 sub-slice 3 / 4 smokes already exercised. Phase W.3 does not add a new cap; it adds thewasiArgsbounded-text grant alongside the cap list. Future phases (W.4 entropy, W.5 namespaces, W.6 sockets) will extend the samecapsblock with their respective surfaces. - Bounded-text argv grant:
initConfig.init.wasiArgsis a CUE text list. Schema/schema/capos.capnpis unchanged becauseinitConfigis alreadyCueValueand unknown sub-fields underinitConfig.initare ignored by the existing manifest decoder. The wasm-host walks the field directly through raw capnp readers incapos-wasm/src/payload.rs::read_wasi_args. An absent or emptywasiArgskeeps the W.2 “no argv” behaviour (args_sizes_getreports zero,args_getwrites nothing) so the existingmake run-wasm-host,make run-wasi-hello-rust, andmake run-wasi-hello-csmokes stay unchanged. - Bounded-text environment grant:
initConfig.init.wasiEnvis a CUE text list of entries such asKEY=value. It uses the same raw capnp reader path aswasiArgs, the same no-schema-changeinitConfigCueValueextension point, and the same empty-by- default behavior: absent or emptywasiEnvmakesenviron_sizes_getreport zero andenviron_getwrite nothing. Oversized entry count, oversized individual entries, oversized packed total bytes, and interior NUL bytes make wasm-host abort with stable exit codes rather than truncating or corrupting the WASI Preview 1 NUL-terminated layout. - Migration to a future
LaunchParameterscap: when capOS gains a capability-shapedLaunchParameterssurface (the same one envisioned bydocs/proposals/userspace-binaries-proposal.mdPart 5 and the future shell launch flow), the wasm-host will swapread_wasi_argsfor a typedLaunchParametersClientlookup and the manifest-sidewasiArgsfield becomes redundant. The bounds constants stay relevant either way (a typedLaunchParameterscap will still need byte ceilings before it ships argv into wasm linear memory). - Smoke:
demos/wasi-cli-args/(Rust,wasm32-wasip1) readsargv[1]and prints it throughprintln!->fd_write(1, …)-> the host’sConsolecap. The harness (tools/qemu-wasi-cli-args-smoke.sh) asserts the existing sub-slice 1 + 2 regression lines plus the load-bearingcapos-wasi-cli-args-sentinelline.
Deliverable: per-instance CapSet selection works (commit landed
2026-05-07 18:25 UTC; smoke make run-wasi-cli-args).
Phase W.4 — WASI Preview 1 random + clocks production-ready
Status: landed 2026-05-07 20:09 UTC. The wasm-host looks up an
optional per-instance EntropySource cap from the CapSet under the
well-known name random. When the manifest grants it, the typed
EntropySourceClient is installed on HostState after the W.2
sub-slice 2 probe regression runs (so the probe’s
random_get(0, 0) call still observes the closed-fail
ERRNO_NOSYS = 52 path byte-identically with the W.2/W.3 proof
line). Preview 1 random_get then drains arbitrary wasm-supplied
byte ranges into the manifest-granted entropy stream by chunking
against the kernel cap’s per-call MAX_ENTROPY_FILL_BYTES = 64
ceiling and walking up to RANDOM_GET_MAX_BYTES = 65_536 total
bytes per Preview 1 invocation. Truncated kernel responses, RDRAND
unavailable status, and any transport-level error surface as
ERRNO_IO; out-of-bounds wasm pointer writes surface as
ERRNO_FAULT; oversized requests surface as ERRNO_INVAL. The
ungranted-variant manifest still routes Preview 1 random_get
through the no-grant refusal branch which never enters the kernel,
so an instance without an EntropySource grant cannot leak
entropy.
- Wire the kernel
EntropySourcecap (the in-tree CSPRNG capability; seeEntropySourceClientandKernelCapSource::EntropySource) through the host adapter as the backing forrandom_get. The same cap is the natural future analogue of the browser’scrypto.getRandomValuessurface. - Wall-clock support stays deferred until capOS has a typed
WallClock/RealTimeClockcap.clock_time_get(CLOCKID_REALTIME)keeps returning the W.2 sub-slice 2 sentinelERRNO_NOSYSso a Preview 1 guest can distinguish “host refused” from a kernel / transport failure; future phases promote it once the wall-clock cap lands. The monotonic clock keeps using the manifest-grantedTimercap unchanged. - Smoke:
demos/wasi-random/(Rust,wasm32-wasip1) reads N=64 bytes via a raw Preview 1 import binding (avoiding wasi-libc’s panic-on-errno wrapper so the ungranted-variant payload can print a refusal sentinel and exit with code 52 rather than aborting). The granted-variant smoke (make run-wasi-random/tools/qemu-wasi-random-smoke.sh) asserts the W.2 sub-slice 1 + 2 regression proof lines, the load-bearing[wasi-random] entropy_bytes=64 entropy_bound_ok=trueline, and a clean exit; the ungranted-variant smoke (make run-wasi-random-ungranted/tools/qemu-wasi-random-ungranted-smoke.sh) asserts the same regression lines plus the load-bearing[wasi-random] random_get returned errno=52 (ENOSYS)refusal sentinel and refuses the granted-variant entropy line.
Deliverable: Preview 1 random_get is wired to the kernel
EntropySource cap with the closed-fail refusal contract, the
clock_time_get(REALTIME) deferral is documented, and the
ungranted-variant smoke proves both. A 2026-05-13 compatibility slice
also promotes authority-free Preview 1 imports that need no new cap:
clock_res_get(CLOCKID_MONOTONIC) returns the monotonic nanosecond
resolution, sched_yield returns success as a no-op, fd_fdstat_get
for stdout/stderr returns character-device write metadata, and
fd_seek for stdout/stderr returns ERRNO_SPIPE. The direct-import
make run-wasi-stdio-fd smoke requires all promoted imports to return
non-ERRNO_NOSYS results. The remaining non-filesystem / non-socket
Preview 1 imports that still return ERRNO_NOSYS – poll_oneoff,
proc_raise, fd operations that need file or close-state authority,
and the path_* paths – stay future work; promoting each to
“honest” needs either the typed capability it would route through
(for example a WallClock / RealTimeClock cap for REALTIME or
namespace/file caps for storage fds and paths) or an explicit decision
to keep the NOSYS refusal as the v0 honest behaviour. Phase W.4 closed
2026-05-07 20:09 UTC.
Harness-hardening landed on 2026-05-13: make run-wasi-preview1-refusals boots a direct-import payload that calls
representative blocked filesystem/socket imports with no
Namespace/File/Store/socket authority in the manifest and requires each
return to equal ERRNO_NOSYS = 52. The initial slice (2026-05-13
08:50 UTC) covered path_open, fd_prestat_get, fd_read,
sock_send, sock_recv; a follow-up (2026-05-13 21:15 UTC) extended
the harness to also cover fd_pread, fd_pwrite,
path_create_directory, and sock_shutdown, bringing the total to nine
covered imports. As each filesystem import gains a real implementation its
no-preopen errno migrates from ERRNO_NOSYS = 52 to ERRNO_BADF = 8
(path_open / fd_prestat_get / fd_read with Phase W.5;
path_create_directory on 2026-05-24 10:09 UTC; fd_pread / fd_pwrite
when positional I/O landed – see below); the harness asserts the
current errno per import rather than a blanket NOSYS. Only the socket
imports (sock_send / sock_recv / sock_shutdown) still return
ERRNO_NOSYS = 52. This records fail-closed evidence for the current
surface only; it does not implement W.6 behavior.
Phase W.5 — WASI Preview 1 filesystem (landed 2026-05-17 05:42 UTC)
- Map preopened-dir fds to a manifest-granted root
Directorycap from the per-instance CapSet. The v0 surface ships a single preopen at fd 3 named/preopen-0; the manifest CapSet slot name isroot(matching the POSIX adapter P1.4 Slice 4 bootstrap).Namespace/Storeintegration is deferred until a use case requires the content-addressed pseudo-fs shape – the kernel caps remain available for a future slice (storage Phase 3 slice 3 landed them). - Implement
path_open,fd_read,fd_write,fd_seek,fd_close,fd_filestat_get,fd_prestat_get, andfd_prestat_dir_nameagainst the kernelDirectory/Filecap interface incapos-wasm/src/wasi/fs.rs. The resolver mirrors POSIX P1.4 Slice 4 (libcapos-posix/src/path.rs): non-leaf segments walkDirectory.sub; the leaf mints either an existing or freshly createdFileviaDirectory.open(flags=CREATE|TRUNCATE). - Preview 1 base and inheriting rights are stored in the host fd table.
The single preopen advertises only implemented directory/path rights and
inheritable File rights;
path_openrefuses requested base or inheriting rights outside the preopen’s inheriting set, and opened File fds retain exactly the requested rights.fd_fdstat_getreports those stored rights, andfd_fdstat_set_rightscan only attenuate them.fd_read,fd_write,fd_pread,fd_pwrite,fd_seek,fd_tell,fd_filestat_get, andfd_filestat_set_sizecheck the stored File rights before constructing aFileClient;path_create_directory,path_remove_directory,path_unlink_file,path_filestat_get,fd_readdir, and preopenfd_filestat_getcheck the preopen rights before constructing aDirectoryClientor resolving the path. - WASI
fd_closeonly releases the local cap-table slot. The kernel-sideFile.close()would invalidate theArc<FileCap>that the parentDirectoryholds keyed by entry name, breaking re-open of the same path; WASI semantics expectfd_closeto release the per-process fd without deleting the underlying file. Newpath_opencalls for the same path mint a fresh local handle against the same kernel-side entry. - Preopen sandbox: the resolver refuses absolute paths (leading
/) and parent-escape segments (..,.) withERRNO_NOTCAPABLE = 76. The single preopen has no parent reachable through any path syntax. - The
make run-wasi-fssmoke (system-wasi-fs.cue,demos/wasi-fs/,tools/qemu-wasi-fs-smoke.sh) completes a fullpath_open(CREAT+TRUNC)/fd_write/fd_close/ re-open /fd_filestat_get/fd_seek/fd_readround trip, asserts both the absolute-path refusal and the parent-escape refusal, and proves narrowed File/preopen rights fail closed withERRNO_NOTCAPABLEbefore the underlying File/Directory client call. Themake run-wasi-preview1-refusalssmoke continues to prove the fail-closed contract for an ungranted manifest:path_open(3, ...),fd_prestat_get(3), andfd_read(3, ...)now returnERRNO_BADF = 8(no preopen) instead of the pre-W.5 stubERRNO_NOSYS = 52(path_create_directoryjoined this BADF group 2026-05-24 10:09 UTC, andfd_pread/fd_pwritejoined when positional I/O landed – see below); only the socket imports continue to returnERRNO_NOSYS. - Kernel authority surface landed 2026-05-14 (RAM-backed
File,Directory,Store, andNamespacekernel caps with QEMU smokesrun-file-server-smoke,run-directory-server-smoke,run-store-namespace-smoke). W.5 wires the wasm-host adapter to theDirectory/Filesubset of that authority;Store/Namespaceintegration is deferred until a use case requires it. fd_readdirlanded 2026-05-24 08:44 UTC over the existing preopenDirectorycap (DirectoryClient::list– no schema or generated-bindings change).fs::fd_readdir_implenumerates the preopen, rejecting open file fds withERRNO_NOTDIR = 54and unknown fds withERRNO_BADF = 8;preview1::fd_readdirserializes the fixed 24-byte little-endian Preview 1direntrecords (d_next, zerod_ino,d_namlen,d_typefromDirEntry.is_dir) followed by name bytes, with cookie-based resume and a short-buffer truncation contract that never writes pastbuf_len. Themake run-wasi-fssmoke now also enumerates thesmoke.txtit created (readdir_found_smoke=true) and proves the short-buffer truncation.fd_tellandfd_filestat_set_sizelanded 2026-05-24 09:34 UTC, completing the File-cap method triad (no schema or generated-bindings change –File.truncatealready shipped).fs::fd_tell_implis a pure host-side read of the maintainedFileEntry::position(symmetric withfd_seek’s SET/CUR branches);fs::fd_filestat_set_size_implcallsFileClient::truncate_waitand leaves the file offset unchanged per the WASI contract.preview1::fd_tellreturnsERRNO_SPIPE = 70on a stdio fd (mirroringfd_seek) and writes the position as LE-u64;preview1::fd_filestat_set_sizerejects a negativesizewithERRNO_INVAL = 28and maps non-file fds toERRNO_BADF = 8. Themake run-wasi-fssmoke now assertsfd_tellreports the post-write position (tell_ok=true) andfd_filestat_set_sizeshrinks the file (truncate_size=4), plus the stdio refusals for both imports.path_create_directoryandpath_remove_directorylanded 2026-05-24 10:09 UTC over the preopenDirectorycap (DirectoryClient::mkdir/remove– no schema or generated-bindings change;Directory.mkdir/removealready shipped).fs::path_create_directory_impl/path_remove_directory_implreuse thepath_openresolve-parent-and-leaf path and the same preopen sandbox, so absolute paths and..segments are refused withERRNO_NOTCAPABLE = 76before any kernel call; themkdirresult-cap (a freshDirectoryhandle the WASI layer does not retain) is released immediately to avoid leaking a cap-table slot. Themake run-wasi-fssmoke now createssubdir, confirms it viafd_readdir(directoryd_type), removes it, confirms it is gone, and asserts the directory-write sandbox refusals (mkdir_ok=true rmdir_ok=true dir_escape_refused=true). Implementingpath_create_directorymoves its no-preopen errno fromERRNO_NOSYS = 52toERRNO_BADF = 8(the base-fd preopen lookup precedes the path), so themake run-wasi-preview1-refusalsharness now asserts it in the BADF group.fd_preadandfd_pwritelanded 2026-05-30 14:49 UTC as positional I/O over the hostFilecap (no schema or generated-bindings change – the kernelFile.read/File.writemethods already carry an explicit byte offset, andfd_read/fd_writealready drive them).fs::fd_pread_impl/fs::fd_pwrite_implmirrorfd_read_impl/fd_write_file_implbut use the WASI-suppliedoffsetand, per the WASI Preview 1 contract, leaveFileEntry::positionuntouched – the defining positional-I/O invariant.preview1::fd_pread/fd_pwritereuse the same guest-memory iovec gather/scatter helpersfd_read/fd_writewere refactored onto (one walker, not two), reject a negativeoffsetwithERRNO_INVAL = 28, and returnERRNO_SPIPE = 70on a stdio fd (mirroringfd_seek/fd_tell). Themake run-wasi-fssmoke now writes “ABCD” at offset 2, reads it back at offset 2, and asserts the fd’s stream position is unchanged (pwrite_pread_ok=true pos_unchanged=true), that a negative offset is refused (pread_neg_offset_inval=true), and that a stdio fd surfaces a non-ERRNO_NOSYSerror (ppos_stdio_refused=true). Themake run-wasi-preview1-refusalsharness moves both imports into the BADF group (fd 3 is a bad descriptor against an absent preopen).path_filestat_getandpath_unlink_filelanded 2026-05-30 as path-resolved metadata/removal over the hostFile/Directorycaps (no schema / generated-bindings change).fs::path_filestat_get_implresolves the leaf under the preopen, opens a transient read-onlyFile(flags = 0), runsFile.stat, and releases the transient cap before returning the size;fs::path_unlink_file_impldeletes the named entry throughDirectory.remove(the same void-result oppath_remove_directoryuses, which removes file leaves). Both enforce the absolute/..ERRNO_NOTCAPABLEsandbox inresolve_parent_and_leafbefore any kernel call;preview1::path_filestat_getaccepts and ignores thelookupflagssymlink-follow bit (no symlinks in v0) and writes the 64-byte filestat viawrite_filestat. Themake run-wasi-fssmoke statssmoke.txtby path (size 4, regular-file type) and unlinks it, andmake run-wasi-preview1-refusalsmoves both imports into the BADF group. The remainingERRNO_NOSYSreturns are the deliberately deferred surfaces (fd_advise,fd_allocate, the sync family, the path timestamp/symlink/link family (path_filestat_set_times,path_symlink,path_readlink,path_link,path_rename),poll_oneoff,proc_raise, and the W.6-blocked socket family).
Deliverable: a wasm module can read and write files inside a preopened capOS directory.
Phase W.6 — WASI Preview 1 sockets (gated on userspace network stack)
sock_send,sock_recv, etc. againstTcpSocket/UdpSocketcaps when the userspace network stack lands.- Until then, an HTTP client over
Fetch/HttpEndpointis a reasonable shim for HTTP-only use. make run-wasi-preview1-refusalsproves representative socket imports (sock_send,sock_recv,sock_shutdown) fail closed withERRNO_NOSYS = 52when no socket cap is present. This is current refusal evidence only; W.6 remains blocked until the networking authority exists.
Deliverable: a wasm module can serve HTTP requests inside a capOS process.
Phase W.7 — Move to wasmtime or migrate to WASI Preview 2 / Component Model
- If the runtime selected in W.0 was wasmi, decide whether to swap to
wasmtime once
std/futures runtime is available in capOS userspace. - Or instead promote wasmi to wasip2 / Component Model support (wasmi roadmap covers components, but maturity is behind wasmtime).
- Map WIT resources to typed
OwnedCapability<T>slots. This is the natural place to bridge capOS capabilities into wasm as first-class typed handles. Capability transfer between wasm components becomes a host-mediated resource handoff. - Component-Model support enables cap-aware Rust crates to export their typed interfaces as WIT, which a Rust capOS service can consume the same way it consumes a capnp interface.
- Schema serial-surface coordination: this phase will likely add new
variants under
schema/capos.capnpfor component-model resource bridging. Serialise with other schema-touching plans (docs/backlog/index.mdConcurrency Notes).
Deliverable: a wasm component on capOS exports a typed interface that a native capOS process can call.
Phase W.8 — TinyGo / Go-on-WASI integration for CUE
- Build a CUE evaluator binary against TinyGo or upstream Go’s
GOOS=wasip1. Run it in the host adapter against a CUE source blob granted as aScriptPackage(future package-cap surface, same shape as the plannedLaunchParameterswork). - Reuses existing CUE workflows; capOS just hosts the evaluator.
Deliverable: capOS can evaluate CUE manifests at runtime
without the host toolchain. Bridges to the eventual native Go track
(go-runtime-proposal.md).
Languages Targeting WASI
What capOS gets “for free” once the host adapter exists, ranked by how mature each language’s WASI target is. This is the leverage argument: one host adapter unlocks every row at once.
| Language | WASI status | Toolchain | Native capOS alternative | When WASI wins |
|---|---|---|---|---|
| Rust | wasm32-wasip1 Tier 2 since 1.78; wasm32-wasip2 Tier 2 since 1.82 | cargo build --target wasm32-wasip2 | targets/x86_64-unknown-capos.json (implemented) | Untrusted Rust plugins. Cross-compiled tools. |
| C / C++ | wasi-libc + Clang --target=wasm32-wasi; wasi-sdk packaged | clang --target=wasm32-wasi | future libcapos | Any C/C++ tool needing portability before libcapos lands. CPython-on-WASI today is the canonical example. |
| Go (upstream) | GOOS=wasip1 since Go 1.21 (Aug 2023). Single-thread, blocking I/O, no goroutine parallelism. | GOOS=wasip1 GOARCH=wasm go build | future GOOS=capos (go-runtime-proposal.md) | CUE evaluation, go run style tools, single-goroutine compute. |
| TinyGo | wasip1 supported; wasip2 supported in dev branch | tinygo build -target=wasip2 | n/a | Smaller Go binaries; Component Model export of typed interfaces. |
| Python (CPython) | wasm32-unknown-wasip1 Tier 2 (PEP 11) | Upstream CPython build | future native CPython through POSIX adapter | Sandboxed Python plugins, configuration scripts. |
| AssemblyScript | Designed for wasm; WASI host integration via runtime | asc | n/a | Lightweight typed scripting. Less interesting on capOS than Lua. |
| Zig | Native wasm32-wasi target; no runtime overhead | zig build-exe -target wasm32-wasi | n/a | Zig systems code in a sandbox. |
| Lua / interpreters in general | A Lua interpreter compiled to wasi runs Lua scripts in a wasm sandbox | Compile any C interpreter to wasm32-wasi | Lua piccolo runner (lua-scripting-proposal.md) | When Lua scripts are untrusted. The piccolo native-Rust runner remains the right answer for trusted capOS scripting. |
| JavaScript | QuickJS-on-wasi works today | Compile QuickJS to wasm32-wasi | QuickJS native runner (future) | Untrusted JS plugins; portable JS without writing a native QuickJS runtime. |
| .NET (mono-wasi) | Experimental | dotnet wasi-experimental | n/a | If a port of a .NET tool is required. Low priority. |
When WASI vs Native
These are complementary tracks, not competitors.
- Native wins for foundational services, performance-critical code, anything calling typed capOS caps directly, anything needing real threads, full async I/O, or first-class participation in the cap graph.
- WASI wins for portability or untrusted code execution, for any
existing C/C++ program with wasi-libc support that cannot wait for
libcapos, for CPU-bound CUE evaluation before native Go lands, and for sandboxed user-submitted scripts.
The browser-wasm proposal captures the same intuition: the cap-ring layer is the only stable interface that survives substrate swaps. The WASI host adapter is another substrate swap, this time at the language level instead of the hardware level.
Validation
The first implementation is not complete until it has QEMU evidence:
- A wasm module prints through a granted
Console/TerminalSession. - The same module cannot use
fd_writeto a fd it was not granted, cannot open a path outside its preopened namespaces, and cannot call an unimplemented WASI function without receivingERRNO_NOSYS. - A missing or wrong-interface cap lookup returns the appropriate WASI errno (not a host-side panic, not silent success).
- An owned result cap is released deterministically when the instance exits.
- The host adapter exits cleanly and does not wedge the kernel.
Host tests should cover WASI value conversion and import-resolver generation once those pieces are pure enough to test outside QEMU. Do not claim “WASI works” from host tests alone; the useful behavior is authority-shaped wasm execution in capOS.
Open Questions
-
Per-instance vs per-process. One wasm instance per
capos-wasmprocess (recommended) or many? Affects fuel/budget enforcement and the manifest shape. Resolved 2026-05-13 16:46 UTC — one wasm instance percapos-wasmprocess. Phases W.2–W.4 shipped on top of this shape:capos_wasm::Runtimeowns exactly onewasmi::Engineand oneStore<HostState>, andHostStateaggregates the per-instanceConsole/Timer/RingClient/ optionalEntropySource/ optionalBootPackageclients plus the per-instanceWasiArgs/WasiEnvbundles. That host state IS the per-instance state; there is no second instance to demultiplex against. The decision aligns with capOS capability discipline: the per-process CapSet is the authority boundary, manifest grants are scoped one binary at a time (docs/capability-model.md), and the capOS process boundary already provides the fault, fuel/budget, and audit isolation a multi-tenant wasm host would otherwise need to rebuild inside the runtime. Preview 2 / Component Model migration in Phase W.7 inherits the same per-process shape — onecapos-wasmprocess per top-level component — and gains nothing from packing many components into one process while the OS-level isolation is free. A future multi-instance host (plugin sandboxes, embedded scripts) is allowed but must come back as a separate proposal that names the density target, the fuel andpoll_oneoffreactor design, and the audit/observability shape; it does not block any current phase. -
Capability handle path: extension import or pure WASI-only? Custom externref imports lock capOS into a non-portable WIT dialect. Working answer: skip the custom-import path entirely; jump straight to Preview 2 / Component Model in Phase W.7.
-
poll_oneoffsemantics over the capOS ring. Block the host process’scap_enter(simple, scales to one instance per process), or run a single-thread reactor that drives multiple instances in round-robin (scales to many instances per process)? Coupled to Q1. Resolved 2026-05-13 16:46 UTC — blockingcap_enteragainst the single per-process instance, with the surface expanded one subscription kind at a time as the underlying caps land. v0 keeps the W.2 sub-slice 2ERRNO_NOSYSstub already incapos-wasm/src/wasi/preview1.rs: there is no portable subset ofpoll_oneoffwe can answer correctly withoutNamespace/File/TcpSocket/UdpSocketcaps, and the existingmake run-wasi-preview1-refusalsharness proves the refusal closes cleanly. Phase W.5 (filesystem) is the first phase that consumes a real subscription kind —eventtype_clockagainst monotonic time pluseventtype_fd_read/eventtype_fd_writeagainst preopen-fdFilehandles — and will implement those subscription kinds by walking the subscription array, demultiplexing each subscription onto a single blockingcap_enterover the per-process ring, and returning the events the kernel completes. Phase W.6 adds the socket subscription kinds againstTcpSocket/UdpSocketonce the userspace network stack lands. A multi-instance reactor stays out of scope: §1 resolves to one wasm instance percapos-wasmprocess, sopoll_oneoffonly ever has to demultiplex one instance’s subscription set, and the kernel ring is already a completion-queue primitive that fits that shape directly. Realtime clock subscriptions remainERRNO_NOSYSuntil a typedWallClockcap exists (same ceiling asclock_time_get(CLOCKID_REALTIME)). -
Fuel budget defaults and exhaustion semantics. wasmi exposes fuel; what is the default budget per instance, and what is the exhaustion behaviour (instance traps and exits, or instance pauses pending refill from a
FuelGrantcap)? Affects the cap surface. Working answer: trap-and-exit default; defer theFuelGrantcap until long-running plugins exist. -
Typed result-cap from a host call into a wasm module. Preview 1 has no
externref. How does the host hand a typed cap back to the instance after a CALL that returns a transferred result cap? Working answer: v0 reifies result caps as integer fds in the per-instance fd table; the host returns fd numbers from capability-issuing imports. Defer typed caps in wasm imports to Preview 2 / Component Model in Phase W.7, where WIT resources match the shape directly. Phase W.3 status (2026-05-07 18:25 UTC): unchanged. W.3 does not introduce any capability-issuing import, so no result-cap reification path landed; the working answer carries forward into W.5 (filesystem) / W.6 (sockets), which are the first phases that will exercise it. -
environ_getsource. Empty-by-default, or backed by aKeyValueScope/ConfigOverlaycap? Resolved by Phase W.3 (2026-05-07 18:25 UTC) and the 2026-05-13 follow-up — bounded manifest-provided text grant, empty when absent. Migration to a futureLaunchParameterscap remains the open path. Original working answer: empty for v0 unless the manifest supplies a bounded text environment grant; bind to whatever environment cap a future capOSLaunchParameterssurface produces (no in-tree plan owns this yet; the shell proposal sketches the broader launch-args/environment discussion). Phase W.3 decision (2026-05-07 18:25 UTC): kept empty-by-default and shipped the argv text grant only. 2026-05-13 update: the same bounded manifest-text pattern now exists asinitConfig.init.wasiEnv, a CUE text list under the existinginitConfigCueValuefield (noschema/capos.capnpchange). Capacity bounds incapos-wasm/src/payload.rs:WASI_ENV_MAX_COUNT = 32environment entries.WASI_ENV_MAX_ENTRY_BYTES = 4096per entry (NUL terminator not included).WASI_ENV_MAX_TOTAL_BYTES = 8192for the packed environment buffer including per-entry NUL terminators. Interior NUL bytes inside an entry are rejected. The decoder tolerates an absent or emptywasiEnv, in which case Preview 1environ_get/environ_sizes_getreport zero entries (the W.2 behavior). A futureLaunchParameterscap remains the migration path for argv and environ together.
-
args_getsource. Reuse a future capOSLaunchParameterssurface (not yet in tree), or ship a wasm-host-specific text grant in the manifest until that surface lands? Resolved by Phase W.3 (2026-05-07 18:25 UTC) — bounded manifest-provided argv text grant oninitConfig.init.wasiArgs, migrating to the futureLaunchParameterscap once it exists. Original working answer: ship a small bounded text grant for v0; migrate to the futureLaunchParameterssurface once it exists. Phase W.3 decision (2026-05-07 18:25 UTC): shipped asinitConfig.init.wasiArgs, a CUE text list under the existinginitConfigCueValuefield (noschema/capos.capnpchange). Capacity bounds incapos-wasm/src/payload.rs:WASI_ARGS_MAX_COUNT = 32argv entries.WASI_ARGS_MAX_ARG_BYTES = 4096per entry (NUL terminator not included).WASI_ARGS_MAX_TOTAL_BYTES = 8192for the packed argv buffer including per-entry NUL terminators. Interior NUL bytes inside an argv entry are rejected (would corrupt the WASI Preview 1 NUL-terminated layout). Each violation surfaces through a stable wasm-host exit code so harnesses can distinguish them from generic decode failures. The decoder tolerates an absent or emptywasiArgs, in which case Preview 1args_get/args_sizes_getreport zero entries (W.2 behaviour). Migration to the futureLaunchParameterscap stays the open path per the original working answer.
-
Vendoring posture for wasmi.
vendor/wasmi-no_std/(forked, patched) or acargo-vendor-style mirror of upstreamdefault-features = false? Same question as the piccolo Lua track. Resolved 2026-05-05 19:12 UTC: mirror-as-is. The vendored snapshot atvendor/wasmi-no_std/wasmi-1.0.9/is a static-pinned copy of upstreamv1.0.9with no source patches; cargodefault-features = falsestripsstd/watcleanly out of the box. Provenance and refresh procedure are recorded invendor/wasmi-no_std/VENDORED_FROM.md. This posture is independent of what the Lua track chooses; if the two tracks diverge, document the divergence in each track’sVENDORED_FROM.md. -
WASI module distribution and versioning. Shipped inline in a manifest blob (today), or via a future
Store/Namespace? Working answer: inline blobs for v0; revisit after the storage proposals land. -
Component-Model adoption timeline. Skip Preview 1 entirely and target Preview 2 from day one? Possible with wasmtime, harder with wasmi today. Working answer: ship Preview 1 first because it unlocks Rust, C, Go, Python, TinyGo immediately; layer Preview 2 on once wasmi’s component support hardens or migrate to wasmtime.
-
Out-of-tree wasm packaging. Will capOS ship pre-built
.wasmbinaries from the boot manifest only, or will operators bring their own? Same scoping question as the futureLaunchParameters/ package-cap surfaces. Working answer: in-tree only for v0–v6; out-of-tree once aStorecap can hold blobs. -
Audit cap shape for wasm instance lifecycle events. Same open question as Lua scripting Phase 4. Component-Model paths benefit from per-instance audit because resource handoffs are interesting events to record. Working answer: defer until the userspace audit cap surface exists.
Progress 2026-05-13 16:46 UTC: §1 (per-instance vs per-process) and §3
(poll_oneoff semantics) resolved. §1 is locked at one wasm instance
per capos-wasm process, matching the per-process Runtime +
Store<HostState> shape shipped through Phases W.2–W.4 and the
per-process CapSet authority boundary; future multi-instance hosting
must come back as a separate proposal. §3 keeps the W.2 sub-slice 2
ERRNO_NOSYS poll_oneoff stub for v0 and pre-commits Phases W.5 / W.6
to extend it one subscription kind at a time (monotonic clock + fd
read/write in W.5 against Namespace/File caps, sockets in W.6
against TcpSocket/UdpSocket caps), demultiplexed onto a single
blocking cap_enter over the per-process ring; multi-instance reactors
remain out of scope. §6 (environ_get) and §7 (args_get) reclassified
as resolved by Phase W.3 (2026-05-07 18:25 UTC) with the bounded
manifest-text grants on initConfig.init.wasiEnv /
initConfig.init.wasiArgs; the migration path to a future
LaunchParameters cap is preserved.
Relationship to Other Proposals
- Userspace Binaries owns the broader native-binary, language, and POSIX-adapter roadmap. This proposal supersedes Part 5 of that proposal with the full WASI host adapter design.
- Programming Languages is the reader-facing summary of language support; the WASI row points at this proposal.
- Browser/WASM is the separate browser-hosted wasm experiment. Both proposals share wasm-runtime insight but target different substrates.
- Lua Scripting is the trusted capability-scoped script runner using a native (likely piccolo) Lua VM. WASI-hosted Lua is the untrusted alternative.
- Go Runtime is the native
GOOS=caposalternative to Go-on-WASI. Go-on-WASI is the v0 path for CUE evaluation; native Go is the path for full Go runtime semantics. - Storage and Naming defines the
Directory/File/Store/Namespacesurfaces that Phase W.5 consumes. - Networking defines the
TcpSocket/UdpSocketsurfaces that Phase W.6 consumes. - Service Architecture defines
Fetch/HttpEndpoint, useful as the v0 networking shim before the full userspace network stack lands.
Proposal: POSIX Compatibility Adapter
How capOS should host POSIX-shaped C software without recreating the ambient authority that makes POSIX hard to confine, and which two ports validate the adapter for the first time.
Problem
capOS is not POSIX and is not trying to become POSIX. But useful software – DNS resolvers, line-editing libraries, shells, archivers, compilers, network clients – assumes a POSIX surface. Rewriting each of these in capability- native Rust would forfeit decades of debugging, security review, and performance work for no isolation gain: a POSIX program whose only authority is a typed capability set is already as confined as an equivalent native one.
The risk pattern is the one POSIX historically gets wrong: a translation layer
that synthesises ambient authority (a global /, an inherited credential
table, a kernel-managed file descriptor map) rebuilds the property capOS is
trying to leave behind. A useful adapter must do the opposite – every POSIX
call must be backed by a typed capability the calling process already holds,
or it must fail closed with a documented errno.
Two upstream programs are the natural first validators of that adapter:
- A POSIX shell exercises the broadest surface (process, pipe, file, env, signal stubs, stdio).
- A DNS resolver exercises the smallest network surface (UDP socket, one-shot poll-equivalent, time, log).
Both are already small, mature, and BSD/MIT-licensed. Picking the smallest representative of each category makes the adapter’s first job a real port, not a synthetic test.
Scope
In scope:
- A two-layer C substrate:
libcapos(thin Rust staticlib, capability ring + CapSet + raw syscalls + heap, C ABI) andlibcapos-posix(POSIX shape on top: fd table, errno, path resolution, posix_spawn shim, signal stubs, pthread mapping). - A first POSIX shell port that builds against
libcapos-posixwith no hidden ambient authority. - A first DNS resolver port that builds against
libcapos-posixwith no hidden ambient authority. - Phase decomposition (P1.1, P1.2, P1.3) that defers the adapter’s biggest dependencies (Namespace + File caps for the shell file path; UDP cap for the resolver) into clearly-named gating phases.
- Validation through QEMU smokes that prove granted and ungranted paths.
Out of scope for the first implementation:
- Binary compatibility with Linux ELFs. Both ports are sources-on-disk
recompiled against
libcapos-posix. - Full POSIX compliance. The adapter ships exactly the surface dash and dns.c exercise, plus any free additions that fall out.
- Real
fork()(parent state inheritance, COW, sibling address-space surgery before exec). Onlyfork()followed promptly byexecve()is supported, via aposix_spawn-shaped shim. - Real signal delivery.
signal()/sigaction()accept the call, store the handler, never invoke it.kill(2)requires a futureProcessHandlecap. - Job control, process groups, sessions, controlling terminals.
- musl, glibc, or any other host libc. The substrate is Rust-authored and exposes a C ABI; it is not a libc port.
- Hosted C++. ABI decisions for C++ remain tracked in
docs/proposals/userspace-binaries-proposal.md.
Current Manual Pages
- Programming Languages summarizes POSIX
adapter status relative to Rust, C/C++, Python, Go, Lua, and WASI tracks;
the C row records the shipped
libcapos.a+libcapos_posix.asurface, and the POSIX-shaped software row records P1.1/P1.2/P1.3 closeouts plus the in-progress P1.4 dash-port phase shape over the bootstrap-granted rootDirectorycap surface, including the signal/time stub closeout. - Userspace Binaries Part 4: POSIX Compatibility Adapter sketches the POSIX adapter at a higher level. This proposal supersedes that sketch with the full design surface; the userspace-binaries proposal continues to own the broader native-binary, language, and adapter roadmap.
- Userspace Runtime documents the
implemented
capos-rtsurface thatlibcaposmirrors for C consumers. - Networking defines
NetworkManager,TcpListener, andTcpSocketand explicitly defersUdpSocketuntil DNS / userspace-network work needs it. The DNS resolver port in this proposal defines the UDP cap surface; the TCP cap surface is reused unchanged. - Storage and Naming defines the
Namespace,Directory,File, andStorecap shape; these gate the shell port’s filesystem surface (Phase 2/3 of that proposal). - Service Architecture frames the future
Resolvercap as the long-term consumer of the resolver process built in this track. - Shell covers the native
capos-shell. The POSIX shell port (dash) is for porting validation, not as a replacement for the native shell. - WASI Host Adapter is the parallel untrusted-portable execution path; both proposals share fd-table and per-import authority insight, but target different substrates.
Research Grounding
Relevant research and external references:
- POSIX shell candidates surveyed: dash (Debian Almquist Shell, ~13 kSLOC,
BSD; the canonical small POSIX-strict shell); busybox
ash; OpenBSD ksh (oksh); toyboxtoysh. Source repositories cited inline in the candidate comparison table. - DNS resolver candidates surveyed:
dns.cby William Ahern (single-file MIT, ~10 kSLOC, no dependencies); c-ares; GNU adns; udns; SPCDNS; musl’s embeddedres_query; trust-dns-resolver. Source repositories cited inline in the candidate comparison table. - libcapos prior art: this proposal builds on the
libcaposshape sketched in Userspace Binaries “Future: C vialibcapos” / “Future Phase: libcapos for C”. The C substrate is designed as a Rust staticlib with a C ABI rather than musl, redox relibc, or a hand-rolled libc. Fuchsia’s fdio + musl pattern and Redox’s relibc pattern are the comparable points; capOS deliberately picks neither. - POSIX surface translation: Cygwin’s
fork()emulation is the closest prior art for fork-for-exec semantics on top of a non-fork substrate; the capOS shim inverts the default (capOS cannot fork; the shim emulates the useful case) but uses the same call-pattern recognition.
In-tree research grounding:
- Genode – per-session typed service interfaces and resource accounting are the closest precedent for routing every POSIX wrapper through a typed cap rather than through an ambient kernel syscall table. POSIX adapter wrappers should follow the same pattern at the library boundary instead of the kernel boundary.
- OS Error Handling – cross-OS
comparison of error-model surfaces. Informs the bidirectional mapping
between
CapError/CapExceptionand POSIX errno (Open Question §4) and the decision to keep one shared mapping table at the C boundary rather than per-wrapper bespoke mappings. - LLVM Target – target triple, calling
convention, and bare-metal toolchain options for capOS C consumers.
Informs Open Question §11 on the linker / toolchain choice (
clang --target=x86_64-unknown-none-elf -nostdlib -static).
This proposal also lifts the capability-mapping shape and the “every
translation has authority backing” property from the WASI host adapter
proposal, and the libcapos staticlib shape from the userspace-binaries
proposal Part 2. It deliberately does not adopt the musl + __syscall
hook pattern noted in the userspace-binaries proposal “musl as a Base
(Optional, Later)” section, because the layered Rust staticlib shape is
preferred over a libc port for the v0 surface.
External:
- dash – Debian Almquist Shell, ~13 kSLOC, Debian’s
/bin/shsince Squeeze (2011). - busybox
ash– alternative Almquist port, embedded. - oksh – portable OpenBSD ksh, public domain, larger surface.
- toybox toysh – 0BSD, currently incomplete.
- c-ares – modern async DNS resolver, MIT, larger.
- dns.c – single-file non-blocking DNS, MIT, no deps.
- GNU adns – async DNS resolver, GPL-2.0+.
- musl resolver – embedded in musl libc; not available without linking musl.
- udns – small async stub-only resolver, LGPL-2.1.
Design Principles
- POSIX is not a kernel feature. The kernel sees ordinary userspace
processes with a CapSet and a capability ring.
libcaposandlibcapos-posixare static libraries linked into those processes. - Two layers, one C ABI per layer.
libcaposis the C-ABI mirror ofcapos-rt: capability ring, CapSet, raw syscalls, heap. It has no errno, no fd table, noopen/read/write.libcapos-posixbuilds the POSIX shape on top. Programs that do not need POSIX semantics may link onlylibcapos. - Authority is per-process, granted at spawn. Every fd a POSIX program
sees was granted to its parent process at spawn time and projected onto
an fd by
libcapos-posix. There is no ambient/, no inherited credential table, no global signal source. - Schema-first, not POSIX-first, at the boundary. Each POSIX wrapper is backed by a typed capability call with a documented errno mapping. POSIX-shaped integer fds and POSIX-shaped errno are an ABI requirement of the C substrate, not a capability-model concession.
- Fail closed. Any unimplemented POSIX call returns
ENOSYSand sets errno. Any cap lookup that fails returns the documented errno. Programs cannot probe absent caps for ambient behaviour. - No fork without exec. Only
fork()followed byexecve()is supported. The shim turns the pair intoposix_spawn(). Barefork()used to clone state in-process fails on the next non-trivial syscall. - No real signals. Handlers are accepted and stored, never delivered.
kill(2)requires a futureProcessHandlecap and even then is limited toSIGKILL. Programs that depend onSIGCHLDjob control are out of scope. - The C substrate is Rust.
libcaposandlibcapos-posixare Rust crates withcrate-type = ["staticlib"], all symbols#[no_mangle] extern "C". This is not musl, not a hand-rolled libc.
Architecture
flowchart TD
Shell["POSIX shell binary<br/>(e.g. dash)"]
Resolver["DNS resolver binary<br/>(e.g. dns.c)"]
Posix["libcapos-posix<br/>(POSIX adapter, Rust staticlib, C ABI)"]
PosixDetail["fd table per process<br/>path resolver over Namespace + Store<br/>errno mapping (TLS cell)<br/>posix_spawn over ProcessSpawner<br/>signal stubs<br/>pthread over ThreadSpawner"]
Posix --> PosixDetail
Capos["libcapos<br/>(thin Rust staticlib, C ABI)"]
CaposDetail["cap_call / capset_get / capset_iter<br/>sys_exit / sys_cap_enter<br/>heap (malloc/free over capos-rt allocator)<br/>typed wrappers for Console / Terminal / etc."]
Capos --> CaposDetail
Rt["capos-rt<br/>(no_std + alloc Rust)"]
Ring["capability ring"]
Kernel["kernel CapObject dispatch"]
Services["userspace services"]
Shell -->|"open/read/write/exec/..."| Posix
Resolver -->|"socket/sendto/recvfrom"| Posix
Posix -->|"extern C"| Capos
Capos -->|"Rust FFI re-export"| Rt
Rt --> Ring
Ring --> Kernel
Ring --> Services
libcapos is the C-ABI projection of capos-rt. libcapos-posix is the
POSIX projection on top. Every POSIX call ultimately resolves to either a
capability invocation through the ring or a synthetic answer (errno,
ENOSYS) computed without authority.
libcapos: C-Facing Substrate
Headers expected to ship under include/capos/:
// capos.h -- capability primitives only
typedef struct cap_ring cap_ring_t;
typedef uint32_t cap_id_t;
typedef uint64_t iface_id_t;
cap_ring_t *capos_ring(void); // process ring handle
int cap_call(cap_ring_t *ring,
cap_id_t cap, uint16_t method,
const void *params, size_t plen,
void *result, size_t rlen,
size_t *out_len);
int capset_get(const char *name,
cap_id_t *out_cap, iface_id_t *out_iface);
size_t capset_iter(void (*cb)(const char*, cap_id_t, iface_id_t,
void*), void *ud);
_Noreturn void sys_exit(int code);
uint32_t sys_cap_enter(uint32_t min_complete, uint64_t timeout_ns);
// Heap (backed by capos-rt fixed heap; grow-on-demand later if needed)
void *capos_malloc(size_t);
void capos_free(void*);
void *capos_calloc(size_t, size_t);
void *capos_realloc(void*, size_t);
There is no errno here, no open/read/write. Those live one
layer up. libcapos is the C-ABI mirror of capos-rt: startup, ring,
CapSet, raw syscalls, heap.
Build artifact: target/.../libcapos.a plus headers. Naming for the C
library is intentionally just libcapos, mirroring how the Rust
runtime crate is capos-rt. The C library name libcapos is
distinct from any Rust service framework that may carry a similar name;
this proposal owns the C-substrate name and treats Rust-framework
naming as out of scope.
libcapos-posix: POSIX Surface
Headers under include/capos/posix/: unistd.h, fcntl.h, errno.h,
sys/socket.h, netdb.h, sys/stat.h, dirent.h, string.h, stdlib.h
(subset), sys/types.h, pthread.h (subset), signal.h (stub).
Implementation language: Rust, same crate-type pattern as libcapos,
but linked separately so a binary that does not need POSIX can omit it.
Errno bridge: per-thread errno cell stored in TLS slot owned by
libcapos-posix; populated by every wrapper that maps a Rust CapError to
a POSIX errno value. See “errno Convention” below.
File descriptor table
Per-process userspace state inside libcapos-posix. Not a kernel object –
neither libcapos nor the kernel know anything about fds.
#![allow(unused)]
fn main() {
// libcapos-posix/src/fd.rs (sketch)
struct FdEntry {
backing: FdBacking, // Console / Stream / Listener / File / Dir
flags: i32, // O_NONBLOCK, FD_CLOEXEC, ...
cursor: u64, // for seekable backings
}
enum FdBacking {
Stdin, // Console / TerminalSession (read side)
Stdout, // Console (write side)
Stderr, // Console (write side)
File { file: Cap<File>, dirty: bool },
Dir { dir: Cap<Directory>, iter: usize },
Tcp { sock: Cap<TcpSocket> },
Udp { sock: Cap<UdpSocket> },
Listener { l: Cap<TcpListener> },
}
static FD_TABLE: Mutex<BTreeMap<i32, FdEntry>> = ...;
static NEXT_FD: AtomicI32 = AtomicI32::new(3);
}
dup/dup2/close operate on this table. dup increments a refcount on
the underlying cap; close releases when the last fd holding the cap drops.
Cap drop runs through capos-rt owned-handle release. The fd table is a
strict per-process userspace structure; it is not shared with the kernel
and is never serialised on the wire.
Standard fds wired at _start:
- fd 0:
stdincap from CapSet (TerminalSession, Console, or future StdinReader-shaped cap, whichever is granted). - fd 1:
stdoutConsole cap. - fd 2:
stderrConsole cap (or distinct Log cap if granted).
Process model: fork-for-exec only
capOS process creation is ProcessSpawner.spawn(name, binaryName, grants)
(kernel/src/cap/process_spawner.rs). There is no fork(), no
exec()-in-place.
Decision matrix (working answers; the policy choice is Open Question §6 and is not settled until that question is confirmed):
| Option | What it provides | Cost | Working answer |
|---|---|---|---|
Emulate fork() as posix_spawn with inherited cap-set, recording inter-call dup2/close as posix_spawn file actions | Existing fork+exec and fork+dup2+exec pipeline patterns work with one patch site | Daemonisation and arbitrary COW state inheritance between fork and exec still break | Recommended primary for the shell, with documented “fork-for-exec only” semantics. Whether the shim records inter-call file actions or requires the port to call posix_spawn with explicit file actions is Open Question §6. |
Return ENOSYS for any fork() | Honest | Every POSIX program that uses fork must be patched | Recommended safety net when fork-for-exec is misused |
| Process-shadow: a “POSIX process” wraps a capOS process | General | Large kernel + runtime change; doubles process accounting | Recommended reject for v0; revisit only if a real POSIX program needs it |
Working answer: fork-for-exec, with hard-fail as the safety net (subject to
Open Question §6 confirmation before P1.3 begins). Two libcapos-posix
shim variants are on the table; §6 selects between them:
- Variant A – recording shim.
libcapos-posixexposesfork()andexecve()as a coupled shim that:fork()records “next exec is the real spawn” in TLS and returns 0 unconditionally. Only theif (pid == 0)branch ever executes; the legacyelsebranch is unreachable becausepidis always 0. Porters MUST move the parent flow (drop unused write end, drain read end,waitpid) to AFTER the if-block, with the synthetic pid handed off viachild = execve(...);near the end of the if-body. Pictorially:
There is nopid_t child = fork(); // returns 0 unconditionally if (child == 0) { dup2(...); close(...); // recorded into TLS child = execve(...); // returns synthetic_pid > 0 if (child < 0) { // surface to error path goto exec_failed; } } /* parent flow runs here, NOT in an else branch */ close(...); read(...); waitpid(child, ...);elsebranch in the v0 contract, only the post-if parent flow.dup2()/close()calls betweenfork()andexecve()are recorded asposix_spawnfile actions on the pending spawn rather than mutating the parent’s fd table.execve(path, argv, envp)consumes the recorded intent, callsProcessSpawner.spawn()with attenuated grants and the recorded file actions, and returns the synthetic child pid as its own return value (a deliberate v0 deviation from POSIX). The pseudo-child branch is still the original parent process, so porters MUST NOT call_exit()on failure:_exit()would terminate the actual shell. The recommended pattern surfaces the failure to the caller’s normal error path:
On failureint spawn_pid = execve(...); if (spawn_pid < 0) { /* execve() failed before any spawn; recording state is * already cleared and the parent fd table is unchanged. * Return up to the caller with the matching errno. */ goto exec_failed; /* or equivalent error-recovery path */ } child = spawn_pid; /* parent flow: waitpid(child) */execve()returns -1 witherrnoset; callers MUST surface the failure to their normal error path rather than calling_exit(), because the pseudo-child branch is still the parent process and_exit()would terminate the actual shell.- Any
fork()not followed byexecve()before a syscall outside the recorded-action allowlist (e.g.setsid) returns -1 / ENOSYS on that downstream call.
- Variant B – patched-port shim.
libcapos-posixexposes onlyposix_spawn()with explicit file actions, plus stubfork()/execve()that return -1 / ENOSYS. Each port (dash and successors) is patched to translate its fork+dup2+exec sequence into a singleposix_spawn()call with the equivalent file actions.
posix_spawn() is the preferred primitive in either variant and gets a
direct mapping to ProcessSpawner.spawn(). The choice between Variant
A and Variant B is Open Question §6.
fd-backing-cap inheritance (kernel precursor). For a fork/execve child to
inherit a parent fd that is backed by an opened Directory/File cap, that cap
must be forwardable through ProcessSpawner.spawn. Read-only Directory/File
caps are now minted Copy/SameSession (directory::transfer_result_cap,
readonly_fs, installable_image, and the kernel:directory/kernel:file
bootstrap sources), so the shim can forward an opened read-only directory or
file to the spawned child as a Raw spawn grant; the child looks it up by name
from its CapSet and projects it back onto the inherited fd. The disk-backed
writable filesystem stays NonTransferable (single-writer policy), so a
writable fd cannot be inherited this way. The kernel handoff is proven in
isolation by make run-spawn-grant-directory; see
Capability Model “Read-only filesystem caps are
forwardable”. The recording shim emits these grants. As of posix-recording-shim-full-fd-inherit
(done 2026-05-27) inheritance is full-fd-table by default, matching POSIX
fork+execve: execve forwards every open parent slot – not only
dup2/close-touched ones – as a stdio_<child-slot> spawn grant, with the
recorded actions applied as edits on top of that baseline. Per backing:
Directory/Console/File/TerminalSession forward as SpawnGrantMode::Raw
over the Copy-transferable cap (the parent keeps its own fd; an aliased slot
Copy-shares to several child slots), and a Pipe end forwards as a single
Move (leaving the parent slot a Moved sentinel). A slot marked
FD_CLOEXEC/O_CLOEXEC is dropped from the child unless an explicit recorded
dup2 named that child slot (POSIX dup2 clears close-on-exec). A
non-forwardable backing inherited implicitly (Udp, an already-moved slot, or a
shared Pipe) is skipped non-fatally; an explicit dup2 of one fails closed.
The child’s posix_inherit_stdio() reconstructs each grant into the matching fd
slot by interface id, wrapping an inherited directory fd through fdopendir().
End-to-end proofs make run-posix-fd-inherit-default (parent inherits stdio +
directory by default with no stdio dup2; CLOEXEC fd excluded; terminal retained
via Raw; Copy-share alias) and make run-posix-execve-inherit-smoke (the
explicit-dup2 parent, now redundant but still correct). Because the v0 POSIX
open surface mints only Copy/SameSession File/Directory caps, the
disk-backed writable NonTransferable filesystem cannot enter the fd table
here; if a future writable open path mints one, full inheritance needs a
pre-spawn transferability check to skip it (today it would surface as the
whole-spawn ENOEXEC). An inherited File resets to offset 0 (the parent’s seek
position is userspace state that does not travel with the cap).
The recording-shim execve(path, argv, envp) path also forwards argv without
changing the generated ProcessSpawner.spawn(name, binaryName, grants) schema:
the parent validates the C argv vector, writes a bounded binary argv record into
a private Pipe, and grants only the read end to the child as posix_argv.
Child code opts in with posix_args(), which prefers posix_argv when present
and otherwise falls back to manifest initConfig.init.posixArgs through the
boot BootPackage cap. The pipe payload is capped by the existing 4 KiB Pipe
transport, so direct large manifest posixArgs remain the wider PID-1 channel.
Malformed or over-budget execve argv fails before fd-action replay; the focused
proof asserts this does not mutate the parent’s fd table.
Signals
Stubbed. capOS has no signal mechanism today and the cap model disagrees with ambient asynchronous interrupts.
signal()/sigaction()accept the call, store the handler in a per-process table, never invoke it. Return success.kill(pid, sig)returns -1 / EPERM unless the caller has aProcessHandlecap for the target – and even then the only signal honoured would beSIGKILL, which maps to a futureProcessHandle.kill()outside this v0 POSIX surface.raise(sig)returns -1 / ENOSYS. Self-delivery is still signal delivery, and capOS v0 intentionally does not fake it.sigemptyset/sigfillset/sigaddset/sigdelset/sigismemberare real bit operations on the caller’ssigset_t(auint64_t).sigprocmaskkeeps a per-process blocked mask so ports can save and restore it during job control, honoursSIG_BLOCK/SIG_UNBLOCK/SIG_SETMASK, and force-clearsSIGKILL/SIGSTOPper POSIX – but the mask is stored, never enforced, because there is no delivery to block.sigpendingalways reports an empty set for the same reason.pause()/sigsuspend()/sigwait()block forever (or with timeout) viasys_cap_enter(0, timeout); they never wake from a signal.SIGPIPEis never delivered. Writes on a closed connection return -1 / EPIPE.
This is acceptable for a shell + DNS resolver. Anything that depends on
real signals (job control with Ctrl-Z, Ctrl-C across pipelines, real
SIGCHLD) is out of scope for the first port. Job control in the shell
must be reimplemented over typed control caps, not signals.
errno convention
Per-thread errno cell in TLS owned by libcapos-posix. Mapping table
(libcapos-posix/src/errno_map.rs):
capOS CapError / CapException | POSIX errno |
|---|---|
CapError::NotFound | ENOENT |
CapError::PermissionDenied | EACCES |
CapError::Disconnected | ECONNRESET |
CapError::Timeout | ETIMEDOUT |
CapError::ResourceExhausted | ENOMEM / EMFILE (context dependent) |
CapError::InvalidArgument | EINVAL |
CapError::WouldBlock | EAGAIN |
| (fall-through) | EIO |
Wrappers always: clear errno, call, on error set errno + return -1 (int) or NULL (pointer). Same convention as glibc / musl.
Threading
pthreads -> capOS in-process threading. Substrate already exists in the
kernel: ThreadSpawner, ThreadControl, ThreadHandle, per-thread
FS-base, ParkSpace.
Mapping:
pthread_create->ThreadSpawner.spawn+ start-routine trampoline.pthread_exit->ThreadControl.exitThread.pthread_join->ThreadHandle.join(block viacap_enter).pthread_self-> TLS slot orThreadControl.currentId.pthread_mutex_*-> ParkSpace-backed mutex (futex-style park / unpark).pthread_cond_*-> ParkSpace + bounded waiter queue.pthread_key_*-> fixed-size TLS slot table per thread.
This is in scope but not on the critical path for the shell or DNS resolver – both can run single-threaded for v0. The pthread shim is deferred to a v1 successor.
First Port: POSIX Shell
Candidate survey
| Shell | License | Size | Deps | POSIX coverage | Verdict |
|---|---|---|---|---|---|
| dash (upstream) | BSD | ~13 kSLOC, ~134 KB | tiny libc subset; no readline; no termcap | Strict POSIX, no extensions | Recommended primary |
| busybox ash (upstream) | GPL-2.0 | ~8 kSLOC of shell/ash.c + busybox infra | Designed for embedded, modular | POSIX + selectable extensions | Heavier framework cost; useful later when capOS wants a coreutils set |
| toybox toysh (upstream) | 0BSD | currently incomplete | Designed for self-contained ELF | POSIX + Bash compat target, not finished | Skip – explicitly described upstream as still under development |
| oksh (upstream) | Public domain | ~308 KB binary, 0 deps | Optional ncurses for clear-screen only | Korn-shell superset of POSIX | Bigger surface than v0 needs to validate libcapos-posix |
| Custom Rust shell | n/a | n/a | n/a | n/a | Reject – defeats the purpose of porting C. Native shell already exists at shell/ (capos-shell). |
Recommended primary: dash.
Reasons:
- Smallest established POSIX-strict shell. ~13 kSLOC is small enough for the porting team to read the entire codebase.
- No readline / termcap dependency. The shell talks to whatever fd 0
gives it. This is exactly what
libcapos-posixprovides throughTerminalSessionorConsole. - Strict POSIX means the port does not accidentally validate Bash
extensions that
libcapos-posixdoes not implement. - Already proven as a porting target on Linux from Scratch, OpenWrt, and
Alpine. Patterns for replacing the libc layer (
__syscall, stubbedsigaction) are well documented. - Debian uses it as
/bin/shsince Squeeze (2011), so any “POSIX shell only” script base in the wild is dash-compatible.
Open Question §1 below records this candidate as the final decision
(Decided (P1.4 Slice 1, 2026-05-24 00:53 UTC)).
Required POSIX surface (v0)
What a dash instance actually exercises before printing a prompt and
running ls | grep foo:
| Group | Calls (minimum set) | Backed by |
|---|---|---|
| Process startup | _start shim, argv/envp parsing, exit | libcapos _start, sys_exit |
| Stdio | read(0,...), write(1,...), write(2,...) | Console / TerminalSession cap |
| Allocation | malloc/free/calloc/realloc | libcapos heap |
| String/format | printf/fprintf/memcpy/strlen/strcmp/strchr/strncpy/… | libcapos-posix string/printf subset |
| File I/O | open/close/read/write/lseek/stat/fstat/access/unlink | Namespace + File caps |
| Directory | opendir/readdir/closedir | Directory cap |
| Pipes | pipe(), dup2(), close() on fds | NEW Pipe capability (P1.3) |
| Process | fork+execve (fork-for-exec only), posix_spawn, wait/waitpid | ProcessSpawner + ProcessHandle.wait |
| Env | getenv/setenv/putenv | Per-process env vector in libcapos-posix; populated from a future LaunchParameters cap when one lands |
| Signals | signal/kill/sigaction (stubs) | TLS-stored handlers, never delivered |
| Time | time/gettimeofday/nanosleep | Timer cap |
| Control flow | setjmp/longjmp over jmp_buf | libcapos x86_64 SysV global_asm (<setjmp.h>); no sigsetjmp |
| Misc | getpid/getuid/getgid | getpid from capos-rt bootstrap pid; uid/gid hardcoded for v0 |
The control-flow row was absent from the original minimum set above; dash’s
exception/interpreter control flow is built on setjmp/longjmp over a real
jmp_buf (pervasive in error.h/main.c/eval.c/parser.c/trap.c), so it
is a hard precursor for the dash build pipeline. It landed via the
libc-setjmp-longjmp task: the x86_64 SysV primitive in libcapos/src/setjmp.rs
with a <setjmp.h> header, re-exposed under libcapos-posix/include/capos/posix/,
and proven in QEMU by make run-posix-setjmp. sigsetjmp/siglongjmp are
intentionally absent (dash uses only the plain primitive; the v0 signal layer
has no asynchronous delivery and thus no signal mask to save).
Like the control-flow row, the table above also understated the header
layout and breadth of the libc surface a program of dash’s size needs. A
-nostdinc compile/link probe of the full vendored dash TU set
(2026-05-25 21:40 EEST) showed dash uses bare POSIX includes
(<unistd.h>, <fcntl.h>, …) — not the capOS capos/posix/*.h namespace — so
it requires a -nostdinc capOS POSIX sysroot plus a missing surface. This
landed (2026-05-25 22:23 UTC, libc-dash-sysroot-surface):
libcapos-posix/sysroot/include/ is the bare-header sysroot forwarding into
the capos/posix/* namespace, and the surface was completed —
strerror/qsort/umask/abort/setlocale/getrlimit/times/tcgetattr/
strtoll/strtoull/sig_atomic_t/NSIG/sigsuspend, the str* set, the
<termios.h>/<sys/resource.h>/<sys/times.h>/<locale.h>/<sys/types.h>
headers, and further items the table still understated: the C/POSIX-locale
multibyte layer (<wchar.h>/<wctype.h>, mbrtowc/wctype/iswctype/…) that
expand.c uses unconditionally, strpbrk, lstat, getgroups, wait3,
vfork, byte-order helpers, environ, and the sys_siglist array. The full
vendored dash TU set now compiles -nostdinc against the sysroot with no
unresolved libc symbols; proof make run-c-libc-surface. The dash build
pipeline (posix-p1-4-dash-build-pipeline) landed on top of it
(2026-05-26 05:11 UTC): make dash builds and links target/dash/dash.elf.
See docs/backlog/posix-adapter-dash-port.md Slice 12.5.
Critical gap: pipe(). The shell pipeline ls | grep foo requires fd 1
of ls to feed fd 0 of grep. capOS has no pipe capability today. This is
the first-port-blocking item; see Phase P1.3.
What dash will not get in v0
- Job control (Ctrl-Z,
bg,fg,&background): requires realSIGCHLD/SIGTSTP. Skip; documented as out of scope. - Process groups, sessions, controlling terminals: same reason.
trapfor signals other thanEXIT: handlers stored, never fired.read -t(timeout): doable via Timer cap; defer to v1.ulimit: returns 0 / ENOSYS. Quotas are kernel-side capability ledgers, not POSIX rlimits.
Validation smoke
make run-posix-shell-smoke:
- Boot a manifest that grants
dashaTerminalSession(stdio), a read-only bootstrap-grantedDirectorycap rooted at a tiny in-rodata pseudo-fs (the resolver remainsNamespace-shaped for forward parity with the future userspaceNamespaceservice; the v0 manifest grants aDirectorybecause that is what Storage Phase 3 slice 2 ships as a kernelCapObjecttoday), aProcessSpawnernarrowed to one allowed binary (ls-shim), and aTimercap. - Pipe a heredoc into stdin:
ls; echo done. - Assert kernel log shows
doneand clean exit.
Stretch goal smoke: cat foo | grep bar end-to-end (depends on the pipe
primitive landing).
First Port: DNS Resolver
Status update (post-smoltcp). The original v0 DNS smoke (
posix-dns-resolver, Phase P1.2 Phase B) drove a hand-rolled A query through a raw kernelUdpSocketcap; that smoke is retired with the qemu-only kernel UDP owner. Name resolution now goes through a typed systemDnsResolvercapability (network-system-dnsresolver-cap-local-proof), andlibcapos-posixexposes the standard POSIX surface over it:getaddrinfo/freeaddrinfo/gai_strerror(src/netdb.rs,include/capos/posix/netdb.h) resolve one IPv4Aresult through a granteddns_resolverendpoint and map the typed resolver status ontoaddrinfo/EAI_*, with no ambient UDP fallback (a process without the cap gets a deterministicEAI_FAIL). A read-only/etc/resolv.confprojection is materialized atopen()time from the resolver status (writes fail closed withEACCES; absent without the cap). Proof:make run-posix-getaddrinfo. The candidate survey below is retained as the original design rationale; vendoreddns.cis no longer on the critical path for the resolver bridge. AAAA /sockaddr_in6,AI_*flags, and/etc/servicesremain follow-ups (each fails closed:EAI_FAMILY/EAI_BADFLAGS/EAI_SERVICE).
Candidate survey
| Library | License | Source size | Deps | Async style | Verdict |
|---|---|---|---|---|---|
musl res_query (upstream) | MIT | ~2 kSLOC for resolver core | Embedded in musl | Synchronous (parallel queries internally) | Available only if the build links musl; capOS does not. Skip. |
| c-ares (upstream) | MIT, C89 | ~30+ kSLOC, multi-file, configure-driven | POSIX sockets, optional threads | Native async (callbacks + select/poll/event loop) | Largest surface, most mature, most invasive port |
| dns.c (wahern) (upstream) | MIT | single-file C, ~10 kSLOC, no deps | None – caller provides socket I/O via three pluggable patterns (pollfd / events / timeout) | Non-blocking, no required callback shape | Recommended primary |
| GNU adns (upstream) | GPL-2.0+ | Multi-file, ~10-15 kSLOC | POSIX, no event-loop integration | Async, opaque state | License is GPL-2.0+, not BSD/MIT. Skip unless capOS accepts a GPL component in the demo path. |
| udns (upstream) | LGPL-2.1 | small | POSIX | Async stub-only | LGPL plus older project; skip unless dns.c blows up |
| SPCDNS | LGPL | small | encode/decode only, no socket | n/a | Skip – provides no resolver loop |
| trust-dns-resolver in Rust | Apache-2 / MIT | large | Tokio | async | Reject – defeats the purpose of porting C. Native Rust resolver is a separate path. |
Recommended primary: dns.c by William Ahern.
Reasons:
- Single-file, zero deps. Drops into the build with a minimal
ccrule. The build avoids configure scripts, pkg-config, optional feature matrices, and multi-file build orchestration. - No fixed I/O model. dns.c is designed around three common methods
(pollfd, events, timeout). The host adapter plugs capability-backed
socket I/O without rewriting the resolver core, replacing
socket()/sendto()/recvfrom()/poll()withlibcapos-posixwrappers that return fd-shaped results backed byUdpSocket/TcpSocketcaps. - MIT license is capOS-compatible.
- ~10 kSLOC means port review can read it end-to-end.
- C89, no threading assumption, no global state surprises (resolver handle is opaque per-instance) – fits a single-process v0 design.
Open Question §2 below records that the candidate is a recommendation, not a final decision.
Required POSIX surface (v0)
The DNS resolver port exercises a very narrow POSIX subset:
| Group | Calls | Backed by |
|---|---|---|
| Stdio (logs only) | write(2,...) | Console cap |
| Allocation | malloc/free/calloc/realloc | libcapos heap |
| Time | clock_gettime/gettimeofday | Timer cap |
| Sockets (UDP) | socket(AF_INET, SOCK_DGRAM, 0), sendto, recvfrom, bind, close, setsockopt (subset) | NetworkManager + UdpSocket cap |
| Polling | poll(fds, nfds, timeout_ms) | Synthesised: each fd carries its underlying cap; libcapos-posix uses cap_enter(min_complete=1, timeout_ns) with one CQE per ready fd. No new kernel surface needed for v0 if dns.c uses one fd per query. |
| Resolv config | One in-rodata bounded text blob inlined into libcapos-posix (single nameserver entry; v0 ships before any storage cap exists) | No open / Namespace cap required for v0 |
No pipes, no fork, no exec, no signals, no /etc/resolv.conf-by-path,
no Namespace or File caps required. The DNS resolver is strictly easier
than the shell.
The v0 surface intentionally omits TCP fallback for truncated responses
and intentionally omits any path-based config file. The optional TCP
fallback row uses socket(SOCK_STREAM), connect, send, recv
through the existing NetworkManager + TcpSocket cap, but only on a
later iteration once the v0 UDP-only smoke is green; see “What dns.c
will not get in v0” below.
Critical gaps:
UdpSocketcapability. The networking proposal Phase B implements TCP + listener only; UDP “is deferred until the userspace network stack or DNS work needs it; it is not part of the Telnet Shell Demo contract” (networking-proposal.md). The resolver port creates the UDP path; it does not consume an existing one.- The future
Resolvercap concept (inservice-architecture-proposal.md“DNS resolver – consumes aUdpSocket, exportsResolver”) is a target once the UDP path exists. The first port produces the exported shape.
What dns.c will not get in v0
- DNSSEC validation: dns.c supports it, depending on
/etc/resolv.conftrust anchor config. Defer. - TCP fallback for truncated responses: implement on a second iteration once the TCP capability path is reusable.
mDNS: out of scope.- Recursive mode (acting as a recursive resolver): out of scope; v0 ships stub-only.
Validation smoke
make run-posix-dns-smoke:
- Boot a manifest that grants the resolver process a
NetworkManager(or future narrowedUdpSocket-only authority), a Console cap, and a Timer cap. The single-nameserver resolv config is the in-rodata bounded text blob compiled intolibcapos-posix; no Namespace or File cap is needed for v0. - The resolver opens a UDP socket, sends a query for a known A record to QEMU’s user-mode 10.0.2.3 (slirp’s built-in DNS) or to an in-host test resolver.
- Resolver prints the resolved IPv4 address.
- Assert kernel log line matches.
Trade-offs and Ordering
Smallest-deps comparison
| Port | C surface needed | New capOS infrastructure required | Difficulty |
|---|---|---|---|
| DNS resolver (dns.c) | malloc, time, socket subset, write(2), open RO file, poll-equivalent | UDP socket cap + NetworkManager exposure of UDP; otherwise reuses Phase B TCP path infra | Smaller – strictly additive (UDP is missing today but the kernel-side smoltcp stack supports it) |
| POSIX shell (dash) | malloc, full stdio, file I/O, directory iteration, pipe(), fork-for-exec, exec, wait, env, time, signals (stub) | Pipe primitive (new), Namespace+File cap surface, ProcessSpawner sidecar work to honour fd-action grants, env-vector handoff | Larger – touches storage / IPC / process surfaces |
Which blocks which
- Both ports can run in parallel at the
libcapos/libcapos-posixlayer level: each pulls a disjoint subset of POSIX surfaces. - DNS resolver blocks on a new capOS surface (UDP cap exposure) but does
not block on
pipe(),fork(), orexec(). - Shell blocks on (in order of probable cost): pipe primitive,
ProcessSpawner fd-action support for stdin / stdout redirection,
Namespace+File cap availability, env vector /
LaunchParameters. - The library substrate (
libcaposstaticlib +libcapos-posixscaffold) blocks both. Once the substrate exists, the two ports proceed in parallel.
Recommended sequence
- libcapos staticlib v0 (Phase P1.1). The thin Rust
.awithcap_call,capset_get,sys_exit,sys_cap_enter, heap. Plus a “C hello world” smoke that callsconsole_write_line()(mirrors the userspace-binaries proposal “Future Phase: libcapos for C”). This phase is the prerequisite for both P1.2 and P1.3. - libcapos-posix scaffold – fd table, errno cell, stdio wrappers for
fd 0/1/2, stub signals,
_startglue that registersargv/envpfromLaunchParameters(or empty arrays if that surface has not landed), basicmalloc/freere-export. - dns.c port (Phase P1.2). The schema half of P1.2 (the
UdpSocketinterface andNetworkManager.createUdpSocketmethod) landed in Phase A and released the shared schema serial surface; Phase B (kernel UDP path,libcapos-posix,dns.cvendoring, demo) does not re-acquire the surface and so does not contend with P1.3 on the schema half. - dash port (P1.3 lays the pipe + fork-for-exec primitives;
Storage Phase 3 slices 1-3 land the kernel-side
File/Directory/Store/NamespaceCapObjects andKernelCapSourcegrant sources that back the dash v0 smoke’s read-only in-rodata pseudo-fs; the actual dash vendoring is a successor task that owns the libcapos-posix file / dir / stdio / env / printf surface and the smoke harness rather than new kernel surface). P1.4 does not touchschema/capos.capnpand so does not contend on the shared schema serial surface.
Critical path
The DNS resolver is the smaller-deps first slice only because of the
shell’s pipe / file dependencies. With P1.3 (pipe + fork-for-exec) and
Storage Phase 3 slices 1-3 (RAM-backed File / Directory / Store /
Namespace CapObjects) both landed, the shell-first prerequisite
gates are closed; the remaining P1.4 work is dash vendoring +
per-call-site patching, the multi-translation-unit C build, and the
smoke harness.
What this slice does not promise
- Not a path to running glibc-built binaries unchanged. Both ports are
sources-on-disk recompiled against
libcapos-posix. Binary compatibility with Linux ELFs is not in scope. - Not job control, not signals, not full POSIX session/pgrp model.
- Not a libc – the POSIX surface ships just enough for dash and dns.c.
printffamily lands inlibcapos-posixonly because both ports need it; this is not a<stdio.h>for general use. - Not a reason to skip the native Rust paths –
capos-shell(Rustshell/crate) remains the default capOS shell. dash is for porting validation, not as the system shell. - Not a foundation for hosted C++. C++ requires explicit ABI decisions
tracked separately in
docs/proposals/userspace-binaries-proposal.md.
Phase Decomposition
Phases are dispatch-ready. P1.1 closed 2026-05-05 13:28 UTC at merge
fe5f5208. P1.2 splits into Phase A (closed 2026-05-05 18:02 UTC,
schema additions + open questions + capos-rt typed client) and Phase B
(open, kernel UDP path + dns.c demo). P1.2 Phase B does not touch
schema/capos.capnp and so does not contend with P1.3 on the shared
schema serial surface; P1.3 still adds a Pipe interface and must
queue on the surface per docs/backlog/index.md Concurrency Notes when
selected.
Phase P1.1 – libcapos C-substrate v0 + C hello-world smoke
Closed 2026-05-05 13:28 UTC at merge fe5f5208 (initial slice
b2e09bce, transfer-record helper 81a88fab). Delivered scope:
- New crate
libcapos/withcrate-type = ["staticlib"](cargo[lib].name = "capos"so the archive lands aslibcapos.a) exposing the capos-rt syscall, ring CALL, CapSet lookup, typedConsole.writeLinewrapper, andmalloc/free/calloc/reallocheap shims throughextern "C". - Public C header at
libcapos/include/capos/capos.h. make c-hellobuilds the C smoke directly with clang + lld using the shareddemos/linker.ld, links againstlibcapos/target/.../libcapos.a, and reuses capos-rt’s_startthrough libcapos’scapos_rt_maintrampoline.- Demo
demos/c-hello/(single.cfile callingconsole_write_line). - Manifest
system-c-hello.cue. - No POSIX surface, no errno, no pthreads.
- Validation:
make run-c-helloboots; the C binary prints[c-hello] hello from c-hello(the markertools/qemu-c-hello-smoke.shgreps) and exits cleanly.
Phase P1.2 – UDP cap surface + dns.c stub resolver smoke
P1.2 splits into two dispatch waves so the kernel-side wave can
serialise behind the active DDF hostile-smoke work on
kernel/src/cap/network.rs and kernel/src/virtio.rs without holding
the schema-only wave.
Phase P1.2 Phase A – schema + open questions + capos-rt client
Closed 2026-05-05 18:02 UTC. Delivered scope:
- Open questions §2 (DNS resolver = dns.c by William Ahern), §4 (errno
via per-thread TLS cell exposed through
__errno_location()), §5 (static-array fd table inlibcapos-posix, 32-fd cap for v0), and §8 (four-method blocking UDP shape with the wait deadline owned by the ring client, not a per-methodtimeoutNsparameter) resolved in this proposal. - Schema additions to
schema/capos.capnp: newUdpSocketinterface (sendTo,recvFrom,close) plus the newNetworkManager.createUdpSocketmethod. Generated bindings refresh verified viamake generated-code-check. - New
UDP_SOCKET_INTERFACE_IDconstant incapos-config/src/lib.rs. - New typed
UdpSocketClientincapos-rt/src/client.rs, mirroring the existingTcpSocketClientshape (create/send_to/recv_from/close). - Schema serial-surface release: this slice held the surface during schema additions and released it at merge.
Phase P1.2 Phase B – kernel UDP path + dns.c + demo
Closed 2026-05-05 21:21 UTC. Delivered scope:
- Kernel: extended
kernel/src/cap/network.rswith the UDP path mirroring the existing TCP path (UdpSocketCap,handle_create_udp_socket/handle_udp_send_to/handle_udp_recv_from/handle_udp_socket_close, deferred-recv parking viaPendingUdpRecv), and added UDP runtime methods on the existing scheduler-polled smoltcp runtime inkernel/src/virtio.rs(create_udp_socket/send_udp/recv_udp/close_udp_socketover a boundedMAX_PUBLIC_UDP_SOCKETSslot table with generation-bumped handles). - New standalone Rust staticlib crate
libcapos-posix/(NOT a workspace member, mirrors the libcapos pattern) producinglibcapos_posix.a. Provides:- per-process static-array fd table (
MAX_FDS = 32), per Open Question §5; - single-thread errno cell exposed via
__errno_location(), per Open Question §4; socket(AF_INET, SOCK_DGRAM, 0)/sendto/recvfrom/close()overUdpSocketandclock_gettime(CLOCK_MONOTONIC, ...)/gettimeofday(&tv, NULL)overTimer(single-shotTimer.now()calls in v0; long retry loops handled by the consumer).- C headers under
libcapos-posix/include/capos/posix/:errno.h,sys/socket.h,time.h,unistd.h. - Reuses libcapos’s installed runtime through a renamed extern crate
libcapos_::runtime::with(...)(the underscore avoids colliding with libcapos’s C-sidecapos_*exports). libcapos was promoted tocrate-type = ["staticlib", "rlib"]to support this.
- per-process static-array fd table (
- Vendored
vendor/dns-c-wahern/(William Ahern dns.c atrel-20160808, commit4ec718a77633c5a02fb77883387d1e7604750251, MIT). Mirror-as-is; onlysrc/dns.candsrc/dns.hretained alongsideLICENSEandREADME.mdper the WASI W.1 vendoring discipline. Seevendor/dns-c-wahern/VENDORED_FROM.md. - New C smoke
demos/posix-dns-resolver/main.cthat links againstlibcapos.a+libcapos_posix.aand drives a hand-rolled DNS A query forexample.comto QEMU slirp DNS at 10.0.2.3:53. The binary uses the vendored dns.c as a reference but does NOT compile dns.c whole into the smoke. Rationale: dns.c expects a POSIX header set (signal.h,fcntl.h,poll.h,netinet/in.h,arpa/inet.h,netdb.h,sys/select.h,sys/un.h) substantially wider than the v0libcapos-posixsurface. Compiling dns.c whole would require either patching the vendored tree or shipping a much larger POSIX header surface than this slice scopes; documented as follow-on work inVENDORED_FROM.md. - New focused-proof manifest
system-posix-dns.cue(own CUE package, imports the sharedcapos.local/cue/defaultspackage per the slice-3 defaults pattern) granting the smokeconsole,network_manager, andtimer. - New Makefile target
run-posix-dns-smokeand harnesstools/qemu-posix-dns-smoke.sh. The smoke prints[posix-dns-resolver] resolved example.com -> <addr>(an arbitrary IPv4 dotted-quad slirp returns from upstream resolution) and exits cleanly. Verified at2026-05-05 21:21 UTC:make run-posix-dns-smokereturns 0 withresolved example.com -> 104.20.23.154in the kernel log;make run-netregression keeps S.11.2.7 / S.11.2.8 hostile-smoke proof lines green.
Depended on Phase P1.1 and Phase P1.2 Phase A.
Phase P1.3 – Pipe capability + fork-for-exec scaffolding
Closed 2026-05-07 09:55 UTC. make run-posix-pipe-smoke is
the load-bearing gate; it drives the dash-shaped pipeline pattern
end to end through the kernel Pipe capability and the
recording-shim fork+execve path.
What landed:
- Schema: new
Pipeinterface (read/write/close/isClosed) andProcessSpawner.createPipe(bufferBytes). The generatedtools/generated/capos_capnp.rsbaseline was refreshed through the canonical capnpc step andmake generated-code-checkpasses. - Kernel:
kernel/src/cap/pipe.rsships the bounded SPSC byte ring with EOF-on-close semantics, kept symmetric with the UDP recv ceiling (4 KiB). Each cap half stores anArc<PipeShared>plus a direction; close on one side flips the shared closed flag and the per-tick poll completes the peer.kernel/src/cap/mod.rsandkernel/src/sched.rsintegrate the new poll alongside the existing network poll. - Kernel:
kernel/src/cap/process_spawner.rsgainshandle_create_pipe, mirroring the UDP-socket result-cap transfer pattern. The existingspawnMove-grant path is reused; no changes to the spawn ABI. - Userspace runtime:
capos-rt/src/client.rsexposes typedPipeClient(read/write/close/isClosed and matching*_wait) plusProcessSpawnerClient::create_pipe / create_pipe_waitand theCreatePipeResultprojection of the two transferred halves. libcapos-posix: newpipe.rsandprocess.rsmodules. The fd table grows aFdBacking::Pipevariant;dup_for_dup2()clones theOwnedCapability<Pipe>so an aliased fd does not release the underlying cap until the last fd drops.pipe,read,write,dup,dup2,fork,execve,waitpid,_exit, andposix_inherit_stdioare exposed via C ABI.dup2andcloseinside a fork-recording window route throughprocess::maybe_record_dup2/maybe_record_closerather than mutating the parent fd table;execveconsumes the recorded actions asstdio_<N>spawn grants –Pipe/TerminalSessionforwardedMove,Console/Directory/FileforwardedRawover their Copy-transferable caps – and returns the synthetic child pid as its own return value so the user pattern becomesint spawn_pid = execve(...); if (spawn_pid < 0) /* surface error to the caller; do NOT _exit because the pseudo-child branch is still the parent process */; child = spawn_pid;(nosetjmp/longjmpinvolved – earlier iterations longjmp’d back to thefork()call site, which dropped back into a returned-and-deallocated stack frame and was undefined behaviour). After a successful spawn, eachMove-granted source fd slot is replaced with aFdBacking::Movedsentinel and the underlyingOwnedCapabilityis forgotten so the parent does not queue a stale CAP_OP_RELEASE for the moved cap_id; a subsequentclose(src)on the parent side (the dash-shaped pattern’s “I no longer hold the write end”) removes the sentinel without a kernel round trip. ARaw/Copy grant (Console/Directory/File) is non-destructive: the parent’s own fd is restored intact, since the kernel handed the child a separate alias. The child side adopts eachstdio_<N>grant back into slotNby interface id (fd::inherit_stdio_grants), wrapping an inherited directory fd throughfdopendir(); proofmake run-posix-execve-inherit-smoke.libcapos-posixsuccessor surface: directposix_spawnandposix_spawn_file_actions_init/destroy/adddup2/addclosereuse the same action-replay helper behind the recording-shimexecvepath. Recording-shimexecvenow delivers argv through the privateposix_argvPipe grant described above. Directposix_spawnstill acceptsargvandenvpfor source compatibility but does not deliver them to the child yet; direct-spawn argv/environment remain empty until a typed LaunchParameters / environment grant exists.libcapos-posixstdio successor: landed at commitaa6a56d7(2026-05-13 11:03 UTC). fd 1 and fd 2 initialize to the granted Console cap when present, but only after anystdio_<N>recording-shim grants have been adopted into their slots. fd 0 is not synthesized from Console;read(0, ...)stays closed unless a real stdin backing is granted.make run-posix-stdio-smokeprints distinct stdout/stderr markers through POSIXwriteand proves the no-stdin refusal path.- Demo:
demos/posix-pipe-shim/main.c(parent) anddemos/posix-pipe-child/main.c(child). The parent pipes, forks, the child-pseudo-context dup2()s the write end onto STDOUT_FILENO, closes both pipe fds, and execve()s the child; the child callsposix_inherit_stdio(), writes “hello via pipe” to fd 1, closes it, and exits 0; the parent drains the read end throughread()until EOF,waitpid()s, and emits[posix-pipe] read 14 bytes: hello via pipe. - New manifest
system-posix-pipe.cue(own CUE package, imports the sharedcapos.local/cue/defaultspackage). New Makefile targetrun-posix-pipe-smokeand harnesstools/qemu-posix-pipe-smoke.sh. Verified2026-05-07 09:55 UTC:make run-posix-pipe-smokereturns 0 with the proof line in the kernel log;make run-smokeandmake run-spawnregressions stay green. - Schema serial-surface coordination: held the surface for the P1.3 schema additions and released on merge.
Open Question §6 closed: Variant A (recording shim) is the
adopted answer. fork() records “next exec is the real spawn” in
TLS and returns 0; the shim translates inter-call dup2/close into
spawn-grant Move actions; and execve() performs the spawn and
returns the synthetic child pid as its own return value (the
caller forwards the pid to the parent flow’s waitpid via
int spawn_pid = execve(...); if (spawn_pid < 0) /* surface error to the parent's normal error path; the pseudo-child branch is still the parent process so do NOT _exit */ ; child = spawn_pid;).
Earlier iterations used setjmp / longjmp to
fake the fork-return-twice semantic; that approach was replaced
because the longjmp jumped back into fork()’s already-returned
(and deallocated) stack frame, which is undefined behaviour. Variant
B (patched-port posix_spawn only) is rejected for v0. Variant
A still requires a small dash-side patch – the four-line
“capture spawn_pid; bail on -1; assign back to child” snippet at
each fork-exec site – because successful execve() now returns the
synthetic pid where unmodified dash assumes execve only returns on
failure. That patch surface is much narrower than Variant B’s
“consolidate every fork+dup2+exec into a single posix_spawn call
with explicit posix_spawn_file_actions” rewrite, which is why
Variant A is the chosen v0 path. A 2026-05-13 successor exports the
direct posix_spawn() surface over the same code path. Recording-shim
execve argv now travels through a private posix_argv Pipe grant; direct
posix_spawn argv/envp remain ignored until LaunchParameters / environment
support lands.
Open Question §9 closed: kernel-allocated bounded SPSC ring
with EOF-on-close, exposed as two cap halves sharing
Arc<PipeShared>. Reader-closed surfaces bytesWritten = 0 to
the writer (the EPIPE-equivalent chosen to avoid expanding the
kernel ExceptionType vocabulary). Writer-closed surfaces eof = true to the reader after the buffered bytes drain. The shared
MemoryObject + userspace ring alternative is rejected
because EOF across process exits and bounded waiter wake
semantics need kernel-side state anyway.
Depended on Phase P1.1.
Phase P1.4 – dash vendoring + libcapos-posix file/dir/stdio/env/printf surface
Status (2026-05-23 07:52 UTC): in flight. Slice 3 (libcapos-posix
FdBacking File / Directory / Terminal variants + smoke) closed at commit
ae58f936; Slice 4 (absolute-path resolver over a bootstrap-granted root
Directory cap plus functional open()/opendir()) landed at commit
94b29177; the posix-file-directory-client-capos-rt closeout at commit
f97d9833 (2026-05-23 06:23 UTC) adds functional lseek(), lazy
readdir() over Directory.list, and the focused make run-posix-file
proof. Slice 7 adds the focused printf/string C library subset and proves it
with make run-posix-printf. Slices 8/9 add signal-registration stubs plus
Timer-backed time() / nanosleep() / sleep() and prove them with
make run-posix-signal-time. The kernel-side capability surface required for
the v0 dash smoke landed under
Storage and Naming Phase 3 slices 1-3:
RAM-backed File (kernel/src/cap/file.rs), Directory
(kernel/src/cap/directory.rs), and Store / Namespace
(kernel/src/cap/store.rs, kernel/src/cap/namespace.rs) CapObjects,
plus the matching KernelCapSource::file / directory / store /
namespace manifest grant sources, are sufficient backing for the
“read-only Namespace cap rooted at a tiny in-rodata pseudo-fs” the
smoke described in §Validation smoke needs. Earlier proposal drafts
called Phase P1.4 “blocked on the Namespace + File cap surface”;
that framing is stale – the open work has moved out of the kernel and
into the libcapos-posix userspace surface, the dash port itself, and
the smoke harness. A userspace Store / Namespace service over a
real backing store (the remaining Phase 3 item in the storage proposal)
is not a prerequisite for the v0 dash smoke; the kernel
bootstrap-grant Directory cap is the v0 backing.
The concrete checklist lives in docs/proposals/posix-adapter-proposal.md Task 4
and the long-form decomposition is in
docs/backlog/posix-adapter-dash-port.md. This proposal records the
phase shape and the substantive outstanding work groups; the backlog
file owns per-step ordering.
Current closed surfaces and outstanding work groups, all in userspace and userspace-adjacent harness surface (no further kernel cap work needed for the v0 smoke):
- dash vendoring + patch. Closed (
posix-p1-4-dash-vendor,2026-05-24 19:40 UTC). dashv0.5.13.4is vendored mirror-as-is (full upstream tree, byte-identical) undervendor/dash/withvendor/dash/VENDORED_FROM.md. The per-call-site Variant A patch (captureexecve()’s synthetic pid return value, bail on-1, assign back tochild) – the shape recorded in Open Question §6 and the Decisions §6 entry – lives undervendor/dash/patches/as two.patchfiles:0001-execve-return-synthetic-pid.patchpropagates the synthetic pid up throughtryexec()/shellexec()(theexecve()call site), and0002-vforkexec-adopt-synthetic-pid.patchadopts it at thevforkexec()fork-exec site. Cumulative diff 45 changed lines (< 50). dash’s inter-calldup2/closebetween fork and execve already records throughlibcapos-posixand needs no per-call patching. Design evidence only: nothing compiles/runs at this slice; the C-build and shell-smoke slices below prove the behavior. - C-build pipeline for vendored multi-file C sources. Landed
(
posix-p1-4-c-multifile-build). The existingc-buildhelper compiles single-filedemos/*/main.csmokes againstlibcapos.a+libcapos_posix.a. dash is a multi-translation-unit C codebase; the Makefile gained the reusablecapos-c-multitu-elfdefine(instantiated with$(eval $(call ...))) that compiles a list of vendored.cfiles each to an object and links them withlibcapos_posix.a+libcapos.ainto a userspace ELF without dragging in an external libc. Toolchain remainsclang --target=x86_64-unknown-none-elf -nostdlib -staticper Open Question §11 and the libcapos C-substrate plan. Proven by the two-TUdemos/c-multifile/demo andmake run-c-multifile, which asserts a cross-TU computed line. - dash build pipeline (autotools config.h + host table generators).
Landed (
posix-p1-4-dash-build-pipeline,2026-05-26 05:11 UTC). The generic multi-TU rule runs noconfigureand no host generators, so the dash-specific prerequisites live undervendor/dash/capos/: a pinnedconfig.h(derivation + host-table caveat invendor/dash/VENDORED_FROM.md) andgen-tables.sh, which stages a patched source copy (keepingvendor/dash/srcbyte-identical) and runs dash’s six host generators (mktokens,mksyntax,mknodes,mksignames,mkbuiltins,mkinit). The Makefiledashtarget funnelsdash_CFILES+ the five generated tables throughcapos-c-multitu-elfagainstlibcapos_posix.a+libcapos.ain the-nostdincsysroot mode, producingtarget/dash/dash.elf(static, 0 undefined symbols, both Variant A fork-exec patches compiled in).make dashproves build + link; the runtime QEMU proof is the dependent shell smoke below. - File / directory I/O surface in
libcapos-posix. TypedFileClientandDirectoryClientwrappers landed incapos-rt/src/client.rsat commit747a8611(2026-05-16 20:07 UTC);FILE_INTERFACE_ID/DIRECTORY_INTERFACE_IDconstants are already incapos-config/src/lib.rs. Slice 3 added theFdBacking::File/FdBacking::Directory/FdBacking::Terminalvariants inlibcapos-posix/src/fd.rsat commitae58f936and the matching smoke. The current surface implementsopen,close,read/write(joining the existing pipe/UDP read/write dispatch),lseek,opendir,readdir, andclosedir;make run-posix-fileproves these through a live POSIX C process. File-backed fds now store the POSIX access mode fromopen():readrejectsO_WRONLY,writerejectsO_RDONLY,ftruncaterequires a write-capable fd, andO_RDONLY | O_TRUNCis denied before the resolver can reachDirectory.open.dup/dup2preserve the stored mode, and the recording-shimexecvepath grants a privateposix_fd_rightsmetadata pipe so inherited File fds reconstruct the same attenuation in the child fd table.make run-posix-open-smokeandmake run-posix-filecarry the same-process denial checks;make run-posix-execve-inherit-smokeproves the recording-shim inheritance path preserves read-only and write-only File fd modes. - Path resolver over a root
Directorycap. A resolver inlibcapos-posix/walks a path through a bootstrap-granted rootDirectorycap and returnsFile/Directoryresult caps via existing IPC cap-transfer machinery. A v0 per-process current-working- directory string (getcwd/chdir,libcapos-posix/src/cwd.rs) plus cwd-relative resolution foropen/opendir/stat/access/unlink/mkdirlanded (make run-posix-cwd);chdirstores only the normalized path string and drops the validated cap, so cwd inheritance across spawn is still deferred...is not collapsed: escape is prevented by the kernelDirectorycap’s lack of a parent edge, not a resolver clamp. TheNamespace/Storeresolver shape remains documented for a future real filesystem service. - Remaining file metadata calls.
stat,fstat,access, andunlinkremain fail-closed stubs until a dash call site requires the stablestruct statand remove-contract shape. - Stdio over
TerminalSession.FdBacking::Terminaladopting the bootstrap-grantedTerminalSessioncap as fd 0 / fd 1 / fd 2 when the manifest supplies one. Implements Open Question §7’s decision (canonical fd 0 backing =TerminalSession). The existing pipe-backed inheritance path stays in place forposix_spawn-driven pipeline children.posix_inherit_stdio()becomes a one-shot adopter for the terminal grant too. - Env vector +
getenv/setenv/putenv. Per-process env vector inlibcapos-posix, populated at startup from manifest rodata (a bounded env grant oninitConfig.init, mirroring thewasiEnv :Textbounded grant the WASI host adapter already uses for Preview 1environ_get). The eventual typedLaunchParameterscap remains a follow-on; the v0 env source is the manifest rodata grant. - printf / string subset. Implemented in
libcapos-posix:printf/fprintf/vprintf/vfprintf/snprintf/vsnprintf;memcpy/memmove/memset/memcmp;strlen/strcmp/strncmp/strchr/strrchr/strcpy/strncpy/strcat/strncat/strdup;atoi/strtol/strtoul; and the ctype subset (isspace/isdigit/isalpha/isalnum/tolower/toupper). Formatted output is bounded to the documented v0 integer / string conversions and width/precision caps; floating-point,fopen, stream buffering, and locale stay out of scope.make run-posix-printfproves the surface from a live capOS C process.libcaposalready exportsmalloc/free/calloc/reallocfor C consumers. - Signal stubs. Implemented in
libcapos-posix:signal/sigactionvalidate and store handlers in a per-process table but never deliver them;killfails closed withEPERMbecause this POSIX surface has no targetProcessHandleauthority;raisefails closed withENOSYSbecause self-delivery is not implemented.make run-posix-signal-timeproves the documented behavior from live capOS C process output. RealSIGCHLD/SIGTSTPdelivery and job control remain out of scope. - Time additions. Implemented in
libcapos-posix:time(2),nanosleep, andsleepreuse the existingTimercap path already used byclock_gettime/gettimeofday.make run-posix-signal-timeproves monotonic-since-boottime()output, boundednanosleep(), and one-secondsleep()from live capOS C process output. - Identity stubs. Implemented:
getpidreturns the stable capos-rt bootstrap pid for the current process, while the recording-shim child pid allocator stays above the caller’s pid for thewaitpidtable;getuid/getgidreturn the hardcoded single-identity uid/gid0.make run-posix-identityproves a parent and fork/exec child observe distinct process-visible pids from live capOS C code. isatty/getppid(closed2026-05-24 08:47 UTC). Both are pure-userspace dash prerequisites over the existing fd table – no kernel, cap, IPC, or schema change.isatty(fd)returns1for anFdBacking::Terminalslot,0witherrno = ENOTTYfor any other live backing, and0witherrno = EBADFfor an empty/closed slot.getppid()returns the v0 single-identity parent constant (1); no kernel parent handoff exists yet, so it is an honest stub alongside thegetpidsingle-identity path.make run-posix-isattyprovesisatty(0/1/2)=1over bootstrap-granted TerminalSession stdio,isatty(pipe_fd)=0 errno=ENOTTY, andgetppid=1from live capOS C process output.fcntl(closed2026-05-24 09:23 UTC). A pure-userspace dash prerequisite over the existing fd table – no kernel, cap, IPC, or schema change.F_DUPFD/F_DUPFD_CLOEXECduplicate into the lowest free slot>= argover the samedup_for_dup2alias pathdup/dup2use;F_GETFD/F_SETFDround-trip a per-fdFD_CLOEXECbyte;F_GETFLreports a stable access mode (O_RDWRfor Console/Udp/Pipe/Terminal, the storedopen()mode for File,O_RDONLYfor the read-only Directory);F_SETFLfails closed withEINVALwhen the argument carriesO_NONBLOCK(the v0 ring calls block withWAIT_FOREVER, so there is no non-blocking mode to switch into), except on UDP socket fds, where it is accepted-and-ignored for the vendored dns.c snapshot whose documented contract already drives deadlines from userspace; other status bits (e.g.O_APPEND) stay accept-and-ignore. UnknowncmdyieldsEINVALand a closed/out-of-range fd yieldsEBADF. CLOEXEC is enforced at recording-shimexecvetime: the full-fd-table inheritance walk skips a slot whose flags byte carriesFD_CLOEXECunless an explicit recordeddup2named that child slot.make run-posix-fcntlproves theF_DUPFD,10relocation, theFD_CLOEXECround-trip,F_GETFL=O_RDWRfor a pipe, and theEBADF/EINVALerror paths from live capOS C process output.- Manifest + smoke harness (landed
2026-05-27 09:36 UTC).system-posix-shell.cuegrants dash aTerminalSession(stdio), a bootstrap RAMDirectory(root), aProcessSpawner, and aTimer. Newdemos/ls-shim/one-binary listing helper wraps the inherited directory fd withfdopendir()(the smoke’s only allowed spawn target).make run-posix-shell-smoke+tools/qemu-posix-shell-smoke.shfeed a heredoc into the shell’s fd 0 – the shell creates two RAM-root entries, opens the directory as fd 3 (exec 3< /), runs/ls-shim, and printsdone– and assert thealpha/betaentry lines,done, two clean-exit log lines, the scheduler halt line, and clean QEMU exit. Thels-by-bare-name vs/ls-shimPATH-stat workaround uses the slash-bearing path, which the recording-shim spawn maps to the manifest binary name by basename. Stretch: extend the smoke tocat foo | grep barend-to-end, exercising the P1.3Pipeprimitive through a shell pipeline. Stretch closed (2026-05-27,posix-dash-pipeline-exec-reconcile): dash patch0004-pipeline-evexit-recording-shim.patchreconciles theEV_EXITin-placeshellexecpath with the recording shim (everyevalpipeelement takes that path, which the original patch set had left unreconciled), andlibcapos-posixgained wildcardwaitpid(-1)/wait3reaping.make run-posix-shell-smokenow drives the pipeline (match bar herefiltered through, four clean child exits). Seedocs/backlog/posix-adapter-dash-port.mdSlice 14 andvendor/dash/VENDORED_FROM.md. readbuiltin over fd 0 (landed2026-05-31 20:35 UTC,posix-dash-read-builtin-terminal-line). Proves dash’sread VARbuiltin consuming interactive input off its fd 0TerminalSessioncooked-mode line discipline – the one stdin path every prior smoke skipped (run-posix-shell-smokefeeds no stdin). No dash patch or libcapos-posix change was needed: dash’stcgetattr(0)-derived canonical buffering takes the plainread(0, ...)branch, which theFdBacking::Terminaladapter satisfies one line at a time.make run-posix-read-builtin(system-posix-read-builtin.cue+tools/qemu-posix-read-builtin-smoke.sh) echoes back the harness-fed linesgot=[hello world]/raw=[raw\back\slash](the second underread -r, proving the no-escape path). The harness handshakes each feed on dash’s own terminal output because the kernel line discipline has no inter-read input buffer and the UART carries no EOF. Seedocs/backlog/posix-adapter-dash-port.mdSlice 18.- Open question closures (Slice 1, closed
2026-05-24 00:53 UTC). Open Question §1 (dash 0.5.13.x candidate) and §7 (fd 0 backing =TerminalSession) are promoted to final decisions in this proposal’s “## Open Questions” section ahead of vendoring.
Recommended dispatch ordering: P1.1 -> P1.2 Phase A (schema + client,
landed) -> P1.2 Phase B (kernel UDP path + dns.c, landed) and P1.3
(Pipe cap + fork-for-exec, landed) in either order, since they no
longer contend on the schema serial surface -> P1.4 dash-port
successors. P1.4 itself does not touch schema/capos.capnp and so does
not contend on the shared schema serial surface.
Trust Boundaries
| Boundary | Native capOS service | POSIX-shaped C binary on capOS |
|---|---|---|
| Authority source | Process CapSet | Process CapSet projected through libcapos-posix fd table |
| Memory isolation | Page tables | Page tables (no wasm-style sandbox; libc has no extra runtime check) |
| Code integrity | W^X + NX | W^X + NX |
| Cap forgery | Kernel-owned CapTable | Same; the fd table is per-process userspace state, not authority |
| Resource limits | Kernel quotas | Kernel quotas; ulimit is ENOSYS |
| Side channels | Hardware-level (Spectre etc.) | Same hardware level |
A POSIX binary on capOS is more constrained than on Linux, not less. The adapter provides familiar function signatures, not familiar authority.
Validation
The first ports are not complete until they have QEMU evidence:
- A POSIX binary prints through a granted Console / TerminalSession.
- The same binary cannot use
writeto a fd it was not granted, cannotopen()a path outside its preopened namespaces, and cannot call an unimplemented POSIX function without receivingENOSYS. - A missing or wrong-interface cap lookup returns the documented errno (not a host-side panic, not silent success).
- An owned result cap is released deterministically when the binary exits.
- Each demo binary exits cleanly and does not wedge the kernel.
Host tests should cover errno mapping and the per-process fd table once those pieces are pure enough to test outside QEMU. Do not claim “POSIX adapter works” from host tests alone; the useful behavior is authority- shaped POSIX execution in capOS.
Open Questions
The following design decisions are documented as open questions because the planning phase recommends an answer but has not yet committed to one.
- POSIX shell candidate. Decided (P1.4 Slice 1,
2026-05-24 00:53 UTC): dash 0.5.13.x, vendored at a pinned tag undervendor/dash/. Rationale: smallest established POSIX-strict shell (~13 kSLOC, readable in full by the porting team), no readline/termcap dependency (it talks to whatever fd 0 gives it), and a single-purpose/bin/shposture that does not accidentally validate Bash extensionslibcapos-posixdoes not implement. Rejected: busyboxash(heavier embedded framework cost), oksh (ksh-superset, larger surface than v0 needs), toysh (incomplete upstream), and a custom Rust shell (it defeats the purpose of porting a real C program; the nativeshell/capos-shellalready exists). Vendoring, the Variant A patch, the multi-TU C build, and the shell smoke are later P1.4 slices (11-14). - DNS resolver candidate. Decided (P1.2 Phase A,
2026-05-05 18:02 UTC): dns.c (William Ahern), vendored at a pinned tag undervendor/dns-c-wahern/. Rationale: single-file MIT C (~10 kSLOC.cplus header), no Cargo/CMake build system, no configure script, no required I/O model (caller plugs the socket layer), and a track record as a reusable resolver core in production software outside libc. The license is capOS-compatible and does not force a transitive libc port. Rejected: musl libresolv – tied to the rest of musl’s headers, build, and__syscallshape; pulling it in either drags musl as a transitive dependency or forces a per-symbol carve-out that defeats the “single .c plus header” cost profile. Rejected: c-ares (configure-driven, ~3x larger, more invasive port). Rejected: GNU adns (GPL-2.0+ license question). Rejected: pure-Rust trust-dns (defeats the C-port purpose). - libcapos versioning and naming. The C library is just
libcapos(mirrors the Rustcapos-rt). Open question: should the POSIX layer belibcapos-posix(current recommendation), or a different name that avoids any Rust-side framework name collision? The C-side naming is settled; the POSIX-layer name remains an open question pending confirmation that no Rust framework will reuse thelibcapos-posixidentifier. Working answer: keeplibcapos-posixfor the POSIX layer. - POSIX errno representation. Decided (P1.2 Phase A,
2026-05-05 18:02 UTC): per-threaderrnocell exposed via__errno_location()– the standard POSIX shape. Storage lives inlibcapos-posix, owned by a thread-local cell accessed through a stableextern "C" int *__errno_location(void);function so vendored ports (dns.c, dash, future C software) compile againsterrnoexactly as on Linux/musl. Rust internals keep the typedCapError/CapExceptionshape; one bidirectional mapping at the C boundary writes theintvalue into the TLS cell so internal callers cannot invent unmapped values. Rejected: per-fd error field – breaks source compatibility with every POSIX program that readserrnoafterread/recvfrom/open, requires every vendored port to be patched, and provides no isolation gain over the per-thread cell that the cap layer already exclusively writes. - File descriptor table location. Decided (P1.2 Phase A,
2026-05-05 18:02 UTC): static-array fd table inlibcapos-posixwith a small fixed cap (target: 32 open fds per process for v0). Rationale: the lookup is one bounds-check + one array index in userspace with no syscall; the kernel keeps zero knowledge of fds, so capOS authority remains exactly the per-processCapTableand is not duplicated in a parallel kernel-side fd map. The fixed cap matches the surfaces dns.c (single fd) and a v0 shell port (a handful of stdio + pipe fds) actually exercise. Rejected: capability-table-backed fd map that resolves fd numbers through the process cap table – larger blast radius (fd churn would touch the kernel cap table on everydup/close), and the cap-table object id is already a userspace-visible handle throughOwnedCapability, so a separate dense fd index in userspace is the right layer. The 32-fd cap can grow later (or migrate to a sparse representation) if a real consumer needs more, without changing the kernel surface. - Fork policy. Decided (P1.3,
2026-05-07 09:55 UTC; refined2026-05-07 10:30 UTCto dropsetjmp/longjmp): Variant A – the recording shim.fork()records “next exec is the real spawn” in TLS and returns 0 unconditionally.dup2()andclose()calls betweenfork()andexecve()route throughprocess::maybe_record_dup2 / maybe_record_closeand are not applied to the parent fd table.execve()consumes the recorded actions, dispatchesProcessSpawner.spawn()with the matching pipe halves moved into the child asstdio_<dst>grants, parks the resultingOwnedCapability<ProcessHandle>in a per-process table, and returns the synthetic child pid as its own return value (a deliberate v0 deviation from POSIX, whereexecveonly returns -1 on failure). The user pattern becomesint spawn_pid = execve(...); if (spawn_pid < 0) /* surface error to the parent's error path; do NOT _exit because the pseudo-child branch is still the parent */ ; child = spawn_pid;. After a successful Move-grant spawn the parent’s source fd slot is replaced with aFdBacking::Movedsentinel so a subsequentclose(src)(the dash-shaped pattern’s “I no longer hold the write end”) removes the sentinel without a kernel round trip. The earliersetjmp/longjmpdesign longjmp’d back tofork()’s call site afterexecve()had returned – the saved jmp_buf RSP/RIP pointed intofork()’s stack frame, which was deallocated whenfork()first returned, so the longjmp resumed inside a stale frame whose memory had already been reused bydup2/close/execve. A targeted dash patch is still required for the v0 contract:execve()returns the synthetic pid on success, where unmodified dash assumesexecve()only returns on failure (and falls into its post-exec error path). Variant A keeps that patch surface narrow – the change is the four-line “capture spawn_pid; bail on -1; assign back to child” snippet shown above per fork-exec call site, not a wholesale rewrite of the fork-dup2-exec pattern – and dash’s inter-call dup2/close still record into the spawn grants without per-call patching. Rejected: Variant B (patched-portposix_spawnonly) requires the port to consolidate every fork+dup2+exec sequence into a singleposix_spawncall with explicitposix_spawn_file_actions, a much wider patch surface. A 2026-05-13 successor now exports directposix_spawn()over the same execve-backed action replay. Recording-shimexecveargv now travels through a privateposix_argvPipe grant; directposix_spawnargv/envp remain ignored until LaunchParameters / environment support lands. - fd 0 backing for the shell. Decided (P1.4 Slice 1,
2026-05-24 00:53 UTC): the canonical fd 0 / 1 / 2 backing for the v0 dash smoke isTerminalSession– the natural mapping (read line + cooked-mode line discipline already exists in kernel and migrates to userspace at networking Phase C). For the DNS resolver fd 0 is unused and stays unmapped. The backing is realized by theFdBacking::Terminalvariant inlibcapos-posix/src/fd.rsplusposix_inherit_stdio()adopting the bootstrap-grantedTerminalSessioncap, mirroring the existing pipe-inheritance path; that implementation already shipped under P1.4 Slice 5 and is proven bymake run-posix-stdio-terminal-smoke. This slice only records the backing choice as final. - UDP cap surface scope. Decided (P1.2 Phase A,
2026-05-05 18:02 UTC): four-method blocking shape that mirrors the existing TCP cap pattern, with the wait deadline owned by the ring client (not the method parameter list). Methods:NetworkManager.createUdpSocket(localAddr :Data, localPort :UInt16) -> (socketIndex :UInt16)– bind a UDP socket to the given local(addr, port)(localAddrempty selects the configured interface;localPort = 0selects an ephemeral port). The result cap is transferred viasocketIndexin the CQE result-cap list, matchingconnectTcp.UdpSocket.sendTo(addr :Data, port :UInt16, data :Data) -> (bytesSent :UInt32).UdpSocket.recvFrom(maxLen :UInt32) -> (addr :Data, port :UInt16, data :Data)– blocking, no in-method timeout. Same CQE-on-completion shape asTcpSocket.recv: the kernel parks the SQE until a datagram arrives. The caller bounds the wait through the existingRingClient::wait(call_id, timeout_ns)mechanism; dns.c-style retry/deadline loops drive that bound from userspace. If the caller wants to abort a parkedrecvFromearly, it issuesclose()on the socket; the parked completion then returns aDisconnected-classCapException. The v0 surface deliberately does not introduce a newTimeoutexception class, since none exists today (ExceptionTypecovers onlyfailed,overloaded,disconnected,unimplemented) and inventing one for a single method would expand the kernel error surface ahead of any consumer that needs to distinguish wait-expiry from generic disconnect.UdpSocket.close() -> (). Rationale: the blocking shape maps directly onto dns.c’s existing retry/timeout loop (dns.c does its own resend and deadline tracking, then issues a bounded blocking read backed by the ring wait), so the v0 port plugs in without a separate readiness/poll surface. The shape also reuses every primitive already present for TCP – ring-sidecap_enterparking, transferred result caps, client-sideRingClient::waitdeadline – so the kernel UDP path in P1.2 Phase B is a near-mirror of the TCP path. Rejected (deferred): readiness/poll-stylerecvFrom– the cap surface decision (one-shot wait vs an event stream over anEndpoint) is itself unsettled, has no live consumer, and adding a second wait shape now would force every port to choose. Add a separate readiness method (or a genericPollablecap) when a real consumer needs it, not before. Rejected: per-methodtimeoutNsparameter – creates two competing deadlines (the in-method timeout and the ring wait) that race on the same call, would require either inventing a newTimeoutexception class or overloadingDisconnectedambiguously, and is redundant with the ring wait the client already issues.
- Pipe cap design. Decided (P1.3,
2026-05-07 09:55 UTC): kernel-allocated bounded SPSC ring (4 KiB ceiling, default to the maximum) with EOF on close. The two halves share anArc<PipeShared>and store their direction; close on one side flips the matching closed flag and the per-tick poll completes the peer. Both halves implement the samePipeinterface (read / write / close / isClosed); the kernel rejects wrong-direction calls with afailedexception. Reader-closed surfacesbytesWritten = 0to the writer (the EPIPE-equivalent chosen to avoid expanding the kernelExceptionTypevocabulary). Writer-closed surfaceseof = trueto the reader after the buffered bytes drain. **Rejected: shared MemoryObject- userspace ring** because EOF across process exits and bounded waiter wake semantics need kernel-side state anyway, and the userspace path would still need a kernel cap to coordinate close races.
- argv / envp source. This proposal assumes a future
LaunchParameterscap delivers argv / envp through a typed cap. Until that cap lands,libcapos-posixcan carry argv / envp via a fixed well-known cap or rodata blob. Confirm gate-on-LaunchParametersversus ship-stub. - Linker / toolchain for C consumers. Recommended:
clang --target=x86_64-unknown-none-elf -nostdlib -static, link againstlibcapos.a(and optionallylibcapos-posix.a), reuse the existingcapos-rtlinker script. Confirm clang vs gcc and whether the track ships a sharedcc-glueCargo crate or a Make rule invokingccdirectly. - Vendoring policy. In-tree
vendor/dash/,vendor/dns-c-wahern/versus out-of-tree submodule versus separate repo. Working answer: in-tree vendoring with pinned tags, mirroring the plannedvendor/piccolo-no_std/shape from the Lua track. - Audit / measure-mode interaction. The
libcapos-posixwrappers must not break measure mode (themeasurefeature). Most wrappers only calllibcapos, which only callscapos-rt, which is already measure-mode-clean, so this should be free; confirm whether the track adds amake run-measuresmoke for onelibcapos-posixbinary as a regression gate.
Relationship to Other Proposals
- Userspace Binaries owns the broader native-binary, language, and POSIX-adapter roadmap. This proposal supersedes Part 4: POSIX Compatibility Adapter of that proposal with the full POSIX adapter design.
- Programming Languages is the
reader-facing summary of language support. The C row records the
shipped
libcapos.a+libcapos_posix.asurface (P1.1 + P1.2 + P1.3, plus the 2026-05-13posix_spawnsuccessor and Console-backed stdio slice). The POSIX-shaped software row cross-links this proposal as the long-form design source and records the P1.4 dash-port block onNamespace+Filecaps. - Networking defines
NetworkManager,TcpListener, andTcpSocketand defers UDP. The DNS resolver port in Phase P1.2 adds theUdpSocketcap surface; the TCP cap surface is reused unchanged. - Storage and Naming defines the
Directory/File/Store/Namespacesurfaces that the shell port consumes. Phase 2/3 of that proposal gates the dash file I/O surface. - Service Architecture defines
the future
Resolvercap that the resolver port eventually exports. - Shell covers the native
capos-shell. The POSIX shell port is for porting validation and does not replacecapos-shell. - WASI Host Adapter is the parallel untrusted-portable execution path. POSIX adapter targets trusted source-recompiled C; WASI adapter targets sandboxed wasm modules. Both share the per-process fd-table and per-import authority pattern.
- Lua Scripting is the
capability-scoped trusted-script path; PUC Lua’s native build assumes
a C substrate, so it eventually consumes
libcapos.
Proposal: POSIX fork/execve Full-fd-table Inheritance
Make the capOS POSIX fork+execve recording shim inherit the parent’s full
live fd table by default, honoring close-on-exec, so unmodified POSIX software
(the dash port is the headline case) gets working stdin/stdout/stderr and an
inherited cwd in its children without the application explicitly dup2-ing every
descriptor. This reverses the v0 explicit-grant-only default, which is the
inverse of real POSIX semantics, while keeping the capability model’s
no-ambient-authority guarantee.
Why this is needed
capOS has no real fork (no address-space copy, no shared open-file
descriptions). fork+execve is emulated by a recording shim
(libcapos-posix/src/process.rs): fork() opens a recording window and returns
0; dup2/close between fork and execve are recorded as deferred fd
actions; execve() replays them against a virtual child fd-view and forwards the
resulting fds as CapGrants through ProcessSpawner.spawn. The child
reconstructs its fd table from the named stdio_<N> grants
(libcapos-posix/src/fd.rs inherit_stdio_grants).
The v0 contract is explicit-grant-only: in spawn_path_with_actions, only
fd slots a recorded dup2/close touched become grants; untouched live slots
are deliberately not inherited (the touched array gate). This is the inverse of
POSIX, where a child inherits the parent’s entire fd table across
fork+execve – every descriptor not marked O_CLOEXEC/FD_CLOEXEC – sharing
the underlying open file descriptions.
The consequence is decisive for arbitrary POSIX software. Vanilla dash compiled
JOBS=0 does not dup2 stdio for a foreground external command – only the
FORK_BG path in vendor/dash/src/jobs.c (forkchild) manipulates fds. So
dash -> ls-shim replays an empty action list and hands the child an empty
CapSet: no stdout to print to, no Directory to list. This is not a dash bug;
it is correct POSIX behavior (the child is expected to inherit dash’s stdio). The
v0 shim’s inverted default breaks every POSIX program that relies on inheritance,
which is essentially all of them.
The project directive is explicit: do not solve this with per-app dash patches (posix-p1-4-dash-shell-smoke). A fd-inheritance fix that must be re-applied to every POSIX program is not POSIX compatibility. The correct fix is to make the recording shim inherit the full fd table by default, like real POSIX, reconciled with the capability model.
Current state vs target
| Aspect | Realized (done 2026-05-27) | Notes |
|---|---|---|
| Inheritance default | full-table: every open slot forwards unless FD_CLOEXEC or a non-forwardable backing | spawn_path_with_actions walks every open parent slot; recorded dup2/close are edits on the baseline |
FD_CLOEXEC | enforced: an implicitly-inherited CLOEXEC slot is dropped at execve forward time; open(O_CLOEXEC) sets the byte | an explicit recorded dup2 keeps its child slot (POSIX dup2 clears close-on-exec) |
| Terminal stdout | non-destructive: the recording shim forwards TerminalSession via SpawnGrantMode::Raw (process.rs Terminal arm) over the Copy/SameSession bootstrap cap | parent keeps its terminal across the spawn (proof make run-posix-fd-inherit-default); kernel mint proven by make run-posix-terminal-forward |
| Writable File/Directory | NonTransferable -> kernel rejects grant -> whole-spawn ENOEXEC | documented divergence (single-writer policy). v0 POSIX open mints only Copy/SameSession RAM/read-only caps, so none enters the fd table; a future writable-fs open path needs a pre-spawn transferability check to skip it non-fatally (follow-up) |
Directory fd (open("/")) | EISDIR; forwardable dir fd via dirfd(opendir()) (inherits by default under full-table) | open(dir, O_RDONLY) -> FdBacking::Directory landed (§5, posix-open-directory-fd); non-O_RDONLY stays EISDIR |
Target design
1. Full-fd-table inheritance default
execve() should forward the parent’s entire live fd table to the child,
not only touched slots. The recording shim already builds a virtual child_view
seeded from every open parent slot (spawn_path_with_actions); the change is to
remove the touched-only gate so the forward list is built from every
child_view[slot] == Some(parent_slot) entry, then apply the recorded
dup2/close actions as edits on top of that baseline. The replay order is:
- Seed
child_view[k] = Some(k)for every open parent slotk(already done). - Apply recorded
Dup2(src, dst)/Close(fd)actions in order (already done). - New: skip any slot whose parent fd carries
FD_CLOEXEC(see §2). - Build a forward for every remaining
child_view[child_slot] == Some(parent)entry – not onlytouchedones.
This makes the dash-> child case work: dash’s open stdio fds (0/1/2) flow to the
child by default, exactly as POSIX requires, with no dup2 from dash.
A subtlety the v0 forward list already half-handles: the one-parent-slot-per-
forward rule. Under full inheritance multiple child slots can legitimately map to
the same parent fd (e.g. dash’s fd 0 and a child’s inherited fd 0 are the same
open description). For non-destructive (Copy/Raw) backings this is fine – the
parent keeps its cap and each child slot gets an independent Copy. For
destructive (Move) backings (Pipe), the existing unique-owner / one-forward
rule must hold: a single Move’d Pipe end cannot legitimately appear under two
child slot names. The forward builder must therefore Copy-share where the backing
permits and reject only the genuine Move-aliasing case, rather than the v0 blanket
“one parent slot per forward for every backing type” rule. This is the main
behavioral subtlety to get right and test.
2. close-on-exec enforcement
FD_CLOEXEC is currently stored per-fd (fd.rs FD_FLAGS) but never acted on,
because the v0 explicit-grant model has no full-table walk to enforce it against.
Under full inheritance there is now a walk: at execve forward-build time, a
parent slot whose FD_FLAGS byte has FD_CLOEXEC set is not forwarded
(equivalent to the recorded-Close path for that child slot). This needs a small
read API on the fd module (e.g. fd::is_cloexec(slot)); the FD_FLAGS array
already exists. O_CLOEXEC passed to open() must set the same byte at open
time so the two surfaces agree. This is the POSIX-correctness half: inherit-all
without CLOEXEC enforcement would leak descriptors a correct program expects
closed (e.g. a listening socket dash opened for itself).
3. The TerminalSession-stdout problem (core decision)
Real POSIX dup-inherits the controlling terminal to all children
non-destructively: a shell keeps its tty while every child writes to the same
tty. The kernel precursor for this is now landed: the bootstrap TerminalSession
cap is minted Copy/SameSession (boot_cap_hold, kernel/src/cap/mod.rs) and
forwards non-destructively via SpawnGrantMode::Raw, proven by
make run-posix-terminal-forward (a parent forwards its terminal to a child and
both write distinct lines; the parent’s post-spawn write proves it kept its cap).
The remaining gap is on the POSIX side: the recording shim still forwards a
Terminal fd via destructive Move (process.rs Terminal arm) and must switch
to Raw under posix-recording-shim-full-fd-inherit. Until then, forwarding fd 1
when it is a TerminalSession would still strip the parent under the shim path.
Decision (kernel mint landed): mint TerminalSession Copy/SameSession,
matching Console, so it forwards via SpawnGrantMode::Raw non-destructively.
This is safe because
TerminalSessionCap (kernel/src/cap/terminal_session.rs) is a stateless unit
struct – it carries no per-session ownership state; write/writeLine
dispatch onto the shared kernel terminal, and readLine resolves caller context
at call time (call_with_context). The Move/ServiceRegrantOnly choice was a
policy default, not a state-ownership requirement. Minting it Copy/SameSession
lets the parent keep its terminal cap while each child receives an independent
Copy to the same shared terminal – which is exactly the POSIX
all-children-share-the-tty semantics, realized through the capability model
rather than against it.
Security/scope: Copy/SameSession keeps the cap from escaping the session (the
same scope Console already uses); a child gains no authority the parent did not
already hold (a write/read view of the same terminal it was already attached to).
requires_live_caller_session stays true, so the child’s readLine still
resolves against the child’s own live session context. This must be confirmed in
the kernel slice’s security review, including that a forwarded terminal cap
cannot outlive the session improperly and that line-discipline interleaving of
two writers (parent + child) is acceptable for the research surface (it is: the
shared kernel terminal already serializes writes; cooked-mode interleaving at
sub-line granularity is a known, documented research-surface limitation, not a
capability leak).
Alternative considered and rejected: a separate narrower TerminalWrite
write-only cap (interface-is-the-permission). This is cleaner long-term but
introduces a new interface, a new bootstrap source, a new FdBacking variant,
and child-side adoption – disproportionate for v0 when the existing
TerminalSession write surface is already the right shape and can be shared by a
mint-mode change alone. Recorded as future work if a write-only child terminal
view is later wanted.
4. Writable File/Directory single-writer tension
Real POSIX shares writable fds across fork (parent and child write to the same
open description). capOS’s disk-backed writable filesystem enforces a
fail-closed single-writer policy: writable File/Directory caps are minted
NonTransferable (writable_fs::transfer_result_cap), so the kernel rejects the
spawn grant and execve surfaces ENOEXEC.
Decision: keep writable File/Directory NonTransferable; document the
divergence. Under full inheritance this means a child does not inherit a
parent’s writable disk fd – execve must treat a NonTransferable backing as a
non-fatal skip (drop that one fd from the child, like CLOEXEC) rather than a
fatal ENOEXEC for the whole spawn. The v0 path made it fatal because the fd was
explicitly dup2’d (the app asked for it); under full inheritance the fd is
inherited implicitly, so failing the entire spawn because one incidental writable
fd cannot transfer would break unrelated programs. The honest divergence: capOS
shares the read path of the filesystem across fork (read-only caps are
Copy/SameSession) but not the write path, because the single-writer policy
is a deliberate capOS guarantee that has no POSIX analog. RAM scratch
Directory/File (the kernel:directory/kernel:file sources) are
Copy/SameSession and do inherit, matching the common shell-scratch case.
A future revocation-aware writable share (refcounted or session-scoped) is possible but out of scope; recorded as a follow-up. v0’s stance is: writable disk fds are not inheritable, skipped non-fatally, documented.
5. cwd Directory representation and inheritance
A shell’s children should be able to list/open the cwd without the app doing
anything special. A forwardable directory fd is obtainable both via
dirfd(opendir()) and, since posix-open-directory-fd, via
open(dir, O_RDONLY) (libcapos-posix/src/file.rs). Two parts:
- cwd as an inheritable Directory fd. Under full inheritance, if the shell
holds an open
FdBacking::Directoryfd for its cwd, it forwards to the child by default (read-only RAM/readonly_fsdirs areCopy/SameSession). The child’s libc cwd resolution can then target the inherited dir fd. This is the primary mechanism and needs no new surface beyond full inheritance. open(dir, O_RDONLY)-> Directory fd (landed,posix-open-directory-fd).openon a directory now installs aFdBacking::Directoryfd instead of failing:readreturnsEISDIR,writereturnsEBADF,lseekreturnsEISDIR, andfdopendirconsumes it. A non-O_RDONLYdirectory open staysEISDIR; a missing path keeps its original error (ENOENT). This covers theN</dirredirection path (dash redir usessh_open->open) without the bespokedirfd(opendir())dance. Proofmake run-posix-open-dir-fd. It was decoupled from the headline path, which never depended on it.
6. Backward compatibility and re-verification
Changing the default from explicit-grant-only to full-inherit interacts with the just-landed explicit-grant contract and existing smokes. What must be re-verified when the behavior slice lands:
make run-posix-pipe-smoke– relies on explicit pipe-end Move grants. Under full inheritance the parent’s other open fds (e.g. its terminal stdio) would now also forward. The pipe child must still see EOF when the parent closes the write end, and the parent must not lose its own terminal (fixed by §3). The recordedclose(write_end)still drops that child slot. Re-verify.make run-posix-spawn-smoke–posix_spawnwith explicit file actions. The file-actions path must still honor explicitdup2/close; full inheritance is the baseline the actions edit on top of. Re-verify.make run-posix-execve-inherit-smoke– the bespoke parent that explicitlydup2s a Directory/Console. Under full inheritance the explicitdup2s become redundant (the fds would inherit anyway) but must remain correct. Re-verify.make run-posix-stdio-smoke/run-posix-stdio-terminal-smoke– stdio backing selection. Re-verify.
The capability-purity argument is unchanged: full-inherit is not ambient
authority. The child inherits exactly the capabilities in the parent’s fd
table (the same caps under the same slots), nothing more. There is no global
namespace, no inherited credential, no kernel-side fd knowledge – the kernel
still only sees an explicit List(CapGrant) from ProcessSpawner.spawn. The
shim now computes that list from the full table instead of the touched subset;
the kernel’s transfer-mode enforcement (process_spawner.rs) still gates every
grant. A child can receive only caps the parent already holds and that are
transferable; NonTransferable writable caps are skipped, not smuggled.
Implementation path (decomposed)
The work splits into a kernel cap-mode slice and a libcapos-posix behavior slice, with one optional narrow slice, all gating the dash shell smoke. See the ready task records:
posix-terminal-session-forwardable(behavior, kernel, done 2026-05-27) – mintTerminalSessionCopy/SameSessionso it forwards non-destructively viaSpawnGrantMode::Raw. Precursor for the terminal-stdout half of §3. Proven bymake run-posix-terminal-forward.posix-recording-shim-full-fd-inherit(behavior, libcapos-posix, done 2026-05-27) – full-table inheritance default (§1),FD_CLOEXECenforcement (§2), non-fatal skip of non-forwardable backings (Udp / already-moved / shared Pipe) when implicitly inherited (§4), and Copy-share of multi-aliased non-destructive backings (§1 subtlety). The recording-shimTerminalarm now forwards Raw (non-destructive). Proven bymake run-posix-fd-inherit-default. ANonTransferablewritable backing stays a documented whole-spawnENOEXECboundary; the v0 POSIXopensurface mints no such cap, so the §4 non-fatal skip is realized for the backings that can actually arise.posix-open-directory-fd(behavior, libcapos-posix, done) –open(dir, O_RDONLY)->FdBacking::Directory(§5); non-O_RDONLYstaysEISDIR, missing path keepsENOENT. Proofmake run-posix-open-dir-fd. Was off the headline critical path.
posix-p1-4-dash-shell-smoke (docs/tasks/) depends on the first two;
once they land it can run with no per-app dash patch (only the generic, already-
landed Variant A fork-exec patch set and the slash-bearing /ls-shim invocation
to skip dash’s PATH stat, which is a documented dash-config choice, not a
capOS workaround).
Per-app patch stance
The directive forbids per-app dash patches that would have to be repeated for
every POSIX program. This design needs none: full inheritance is a generic
capOS-side fix in the shim. The only acceptable vendored-dash touch is a generic
POSIX-correctness item (the existing Variant A fork-exec coupling under
vendor/dash/patches/, owned by posix-p1-4-dash-vendor), not a per-app
inheritance workaround. The EV_EXIT in-place-exec residual
(posix-p1-4-dash-shell-smoke)
is the one remaining dash-specific item; it is a recording-shim “exec without
prior fork” limitation, handled in the shell-smoke slice (disable the
optimization or a bounded generic patch), not by this proposal.
Design grounding
libcapos-posix/src/process.rs(spawn_path_with_actions,fork,execve, the recording-shim contract),libcapos-posix/src/fd.rs(FdBacking,FD_FLAGS/FD_CLOEXEC,inherit_stdio_grants),libcapos-posix/src/terminal.rs,libcapos-posix/src/directory.rs,libcapos-posix/src/file.rs.kernel/src/cap/mod.rsboot_cap_hold(Console and TerminalSession bothCopy/SameSessionsince 2026-05-27),kernel/src/cap/terminal_session.rs(TerminalSessionCapstateless unit struct),kernel/src/cap/process_spawner.rs(validate_spawn_transfer_scope, transfer-mode enforcement).schema/capos.capnpProcessSpawner.spawn(... grants :List(CapGrant)),CapGrant,CapGrantMode.docs/proposals/posix-adapter-proposal.md(recording-shim Variant A, fd-backing-cap inheritance),docs/capability-model.md(interface-is-the-permission, transfer modes/scopes).docs/tasks/done/2026-05-27/posix-execve-capability-inheritance.mdanddocs/tasks/done/2026-05-26/spawn-grant-forwardable-readonly-directory.md(the landed explicit-grant inheritance this proposal generalizes), posix-p1-4-dash-shell-smoke (the premise conflict this resolves).
Design Proposal: Installable capOS System
This is a design proposal with its bounded local/QEMU path landed. The
persistent data-region mount, config-overlay schema + init compose/merge with
fail-closed fallback (make run-installable-overlay), generation/rollback
machinery (make run-installable-generation), integrated installable disk
(make run-installable-disk), target-disk install
(make run-installable-install), first-boot provision
(make run-installable-provision), and update/rollback flow
(make run-installable-update) are implemented. The storage and disk-image
prerequisites it builds on have also landed (see
Build-On Relationship):
block-device-backed read-only and writable filesystems, a persistent
content-addressed Store, reboot-surviving writable persistence, and a hybrid
BIOS+UEFI disk image. This proposal has been reconciled against those landed
contracts and is decomposed separately (see
Closeout And Decomposition).
Throughout, landed behavior is written in the present tense and planned
behavior in the future/conditional tense. The installed-system proof remains a
bounded local/QEMU result: it does not claim secure boot/signing, production
release authority, public ingress, AWS/Azure live support, direct-remapping
production hardware, userspace smoltcp/L4 readiness, or a persistent
Namespace.
Problem
The baseline capOS boot path is a boot-from-image research system. The build
packs a Cap’n Proto SystemManifest (compiled from system.cue) plus the
userspace binaries into an ISO; Limine loads the manifest as a module; the
kernel parses it, builds init’s bootstrap caps, and enters the single
initConfig.init process. The boot-binary ISO layout (behind the boot_iso
feature) can instead read ELFs on demand from /boot/bins/ so the manifest
carries names only. Without the installable-system path, the system that boots
is still exactly the image that was built: the next boot re-reads the same
immutable manifest and rebuilds the same capability graph.
That baseline is correct for a research image and insufficient for an installed system. An installed capOS is one that:
- boots from a local disk rather than a re-imaged ISO each time;
- carries mutable system configuration – installed services, local accounts, network/runtime settings – that persists across reboot and is not baked into the image; and
- can be updated to a new system generation and rolled back to a known-good one.
The hard question is not the disk format. It is how persistent, mutable system configuration composes with the immutable boot manifest without reintroducing ambient authority or a single mutable blob that can brick the system. That composition is the center of this proposal.
Non-Goals
- Designing the block device, filesystem, or
Storepersistence mechanisms. Those are owned by Storage and Naming and the storage tracks in Hardware, Boot, and Storage. This proposal composes them and must not redesign them. - Defining the local-account schema. That is Local Users, Storage, and Policy; the account store is a consumer of the persistent-config region designed here.
- Secure boot, image signing, and manifest trust. Those are tracked as storage-proposal Open Question #5 and the security/verification track; this proposal notes where a signature check would attach but does not specify the cryptography.
- Any cloud-image or non-ATAPI boot-binary loader work; see the Cloud Device Tracks backlog.
On-Disk Layout
The installed system needs three regions with distinct mutability and authority: a read-only boot region, an immutable-per-generation system region, and a mutable data region. How those regions map onto physical disks is the first reconciliation point, because the landed building blocks already fix part of it.
Landed shape:
make imageproduces a single hybrid BIOS+UEFI raw image with one GPT ESP (FAT32) carrying Limine + the kernel +manifest.bin. That is the boot region (tools/mkdiskimage.sh,make run-disk/make run-disk-bios).- The persistent content-addressed
Store(CAPOSST1) and writable filesystem (CAPOSWF1) are co-located in the data-region image produced bytools/mkstore-image --writable. Focused storage and early data-region smokes can still attach that image as a separate virtio-blk device. - The installable-system disk path has folded those regions onto one bootable
disk: GPT partition 1 is the ESP and GPT partition 2 carries the co-located
CAPOSST1+CAPOSWF1data region at a fixed base LBA read throughcap::data_region_base_lba(no GPT parser in the kernel).make run-installable-diskproves boot from that integrated disk. capos-system-installwrites the same layout to a manifest-selected targetBlockDeviceand then boots the installed disk standalone (make run-installable-install). Provisioning and update operate on the installed data region after that floor exists.
Region placement decision (reconciled). The storage model is the
co-located CAPOSST1 Store + CAPOSWF1 writable filesystem data region,
not a persistent Namespace and not three independent mutable partitions. The
separate data-region disk remains a focused proof packaging, while the installed
system packages the ESP and data region onto one target disk. The kernel relies
on the fixed tool/kernel data-region LBA contract for the installable path.
flowchart TD
InstallDisk[Installed disk] --> Boot[GPT partition 1: ESP with Limine + kernel + boot manifest]
InstallDisk --> DataRegion[GPT partition 2: fixed-LBA data region]
DataRegion --> System[CAPOSST1 content-addressed Store: immutable generation objects]
DataRegion --> Data[CAPOSWF1 writable filesystem: config/account state and markers]
Boot -.init mounts data region when present.-> DataRegion
Data -.active and known-good marker files name hashes.-> System
The system and data regions share the co-located data region: immutable
generation objects live in the persistent Store (CAPOSST1), and mutable
config/accounts plus the active/known-good pointers live in the writable
filesystem (CAPOSWF1). The overlay read/validate/merge that composes them at
boot has landed for the installable-system path (see below).
Boot region (read-only at runtime)
The single GPT ESP carrying Limine, the kernel, and the immutable boot
manifest – the same SystemManifest shape that exists today. This region is
what make image produces now (one hybrid BIOS+UEFI image, one ESP). At runtime
it is treated as read-only. The landed update proof stages and commits
generation objects in the data region; production boot-region rewrite,
rollback, signing, and release policy remain future work.
The boot manifest stays the root of trust for topology: it pins the kernel,
the init binary, and the minimum kernel-sourced caps init needs to bootstrap. In
the installable-system path, which system/config generation to activate is
named by writable-filesystem marker files and persistent Store hashes (see
Generations And Rollback); the SystemManifest
still carries no generation field. The boot manifest does not grow to hold
installed-service config or accounts.
System region (immutable per generation)
The landed persistent content-addressed Store (CAPOSST1, put/get/has/
delete keyed by SHA-256, durable across reboot) is the durable substrate for
immutable generation objects. The installable-system proofs exercise config and
account generation objects (SystemConfigOverlay plus related records) and the
marker-selected hashes that choose them. Package-manager-style system payload
generations – service binaries, default software configuration, and release
payload roots written into CAPOSST1 by a production updater – remain future
work.
The target system region is the system of record for what software is
installed, not a POSIX /usr, once those software-payload generation roots
exist. capOS has capabilities, not paths: a service is “installed” when the
generation root object binds its name to the content hash of its manifest
fragment and binary. Because the landed Namespace cap is RAM-only and does
not survive reboot, persistent name-to-hash bindings live inside generation
capnp objects in the Store and in writable-filesystem marker files, not in a
persistent Namespace cap (none exists). A Namespace may still be
populated in RAM at boot from those persistent bindings, and a StoreFS adapter
(storage proposal “Bridging the Two Models”) can expose a generation as a
directory tree for POSIX/WASI consumers, but the durable installable record is
the Store objects plus writable-filesystem markers.
Data region (mutable)
The landed writable filesystem (CAPOSWF1, full Directory mutation set +
File.write/truncate/sync/close under a fail-closed single-writer policy,
co-located with the persistent Store). It holds everything that legitimately
changes at runtime and must survive reboot:
- Persistent system configuration – the central subject below: capnp
overlay objects in the persistent
Storeplus writable-filesystem marker files for theactive/known-good pointers. - Local account store – the provision proof writes an operator account
record as a persistent
Storeobject and config overlay input; broader durable account policy remains owned bylocal-users-management.mdGate 3. - Per-account home/config/cache subtrees and service state.
The data region is mutated under capability authority only. There is no global
filesystem root and no ambient path-based access: a service receives a
writable-filesystem Directory cap (or a Store cap) scoped to its own subtree,
exactly as Storage and Naming
describes for attenuated grants.
Why not “a filesystem is the system of record”
A traditional install makes / the source of truth and layers config files,
package databases, and /etc over it. That is ambient authority through paths,
which capOS rejects by design (storage proposal, “The Problem with
Filesystems”). Here the capability object graph is authoritative; the
durable installable-system record is the persistent Store objects plus
writable-filesystem marker files, and a filesystem view is an adapter over that
authority rather than ambient authority itself. The on-disk bytes may be a
filesystem for tooling convenience, but the system model is capability-native.
Beyond-Boot-Manifest Configuration (Central Decision)
This is the core of the proposal. Today the system is fully described by the static boot manifest. An installed system needs persistent, mutable configuration that the boot manifest cannot carry, while keeping the boot manifest’s fail-closed guarantees.
The model is a two-layer composition resolved by init at boot:
-
Base layer – the immutable boot manifest. Pins the kernel, the init binary (the init mandate from Run Targets, Init Mandate, and Default-Run Integration Gate B still applies:
initConfig.init.binarymust beinit), the kernel-sourced bootstrap caps, and the floor of services and policy the system always runs. This layer is authoritative and cannot be overridden by persistent state – it is the recovery anchor. -
Overlay layer – the persistent config generation. A capnp-encoded configuration object naming additional installed services, local network/runtime settings, and account bindings. The object is content-stored in the persistent
Store(CAPOSST1); a well-known writable-filesystem path (proposedsystem/config/, aCAPOSWF1directory) holds the small marker files that name the current and known-good generation by content hash. The landedNamespaceis RAM-only, so this is filesystem-path + content-hash grounding, not a persistentNamespaceroot. This is whatcapos-system-install,capos-system-provision, andcapos-system-updatewrite in the landed local proofs.
Precedence and merge model
init reads the base manifest from BootPackage (as it does today). The overlay
step landed in 2026-05 (installable-config-overlay-schema-and-merge): when
the data region mounts, init reads system/config/overlay.bin, decodes the
SystemConfigOverlay capnp object, and – only if it validates against the
base’s declared SystemManifest.extensionPoints – composes its additional
services over the base plan (SystemConfigOverlay::compose_onto, proof
make run-installable-overlay). Generation selection also landed
(installable-system-generation-rollback): writable-filesystem marker files
select the active/known-good object hashes and provide failed-boot fallback.
The merge rules are deliberately conservative:
- Base pins win. Anything the base manifest declares (kernel caps, the init binary, floor services, policy floors) is not overridable by the overlay. The overlay may only add services and fill in settings the base marks as overlay-supplied. This prevents a tampered or buggy overlay from dropping a recovery service or widening authority.
- Overlay adds, within declared extension points. The base manifest declares
named extension points (e.g. “additional services”, “network config”,
“account store location”). The overlay may bind only those. An overlay key
that does not match a declared extension point is rejected, not silently
applied – closed by default, mirroring the existing “omitted cap sources fail
closed” invariant (
manifest-startup.md). - No new authority classes. The overlay can request services be started with caps the base manifest already authorizes init to delegate. It cannot mint kernel-source authority that the base did not grant init. The interface is the permission: an overlay names which already-authorized caps a service gets, never a new kernel cap source.
This is layering, not free-form override: the base manifest is a contract and the overlay fills declared holes in it.
Where persistent config physically and logically lives
- Physically: the data region – the landed persistent
Store(CAPOSST1) for the immutable per-generation capnp objects, and the landed writable filesystem (CAPOSWF1) for the smallactive/known-good marker files. - Logically: a
CAPOSWF1directory tree, e.g.system/config/, holds one marker file naming the current generation object by content hash (plus the retained known-good marker); the generation objects themselves are content-stored in the persistentStore. Account records live under a siblingsystem/accounts/tree the same way (consumed bylocal-users-management.md). There is no persistentNamespacecap; the RAMNamespaceis repopulated from these bindings at boot if needed. - Authority: only a narrow system-config authority (held by init and the
dedicated install/provision/update proof services) may write the
system/configwritable-filesystem subtree and the system-configStoreobjects. Ordinary services receive read-only scoped views or nothing. This is the writable-filesystemDirectory/Storeattenuation model, not a new mechanism.
Detecting and recovering from a bad persistent layer
The overlay is the most dangerous new surface: a corrupt or hostile overlay must never prevent boot. The design is fail-safe by construction:
- Validation before merge (landed). init validates the overlay against the
base manifest’s declared extension points and the schema before applying any
of it: a schema-invalid, version-mismatched, content-hash-mismatched,
stale-epoch, or extension-point-violating overlay is rejected whole
(
SystemConfigOverlay::from_capnp_bytes+compose_onto). Validation reuses the existingcapos-configdiscipline rather than a parallel checker. - Monotonic generation + integrity. No landed
Store,Namespace, orSystemManifestschema carries a system-generation/epoch field (other caps such asAccountRecordand the DDF revocation generations do, but not the installable-system path). The overlay instead carries a monotonicepochand a SHA-256contentHashinside its ownSystemConfigOverlaycapnp object (both landed in track item 3): the epoch is checked against the base’sminOverlayEpochfloor and the content hash is a self-consistency check. The writable-filesystem marker files that record which hash isactive/known-good landed in the generation/rollback path. This mirrors the stale-write and monotonic-version rules already required for the account store (local-users-management.mdGate 3) and the managed-cloud store (storage proposal “Managed Cloud Backing”) without extendingStoreorNamespace. A stale overlay (epoch below the floor) is rejected. - Boot-with-base fallback. If the data region does not mount, or the active overlay fails validation, init boots from the base manifest alone and surfaces the failure (serial diagnostics / audit). The system always reaches at least the floor configuration, which by construction includes a recovery path.
- Known-good generation pointer. The
activeoverlay pointer is advanced only after a generation is proven to boot (see Generations And Rollback); a failed new generation leavesactiveon the prior known-good one.
Install / Provision / Update / Rollback Flow
The local/QEMU install, provision, update, and rollback flows have landed. They prove the authority and durability shape over capOS capabilities; they do not claim a production release/update service, secure boot/signing, public ingress, or live multi-provider deployment readiness.
Install
The capos-system-install userspace service takes the packaged image source
from the booted CD-ROM /boot/bins/ tree and writes the installable layout onto
a manifest-selected target disk. It holds only the read-only
installable_image_source Directory and the target-scoped
block_device_target BlockDevice; it cannot reach the boot disk through that
target cap.
The service writes the boot-region head (BOOTHEAD.BIN: protective MBR,
primary GPT, FAT ESP with Limine + release kernel + base manifest), writes the
backup GPT (BOOTGPT.BIN) at the LBA named by the primary GPT header, and
initializes the empty data region (DATAIMG.BIN: empty CAPOSST1 Store +
CAPOSWF1 filesystem with system/config) at the fixed
cap::data_region_base_lba. It validates every sector range and verifies the
read-back before treating the install as complete. The empty data region is the
install floor; the first non-empty config generation is provisioning.
make run-installable-install proves the flow in two passes: pass 1 installs
into the target virtio-blk disk, and pass 2 boots that disk standalone with no
CD-ROM and reaches the base service with the data region mounted.
Provision
First-boot provisioning writes the initial persistent config: the
operator’s first local account record, network/runtime settings, and any
additional services to start. capos-system-provision runs as PID 1 over an
installed system’s persistent data region with only Console,
writable_fs_root, and persistent_store caps. On the empty install floor, it
writes the first non-empty SystemConfigOverlay generation, commits the
generation object to the Store, writes system/config/overlay.bin, and
advances the gen-active marker. Until provisioning runs, the system boots on
the base-manifest floor.
make run-installable-provision boots the same empty-config disk twice: pass 1
provisions the account/settings/additional service, and pass 2 re-reads the
active generation and account record from the data region to prove they survived
reboot.
Update
The landed update flow applies a new generation over the same persistent
Store + writable system/config region used by provisioning. The local proof
does not rewrite a production boot region or ship a signed release updater; it
proves staged generation commit, failed-candidate fallback, and base-overlay
revalidation.
- Write the new generation into the content-addressed
Storeas a new root hash; the old generation’s objects remain (content-addressing dedups shared objects). - Stage a new
active-candidate pointer; do not advanceactiveyet. - Reboot into the candidate. If it reaches a health checkpoint, commit by
advancing
active. If not, the boot-with-known-good fallback keeps the prior generation (see below).
Persistent config (the overlay and accounts in the data region) is carried across updates: the data region’s config/account generations persist across candidate staging, commit, and fallback. Where a new base no longer admits an overlay’s declared authority, the overlay is re-validated against that base and falls back to the base floor with a surfaced error rather than applying partially.
make run-installable-update boots the same empty-config disk three times:
boot 1 provisions known-good generation 1, rejects an overlay against a
revoked-cap base, and stages a healthy generation 2; boot 2 commits generation
2 across reboot and stages a failing generation 3; boot 3 auto-falls back from
generation 3 to known-good generation 2 while preserving the data region.
Generations and rollback
The active system/config generation is named by writable-filesystem marker
files (CAPOSWF1) carrying a content hash and monotonic pointer epoch – not by
a SystemManifest field, since the manifest schema carries no system-generation
field. The generation objects themselves are immutable content-addressed roots
in the persistent Store. Rollback is:
- System rollback: point the active system-generation hash back to the prior known-good generation. Because generations are immutable content-addressed roots, the prior generation’s bytes are still present; rollback is a pointer move plus reboot, not a re-extraction.
- Config rollback: point the
activeoverlay binding back to the prior overlay generation, retained for a bounded number of generations. - Failed-boot auto-fallback: a generation is promoted to known-good only
after it reaches a defined health checkpoint. A boot that does not reach the
checkpoint (kernel panic, init failure, overlay validation failure) is
detected on the next boot via a “boot attempt count vs confirmed” marker, and
the init/generation logic reverts to the last confirmed generation. This is the
standard A/B-generation pattern, expressed over content-addressed
Storeroots rather than two fixed partitions.
make run-installable-generation proves this machinery before the full update
flow: it stages a candidate, records a boot attempt before applying it, rejects
a stale pointer, proves config rollback to a retained generation, and
auto-falls back to the known-good generation across a fresh reboot when a
candidate is left unconfirmed.
Build-On Relationship To Landed And Planned Work
This proposal is an integration design over existing tracks. It must not redesign them. Current state of each piece it builds on:
| Building block | Owning track | Status today |
|---|---|---|
Persistent content-addressed Store | Storage and Naming | landed: CAPOSST1 superblock at LBA 0, put/get/has/delete keyed by SHA-256, durable across reboot (persistentStore grant source; reboot proof make run-storage-persist). RAM-backed Store CapObject + userspace RAM Store service also landed |
Namespace model | Storage and Naming | landed but RAM-only: resolve/bind/list/sub, not persistent (namespace grant source). No persistent Namespace cap exists |
BlockDevice boundary | Hardware, Boot, and Storage “Reusable Block-Device Path” / “Local Disk Storage” | landed: readBlocks/writeBlocks/info/flush over a real cfg(qemu) virtio-blk device (blockDevice grant source; proof make run-virtio-blk) |
Read-only filesystem over BlockDevice | “Local Disk Storage Milestone” | landed: CAPOSRO1 superblock, Directory.list/open/sub + File.read/stat, mutating methods fail closed (readOnlyFsRoot; proof make run-storage-fs) |
| Writable persistence across reboot | “Writable Local Storage Milestone” | landed: CAPOSWF1 writable filesystem at LBA 256, full Directory mutation set + File.write/truncate/sync/close, fail-closed single-writer (writableFsRoot; reboot proof make run-storage-writable). Co-located with CAPOSST1 via tools/mkstore-image --writable |
Bootable disk image (make image, make run-disk) | “Bootable Disk Image” | landed: single hybrid BIOS+UEFI raw image with one GPT ESP carrying Limine + kernel + manifest.bin; make image/run-disk/run-disk-bios; GCP/AWS provider packaging. The boot-binary ISO layout’s on-demand reads also landed behind boot_iso |
Boot manifest / SystemManifest / init mandate | Manifest and Service Startup, Run Targets, Init Mandate, and Default-Run Integration | landed: static manifest, init-owned service graph, name-only boot-ISO path. The installable path additionally reads and validates a persistent overlay only when the data region is mounted and the base manifest declares matching extension points (make run-installable-overlay) |
| Local account store (a consumer) | Local Users, Storage, and Policy Gate 3 | partially landed for installable proof: capos-system-provision writes and re-reads one operator account record through persistent Store/writable-filesystem state; full durable account policy remains future |
The storage and disk-image prerequisites have landed, and the bounded
installable-system composition has landed on top of them: overlay
read/validate/merge, generation marker files, install, provision, and
update/rollback flows all have local QEMU evidence. The decomposition task
(installable-system-decomposition)
required ddf-blockdevice-boundary-virtio-blk-smoke,
storage-readonly-fs-over-blockdevice, storage-persistent-store-reboot-proof,
storage-writable-persistence-reboot-proof, and disk-image-provider-packaging
to be done before emitting implementation tasks; they are. Because some
prerequisites landed with contracts that differ from this proposal’s original
projections (single hybrid ESP rather than three boot/system/data partitions,
RAM-only Namespace rather than a persistent one, no system-generation field on
the Store/Namespace/SystemManifest path), this proposal has been reconciled
to the landed shapes above so the track does not encode a stale contract.
Production hardening remains separate: secure boot/signing, authorized release
publication, public ingress, broader cloud-provider coverage, direct-remapping
production hardware, and full durable local-account policy are not implied by
the local installable-system evidence.
Milestone Framing
installable-system is its own milestone: “an installed, persistent capOS
that boots from disk and keeps mutable system configuration across reboots.” It
is a distinct, user-visible product outcome from the storage and bootable-disk
image milestones it builds on, even though it depends on them – a user can have
block devices, a filesystem, and a bootable disk image without having an
installed, self-configuring, updatable system.
This framing is recorded in Roadmap. The milestone became the selected milestone after Device Driver Foundation closed and is now closed for the bounded local/QEMU installable-system contract by the structural docs reconcile and the landed install/provision/update/rollback evidence. The successor selected milestone is the GCE self-hosted Web UI path; public ingress and TLS remain approval-gated follow-ups under that track.
Design Grounding
- Hardware, Boot, and Storage – Local Disk Storage, Writable Local Storage, and Bootable Disk Image tracks (the storage/boot prerequisites this design composes).
- Local Users, Storage, and Policy – manifest-seeded vs disk-backed accounts (Gate 3); a concrete consumer of the persistent-config region.
- Run Targets, Init Mandate, and Default-Run Integration – the init mandate and boot-manifest policy any installed-system boot path must respect.
- Storage and Naming
– the
Store/Namespace/Directory/Filemodel, content-addressing, attenuation, and the managed-cloud/stale-write rules the persistence layer reuses. - Manifest and Service Startup
and the
system.cue->SystemManifestboot path – the immutable base the persistent overlay composes with. - Cloud Deployment – the cloud disk-image/import path that an installed-system image must remain compatible with.
Closeout And Decomposition
This proposal is reachable from docs/SUMMARY.md, and the installable-system
milestone framing is recorded in docs/roadmap.md.
Turning this design into actionable backlog + implementation tasks is a
separate task,
installable-system-decomposition,
which decomposed the track against the landed
BlockDevice/filesystem/Store/writable-persistence/disk-image contracts in
Installable System. The
behavior track then landed the data-region mount, overlay compose,
generation/rollback machinery, integrated disk packaging, target-disk install,
first-boot provision, and update/rollback flows. This proposal has now been
structurally reconciled to those landed shapes: integrated installed disk
packaging over an ESP plus fixed-LBA data region, writable-filesystem +
content-addressed Store grounding for persistent naming and generation
markers, RAM-only Namespace, and no system-generation field on the
Store/Namespace/SystemManifest path. The proposal text and backlog track
therefore describe the same bounded local/QEMU contract.
Proposal: Resource Accounting and Quotas
Cross-cutting resource profiles, ledgers, reservation semantics, and verification gates for bounded capOS sessions, services, drivers, storage, networking, tests, and future language runtimes.
Related
- Authority Accounting records the current transfer and resource-accounting invariants.
- Memory Management documents the current frame-grant and MemoryObject accounting baseline.
- Go VirtualMemory Contract provides the first concrete virtual-reservation versus physical-commit ledger split for a future language runtime.
Problem
capOS already has several resource limits: cap slots, frame grants, timer waiters, thread and kernel-stack quotas, ring scratch, and spawn preflight checks. Those are useful but fragmented. Local accounts, guests, anonymous callers, external sessions, service accounts, drivers, storage services, network stacks, tests, and future runtimes all need the same rule:
No workload receives implicit unlimited consumption of finite system resources.
This proposal defines the common model. It extends the Security Verification
Track S.9 authority graph and per-process ResourceLedger design rather than
replacing it.
Principles
- A
ResourceProfileis a policy template, not authority. - Actual enforcement happens through ledgers, capability wrappers, brokers, supervisors, and kernel/resource-service admission checks.
- Every resource class has one ledger of record. Mirrors for status, metrics, or audit are derived views and must not be used for enforcement.
- Reservation happens before side effects. Commit publishes the resource. Release and rollback are mandatory on all success, failure, timeout, revocation, and process-exit paths.
- Identity metadata selects policy. It never consumes, releases, or bypasses quota by itself.
- Quota donation is explicit. A caller may donate budget to a service call, but a service cannot silently spend the caller’s unrelated budget.
Resource Profiles
Resource profiles are named templates selected by account records, manifest seed data, service policy, external admission rules, or test manifests. A profile should contain policy intent, not raw authority:
struct ResourceProfile {
profileId @0 :Data;
versionId @1 :Data;
epoch @2 :UInt64;
homeQuotaBytes @3 :UInt64;
tempQuotaBytes @4 :UInt64;
processLimit @5 :UInt32;
threadLimit @6 :UInt32;
capLimit @7 :UInt32;
memoryCommitLimitBytes @8 :UInt64;
frameGrantLimitPages @9 :UInt64;
memoryVirtualReservationLimitBytes @20 :UInt64;
endpointQueueLimit @10 :UInt32;
inFlightCallLimit @11 :UInt32;
retired12 @12 :UInt32; # was pending IPC submission quota; do not reuse
ringScratchLimitBytes @13 :UInt64;
logQuotaBytesPerWindow @14 :UInt64;
networkProfile @15 :Text;
cpuBudgetUsPerWindow @16 :UInt64;
cpuWindowUs @17 :UInt64;
timerWaiterLimit @18 :UInt32;
launcherProfile @19 :Text;
}
The profile is evaluated by a broker or supervisor. The result is a set of ledger limits, wrapper caps, service-specific budgets, and spawn constraints. Changing a profile does not change a running workload until a trusted service issues new limits, revokes old caps, or starts a replacement workload.
Current kernel coverage includes manifest profile decoding, spawn-time profile
resolution, per-process ring and reply scratch sizing, endpoint queue and
in-flight call limits for profile-created endpoint caps, child cap-table slot
limits, and per-process thread table limits. The QEMU proof
make run-resource-profile covers an in-limit spawn, an over-cap spawn
rejection before result authority escapes, rollback after that rejection, and a
thread-limit rejection through ThreadSpawner.create.
Ledgers of Record
The ledger of record depends on the resource owner:
| Resource | Ledger of record |
|---|---|
| Capability slots | Process CapTable / process resource ledger |
| Processes and child subtrees | Supervisor or ProcessSpawner ledger |
| Threads and kernel stacks | Process-owned thread/kernel-stack ledger |
| Anonymous virtual reservations | Address-space or VM service reservation ledger |
| Anonymous committed memory | Address-space or VM service ledger |
| Physical frames and frame grants | Frame allocator / holder ledger |
| MemoryObject mappings | Per-process frame-grant ledger plus address-space tracking |
| Endpoint queues | Endpoint object ledger |
| In-flight calls and result caps | Caller/callee transport ledger |
| Ring submissions | Fixed ring depth and per-dispatch budget; no profile ledger |
| Ring scratch and request buffers | Process ring/resource ledger |
| Timer sleeps and waiters | Timer service waiter ledger |
| Log bytes | Log service token bucket / retention ledger |
| Storage bytes and namespace entries | Store/Namespace service ledger |
| Temporary, cache, and home storage | Store/Namespace scoped sub-ledgers |
| Network listeners, sockets, and bytes | Network service or socket cap ledger |
| CPU share and runtime budget | Scheduler or scheduling-context ledger |
| DMA pool bytes, DMA buffer count, descriptor/ring depth, MMIO mappings, interrupt holds, in-flight DMA submissions | Device-manager ledgers, later |
| Model tokens, provider calls, tool calls | Provider/agent gateway ledgers, later |
No second module should maintain an independent enforcement counter for the same resource. A status service may cache values for display only if it treats the ledger owner as authoritative and never grants or rejects based on stale cache state.
Relationship To Tickless And Realtime CPU Authority
The CPU terms in Tickless and Realtime Scheduling reuse this resource-accounting model:
ResourceProfile.cpuBudgetUsPerWindow: coarse policy template only. Selecting a profile does not mint executable CPU-time authority.ResourceLedgerCPU budget: coarse best-effort accounting before realtime contexts exist, and the ledger of record for non-realtime CPU share/runtime limits.SchedulingContext: spendable CPU-time object for realtime or admitted execution. It carries budget, period, relative deadline, priority/criticality, CPU mask, and overrun policy.CpuIsolationLease: CPU placement, exclusivity, and noise/nohz authority. It is not CPU budget and must charge consumed runtime to aSchedulingContextor schedulerResourceLedger.NoHzEligibility/NoHzActivation: reviewed eligibility plus scheduler-proven current CPU state. They do not grant resource credit.RealtimeIsland: admitted bundle consumingSchedulingContexts plus memory, device, ring, and optionalCpuIsolationLeasereservations.
Do not create a second CPU budget system under nohz, SQPOLL, or realtime terminology. Those features select placement and execution mode; CPU time is still charged through scheduling-context or scheduler-ledger authority.
Reservation Lifecycle
Every resource allocation follows the same lifecycle:
reserve(request, limits, expected_state)
-> reserved(token)
-> denied(reason)
commit(token)
-> committed(resource)
-> rollback(token, reason)
release(resource)
-> released
Rules:
reservevalidates structure, bounds, ownership, and available quota before any externally visible mutation.commitpublishes exactly the resource that was reserved.rollbackrestores all ledgers touched by the reservation.releaseis idempotent from the caller’s perspective but changes ledger state at most once.- Process exit and cap revocation bulk-release all resources owned only by the exiting process or revoked hold edge.
- Stale handles, exhausted quotas, malformed limits, and unknown profile versions fail closed with typed errors or denials, not panics.
The Security Verification Track S.9 transfer transaction is the concrete model for cap transfer and spawn. Other services should reuse the same preflight, reservation, commit, rollback, and audit vocabulary.
Donation and Shared Services
Shared services handle many sessions in one process. They need bounded server-side state without treating caller identity as authority.
Donation is a lease from one ledger to another for a named operation:
Donation {
donorSessionId
donorLedgerId
receiverServiceId
resourceClass
amount
expiresAtMs
callId
}
A donation can pay for queue entries, scratch bytes, temporary storage, outbound bytes, model tokens, or CPU budget needed to serve one request. It does not grant unrelated authority to the service and does not let the caller spend the service’s own management budget. When the call finishes, times out, is cancelled, or the session exits, unused donation is returned and used donation is charged to the donor’s accounting record.
Services may also have their own base budgets for resident state. Per-client budgets and service base budgets are separate ledger entries so a single client cannot hide consumption inside the service account.
Profile Binding
Profiles are selected by policy inputs:
- manifest-seeded operators and recovery identities,
- local account records,
- service account records,
- guest and anonymous admission rules,
- external identity bindings,
- test manifests and QEMU smoke profiles,
- future driver, storage, network, and runtime launch policies.
The broker or supervisor translates those profiles into concrete limits at session creation, spawn, service start, or cap minting time. The translation must record:
- profile ID, version ID, and policy epoch,
- ledger owner and resource class,
- hard limit and optional token-bucket window,
- source policy and approving broker/supervisor,
- audit record ID for the grant,
- expiry or revocation epoch if the budget is leased.
A session can carry profile summaries for audit and display, but the summaries do not enforce quota. Enforcement lives where the resource is created or used.
Resource Classes
Kernel and Process Resources
Cap slots, process count, thread count, kernel stacks, ring scratch, outstanding calls, and endpoint queue entries are kernel or kernel-object resources. Ring submissions are bounded separately by the fixed SQ depth and the per-dispatch budget, so they do not have a profile quota. The remaining checks belong before spawn, thread creation, transfer, IPC, and ring dispatch side effects.
Memory
Current VirtualMemory mappings and held MemoryObject caps charge the
process-owned frame-grant ledger of record. The address space records borrowed
object-backed pages at the same tracking limit so unmap and teardown can
distinguish them from anonymous pages, but that tracking is not a second
enforcement counter. Future reserve/commit/decommit semantics split virtual
reservation from committed physical memory: VirtualMemory.reserve charges a
virtual-reservation ledger, while VirtualMemory.commit and compatibility
VirtualMemory.map charge the committed-memory/frame ledger before pages
become accessible. Decommit releases physical commit budget while preserving
virtual reservation budget until unmap.
Storage
Storage services own byte, object, namespace-entry, and snapshot ledgers.
home, config, cache, and tmp are separate sub-ledgers even when backed
by the same Store. Temporary session storage expires on logout or session
expiry. Cache quota may be reclaimed by policy. Home/config quota should not
be reclaimed without explicit account/storage policy.
Logging and Audit
Log volume uses token buckets and retention limits. Audit entries required for security state transitions should have a protected emergency path; ordinary application logs must not starve audit. If audit storage is unavailable, the system enters a bounded emergency mode rather than silently dropping mandatory security events.
Network
Network profiles select listener authority, outbound connection classes, socket counts, byte windows, and remote scopes. A normal local account may receive client network caps; listener authority requires service policy, operator policy, or an application-specific grant. Anonymous remote sessions receive only protocol state needed to authenticate or create an account.
CPU and Scheduling
CPU share and runtime budget belong to the scheduler or future scheduling context. Until full scheduling-context donation exists, CPU limits can be coarse token buckets and supervisor policy. Later realtime, media, and driver work should use explicit period/budget/deadline records rather than ad hoc sleep or polling loops.
Devices and Providers
DMA pools, MMIO mappings, interrupts, cloud provider calls, LLM tokens, media frames, and external API calls are scarce resources too. The first proof may use service-level ledgers, but the rule is the same: one ledger of record, typed reservation, explicit release, audit-visible denial.
For the Security Verification Track S.11.2 userspace-driver transition, device
ledgers must account at least DMA pool bytes, DMA buffer count, descriptor or
ring depth, MMIO mappings, interrupt holds, and in-flight DMA submissions. A
DMAPool reservation is not only memory allocation; it is also device-visible
write authority and must be released through the same revoke/quiesce/reset path
that makes future reuse
safe.
Canonical device ledger concepts:
dma_pool_bytes
dma_buffer_count
dma_descriptor_count
mmio_mapping_count
interrupt_hold_count
inflight_dma_submission_count
These fields are device-manager accounting concepts even if the first
implementation uses different internal names. They must have one ledger of
record. DMA pool bytes and buffer counts are not interchangeable with ordinary
MemoryObject ownership, because device-visible memory also carries IOVA,
descriptor, reset, and stale-completion obligations.
Failure Semantics
Quota failure is a normal result, not a crash:
| Condition | Result |
|---|---|
| Malformed request | Invalid input / typed transport error |
| Caller exceeds hard limit | Quota denied / overloaded |
| Service base budget exhausted | Service overloaded |
| Donated budget exhausted | Request denied or partial result |
| Stale profile version | Denied; refresh session/profile |
| Ledger mismatch or rollback failure | Enter recovery/emergency mode |
Retry policy belongs to the caller or supervisor. Kernel and service code must not spin, allocate unbounded retry queues, or emit unbounded diagnostics after quota failure.
Audit and Status
Auditable events:
- profile-to-ledger translation,
- reservation denial,
- successful budget grant,
- donation start/commit/release,
- cap or resource revocation,
- process-exit cleanup,
- rollback or recovery-mode entry,
- administrative profile change.
Status views should expose current usage, limits, denial counts, and suppressed diagnostic counts by resource class. They must redact sensitive account, network, provider, and object identifiers unless the viewer holds a suitable audit/status cap.
Verification Gates
Before treating resource profiles as complete for any caller class, add checks at the affected resource owner:
- Host tests for limit parsing, stale profile rejection, reservation/rollback, and one-ledger-of-record invariants.
- QEMU smokes proving quota denial for process/thread/cap, endpoint queue, timer waiter, memory, storage, log, and network resources as they exist.
- Hostile exhaustion tests that do not panic, leak frames, leak cap slots, or leave partial child processes.
- Process-exit and revocation tests proving all charges release exactly once.
- Audit/status tests showing denial and cleanup are visible without exposing secrets.
- Kani or property tests for small pure ledger primitives when bounds are fixed enough to model.
Relationships
- Authority Accounting: Security Verification Track S.9 defines the current authority graph and process-ledger transaction model. This proposal generalizes the quota vocabulary to services, storage, networking, sessions, and future devices.
- User Identity and Policy: account and session resource profiles select templates. Brokers and supervisors translate them into ledgers and wrapper caps.
- OOM Handling and Swap: memory commitment, reclaim, and swap policy are the memory-specific part of this model.
- Storage and Naming: Store/Namespace services own storage ledgers for homes, config, cache, tmp, snapshots, and imports.
- System Monitoring: status and metrics expose derived ledger views, not parallel enforcement counters.
Non-Goals
- No Unix cgroups clone as the primary abstraction.
- No identity-based quota enforcement in the kernel.
- No global mutable quota database trusted by every subsystem.
- No claim that existing code already enforces every resource class above.
- No unbounded best-effort mode for guests, anonymous callers, tests, or service accounts.
Open Questions
- Which ledger IDs and status schemas should become stable ABI first?
- How much CPU-budget enforcement is useful before scheduling contexts exist?
- Should quota donation be represented as a general capability type or as method-specific sideband on selected service calls?
- Which storage quota primitive is first: bytes, object count, namespace entries, or snapshots?
- Which proofs belong in
capos-libversus resource-service-specific tests?
Proposal: Memory Authority Model
capOS already has implemented memory management. This proposal defines the missing cross-cutting contract: which capability may create or hold memory, which memory may move or be reclaimed, when a mapping mutation is complete, and which proofs are required before shared-memory, DMA, swap, and language-runtime features build on that substrate.
This is deliberately not a CPU or language memory-ordering model. Atomic ordering, cache coherence, and Rust aliasing rules remain their own topics. This page covers OS memory authority, residency, consistency, and lifecycle.
Related
- Memory Management documents current physical
frames, address spaces, user-buffer validation,
VirtualMemory, andMemoryObject. - Go VirtualMemory Contract freezes the reserve/commit/decommit contract this proposal treats as the current anonymous-memory baseline.
- OOM Handling and Swap owns memory-pressure, OOM, reclaim, and encrypted swap policy. That proposal consumes the authority/residency vocabulary defined here; this proposal does not re-specify reclaim or swap policy.
- Capability-Based Service Architecture
owns authority-at-spawn, service composition, and the per-service capability
graph that selects which memory authorities (anonymous
VirtualMemory,MemoryObject, future pin/DMA/swap caps) a service receives. The classes and state machines below are the contract that service-graph budgeting and spawn-time memory grants must respect. - Resource Accounting and Quotas owns the one-ledger-of-record and reserve/commit/rollback/release vocabulary used by the accounting rules below.
- Design Risks Register R4 tracks the consolidated open gap in cross-service donation/fairness, log volume accounting, and the memory authority/residency proof obligations this proposal must close before downstream features may depend on them.
- DMA Isolation owns device-visible memory, IOMMU, stale DMA, and scrub-before-reuse requirements.
- Park Authority records why shared park words need mapping identity or object pins before they can be safe across processes.
- Memory Authority Model Backlog decomposes the research, design, and proof work behind this proposal.
Problem
The current tree has strong local rules, but they are spread across several documents and implementations:
- anonymous
VirtualMemoryranges separate reservation from physical commit; MemoryObjectcaps can be shared and mapped into multiple address spaces;- user-buffer copies validate and use pointers while holding the address-space lock;
- TLB shootdown is routed through address-space residency masks;
- private ParkSpace cleanup handles anonymous unmap/decommit and explicit
MemoryObject.unmap; - DMA isolation requires resident unswappable pages, generations, quiesce, and scrub-before-reuse;
- OOM policy rejects default overcommit and forbids an ambient OOM killer.
Those rules are individually useful, but future work needs a single answer to questions such as:
- Is this page only reserved, physically committed, resident, pinned, swapped, or device-visible?
- Which cap or ledger is the authority that made it that way?
- Can this backing frame be reclaimed, unmapped, shared, donated, or exposed to a device?
- When is it safe to reuse a frame after unmap, decommit, protect, process exit, revoke, or failed rollback?
- Which proof must exist before a shared park word, NIC ring, block buffer, Store blob, GPU buffer, Go heap arena, or swap slot depends on the rule?
Without a consolidated contract, later features can accidentally treat
MemoryObject, SharedBuffer, DMA pool pages, anonymous VM pages, and swapped
pages as interchangeable. They are not interchangeable authority classes.
Design Grounding
The project grounding for this proposal is:
docs/architecture/memory.mddocs/backlog/go-virtual-memory-contract.mddocs/proposals/oom-and-swap-proposal.mddocs/proposals/resource-accounting-proposal.mddocs/proposals/service-architecture-proposal.mddocs/design-risks-register.md(entry R4 – Resource accounting is fragmented)docs/dma-isolation-design.mddocs/architecture/park.mddocs/architecture/scheduling.mddocs/architecture/userspace-runtime.mddocs/proposals/storage-and-naming-proposal.mddocs/proposals/go-runtime-proposal.mddocs/security/verification-workflow.mdREVIEW.md
Relevant research grounding:
docs/research/zircon.mdfor VMO/VMAR separation, mapping authority, and VMO commit/decommit precedent.docs/research/genode.mdfor parent-routed sessions and resource quota discipline.docs/research/sel4.mdfor explicit authority, no ambient allocation authority, and the proof value of small stable object invariants.docs/research/eros-capros-coyotos.mdfor single-level-store and keeper lessons; capOS should not make transparent persistence or implicit paging the baseline by accident.docs/research/llvm-target.mdfor Go/runtime memory hooks that require reserve, commit, decommit, and fault behavior.
Goals
- Define the authoritative state vocabulary for user memory, shared memory, pinned memory, DMA memory, and future swapped memory.
- Make every memory-state transition name the capability and ledger that authorizes it.
- Specify when validation, mapping mutation, TLB shootdown, cleanup, and frame reuse are complete.
- Keep future
SharedBuffer, file, network, GPU, and DMA paths from bypassingMemoryObjectprovenance or residency rules. - Tie memory-pressure and OOM behavior to explicit budgets and process lifecycle, not allocator surprises.
- Convert this contract into host tests, QEMU smokes, Kani models, and review gates.
Non-Goals
- No demand paging or swap implementation in this proposal.
- No transparent global persistence or EROS-style single-level store.
- No copy-on-write clone design.
- No sub-VMAR or nested address-region capability in the first version.
- No generic userspace pager in the page-fault path.
- No change to CPU atomic ordering, language memory models, or Rust aliasing rules.
Memory Authority Classes
The model starts by naming memory by authority and residency, not by how it is convenient for one subsystem to access it.
| Class | Authority | Backing | Reclaim/swap | Notes |
|---|---|---|---|---|
| Reserved anonymous VA | VirtualMemory bound to one address space | No frame | Not reclaimable; no physical state | Charges virtual-reservation quota only. |
| Committed anonymous page | VirtualMemory plus frame-grant budget | Private frame | Reclaim/swap only under future explicit policy | VM_PROT_NONE is committed but inaccessible. |
Borrowed MemoryObject mapping | MemoryObject cap plus caller address space | Shared object backing | Not phase-1 swap | Mapping pins backing lifetime and charges mapping budget. |
| Capability transport page | Kernel bootstrap/ring setup | Kernel-owned frame mapped into user VA | Never swap | Ring and CapSet pages are reserved outside caller VM control. |
| Private kernel metadata | Kernel boot/process/scheduler/cap state | Kernel heap or frames | Never swap | Failure is kernel capacity pressure unless later object funding exists. |
| Reclaimable clean cache | Store, boot package, file, or loader service | Reconstructable backing | Drop/refetch, not swap | Must have trusted backing identity before reclaim. |
| Pinned user page | Explicit pin authority or future residency option | Resident frame | Never swap while pinned | Needed for shared park, DMA staging, and secret/locked mappings. |
| DMA-visible page | DMAPool/device-manager authority | Resident frame or IOVA mapping | Never swap; quiesce before free | Device bytes remain untrusted until driver validation. |
| Secret page | Future secret-memory authority | Resident private frame | Never swap; scrub aggressively | Stronger policy than ordinary pinned memory. |
| Swapped anonymous page | VirtualMemory plus swap budget and slot metadata | Encrypted/authenticated slot | Non-resident until fault | Future phase only; anonymous private pages first. |
Two rules follow from this table:
- A capability that maps CPU-visible bytes does not automatically grant device visibility, swap eligibility, pin authority, or persistence.
- A residency promise is an authority and accounting event, not a flag that a helper may set opportunistically.
State Machines
Anonymous VirtualMemory
Anonymous memory follows this lifecycle:
unreservedreservedcommitted inaccessible(VM_PROT_NONE) orcommitted accessibledecommitted reservedunreserved
Required properties:
reservecharges virtual reservation and installs no present user PTE.commitis all-or-nothing: frame allocation, ledger charge, page-table mutation, committed-page metadata, and deferred TLB completion reservation either all become visible or all roll back.protectchanges accessibility only for committed pages; it does not create backing.decommitreleases committed frames and physical charges while preserving virtual reservation.unmapreleases the reservation and any committed backing inside it.- ring and CapSet virtual ranges remain outside caller-controlled
VirtualMemoryoperations.
MemoryObject
MemoryObject backing has a separate lifecycle from any one mapping:
- allocated by
FrameAllocator - held through one or more process cap-table entries
- borrowed into one or more address spaces by
MemoryObject.map - unmapped or torn down from those address spaces
- released after the final cap and final borrowed mapping drop their holds
Required properties:
- held
MemoryObjectcaps charge holder frame-grant pages; - each borrowed mapping charges the mapper while it remains mapped;
- object-backed pages are tracked separately from anonymous reservations;
- unmap/protect must prove the affected range is backed by the same object;
- releasing a cap cannot leave uncharged live mapped backing behind;
- future direct read/write, slice, clone, file-backed, or DMA-backed subclasses must define how they differ from the current physically backed object.
Pinned and Device-Visible Memory
Pinned memory is not just a mapped page with a refcount. It is a promise that reclaim, swap, and frame reuse will not move or free the page until the pin is released.
The first reusable state machine should be:
residentpin_reservedpinnedpin_drainingresident
DMA-visible memory extends this with device ownership:
allocatedmapped_to_devicesubmittedquiescingdevice_mapping_removedscrubbedreleased
The device state machine must advance generations or epochs before reuse so stale handles, stale interrupts, and stale completions fail closed.
Future Swap
Swap is a later extension of committed anonymous memory:
committed residenteviction reservedencrypted slot written and authenticatedcommitted swappedfaulting restorecommitted resident
No phase-1 swap path may include MemoryObject backing, shared IPC pages,
ring/CapSet pages, DMA pages, kernel metadata, or secret pages. If swap-in
cannot restore data or authenticate the slot, the faulting process exits with
a typed OOM or corruption reason; the kernel must not fabricate contents.
Mapping Consistency
The consistency rule is:
A frame may return to the allocator, be exposed to a different authority, or be made device-visible only after every stale CPU and device observer has been invalidated or made irrelevant by generation.
For current CPU mappings this means:
- address-space mutation holds the address-space lock while checking and changing metadata;
- local page-table changes flush locally before returning to user mode;
- remote CPUs that have run the address space receive and acknowledge the required TLB shootdown generation before freed frames can be reused;
- kernel upper-half mappings that share a top-level PML4 entry with existing address spaces mutate only shared lower-level tables and use kernel-wide TLB shootdown; a new kernel-half PML4 slot must either be pre-seeded before user address-space creation or fail closed until a synchronized live-root propagation path exists;
- cleanup paths reserve deferred completion slots before mutation when they may need to free frames after shootdown;
- private ParkSpace waiters are interrupted before an unmapped virtual address can be reused for unrelated state.
For future shared keys, DMA, and swap this means:
- shared park keys must be derived from
MemoryObjectidentity and offset, or the backing object must be explicitly pinned for the duration of key derivation and wait registration; - DMA page reuse requires descriptor quiesce, interrupt/completion generation checks, IOMMU or bounce-buffer invalidation, and scrub-before-release;
- swapped pages carry slot generation/integrity metadata, and stale slots must not be accepted as current backing.
Accounting Rules
Every state transition must name exactly one ledger of record:
- virtual reservation pages: process/address-space VM ledger;
- anonymous committed pages: process frame-grant or future memory-commit ledger;
- held
MemoryObjectbacking: holder frame-grant ledger; - borrowed
MemoryObjectmappings: mapper frame-grant ledger plus address-space tracking; - pinned pages: future pin ledger owned by the pinning authority;
- DMA pool bytes, DMA buffers, descriptors, MMIO mappings, interrupt holds, and in-flight DMA submissions: device-manager ledger;
- swap commitment and swap slots: future swap/budget ledger;
- kernel metadata: kernel capacity budget until explicit object funding exists.
Status views, metrics, and audit logs may mirror these values, but they are not allowed to grant or deny resources from stale copies.
Failure Semantics
Capability-call boundaries return controlled failures:
- malformed ranges, overflow, unsupported protection, and authority mismatch return deterministic validation errors;
- local budget exhaustion returns quota denial or typed OOM;
- global pressure may attempt reclaim first if the memory class is eligible;
- rollback failure enters a bounded recovery or emergency path rather than publishing half-mutated authority.
Execution faults are process lifecycle events:
- access to unreserved memory is an ordinary fault;
- access to reserved uncommitted memory is a reservation fault, not implicit demand commit;
- access to committed
VM_PROT_NONEis a protection fault; - future failed swap-in or failed zero-fill terminates the faulting process, not a random victim.
Boot-time core allocation failures remain boot-fatal until capOS moves kernel object memory under explicit authority.
Proof Obligations
Before a feature depends on this model, it must add or cite the relevant proof:
| Area | Required evidence |
|---|---|
| Frame allocator | Host/Kani proof that allocator metadata frames are never returned, allocation is unique, and free rejects invalid or double-free inputs. |
| Anonymous VM | Host tests for overlap, split, rollback, quota, VM_PROT_NONE, decommit/recommit zero-fill, and process teardown. |
| TLB/frame reuse | QEMU or targeted instrumentation proving unmap/decommit/protect wait for required shootdown before frame reuse on resident CPUs. |
| User-buffer copy | Review and tests showing validation and copy/read occur under one address-space stability guarantee. |
MemoryObject sharing | QEMU proof that mapping, transfer, writes, unmap, release, and teardown preserve backing lifetime and accounting. |
| Shared park words | Proof that key derivation uses object identity and stale waiters cannot observe reused virtual addresses or object generations. |
| DMA | QEMU/host proof that stale handles, interrupts, and completions cannot affect new owners or freed buffers. |
| OOM/quota | Hostile exhaustion tests showing controlled errors, rollback, no leaked frames, and no partial cap publication. |
| Swap | Future proof for encrypted/authenticated slots, stale-slot rejection, pinned/secret/DMA exclusion, and faulting-process termination. |
Kani should stay focused on bounded pure primitives in capos-lib.
Hardware-dependent behavior such as TLB shootdown, page faults, and DMA
requires QEMU or targeted integration evidence.
Phasing
Phase 0: Contract and Inventory
- Land this proposal and the backlog.
- Inventory existing memory-state transitions and match them to the classes above.
- Identify code paths whose errors currently blur validation failure, quota denial, and global pressure.
Phase 1: Host-Testable Ownership Model
- Extract or mirror sparse anonymous reservation behavior into host-testable logic where useful.
- Add model tests for reservation split/merge, committed-page bookkeeping, borrowed mapping provenance, and one-ledger-of-record accounting.
Phase 2: Shared-Memory Provenance and Pins
- Define
MemoryObjectmapping identity sufficient forSharedParkSpace, service-owned shared buffers, and zero-copy file/network paths. - Add explicit object pins or mapping pins only where a locked copy/read is not enough.
Phase 3: Runtime Budgets and OOM Normalization
- Add spawn-time memory budget policy once the selected milestone sequence reaches resource-profile work.
- Normalize allocation failure results at capability boundaries.
- Add typed process exit reporting for memory faults and future OOM.
Phase 4: DMA and Swap Extensions
- Treat userspace drivers as blocked until DMA owner states, generations, ledgers, and stale-completion proofs exist.
- Treat swap as blocked until page classes, slot metadata, encryption, and fault lifecycle are explicit.
Open Questions
- Should capOS expose a first-class pinned-memory option on
VirtualMemory, or only through narrower future caps such asSharedParkSpace,DMAPool, andSecretMemory? - Should
MemoryObjectgain direct read/write and slice operations before file/networkSharedBufferAPIs, or stay mapping-only until a concrete service needs the API? - Which pure address-space interval logic should move into
capos-libfor Kani/host testing without dragging in architecture-specific page-table details? - What is the smallest status surface that exposes reservation, commit, resident, pinned, borrowed, DMA, and swapped counts without creating a second enforcement counter?
- How much kernel metadata should remain permanently reserved before capOS adds explicit object-funding authority for kernel allocations?
Bottom Line
capOS should treat memory state as capability authority plus ledgered residency. Anonymous reservations, committed frames, shared objects, pins, DMA pages, and swapped pages need distinct contracts. The near-term path is not to implement swap or a pager; it is to make the authority, accounting, TLB/device-observer, cleanup, and proof rules precise enough that future shared-memory and device work cannot accidentally bypass them.
Proposal: Stateful Task and Job Graphs
capOS should eventually have a small durable work-graph substrate: a way to describe, run, inspect, pause, retry, and resume stateful DAG-shaped work. It should serve four related needs without becoming a universal service manager:
- init-owned service startup, restart, and shutdown orchestration;
- IX-style package and build graph execution;
- operator-visible task lists with optional assignee, budget, and run state;
- notebook-like user stories where prose, commands, outputs, and rerun points are recorded as a narrative over real work.
The important design line is that the graph substrate is not the UI, not a shell, not a package manager, not a notebook runtime, not a service manager, and not a generic capability broker. It is the durable state machine beneath those tools.
Position
Adopt a WorkGraph model, but keep it narrow.
The core object is a versioned graph definition plus run instances:
- Graph definition: immutable, schema-validated structure: nodes, typed edges, resource hints, authority requirements, retry/cancellation policy, and expected artifacts.
- Graph run: one execution attempt of a graph definition, with node-run state, leases, logs, checkpoints, artifacts, and audit events.
- Node run: one executable, manual, or descriptive unit of work inside a run.
- Artifact: durable output, checkpoint, service export, log, report, or Store/Namespace reference produced by a node.
- Assignment: optional workload metadata: assignee principal, role, queue, priority, resource profile, deadline, and budget.
The common substrate is a schema/library/event-log pattern, not one global coordinator. Each domain owns its coordinator, executor queue, domain-node schema, validation, and authority:
- init owns init lifecycle state;
BuildCoordinatorowns IX build graph execution and job state;- an agent runner owns agent task state and workspace leases;
- a notebook/story service owns narrative projections;
- an operator task service owns human assignment state.
They may share graph/run/event/artifact shapes, but they do not share one authority-holding scheduler.
Everything above that is a facade:
- init sees service lifecycle and dependency state;
- IX sees package inputs, build steps, outputs, and Store commits;
- an operator sees a DAG-organized todo list with assigned work;
- a notebook sees cells, prose, rich outputs, and rerun checkpoints;
- an agent runner sees durable steps, memory/checkpoints, and review gates.
The same persisted run can have more than one projection. A failed package build can appear as an IX build failure, an operator task, a notebook section, and a graph node with logs. The core should not know which projection is being used. Cross-domain views should be read-only projections or explicit links to the owning run, not copied mutable event state.
Why This Belongs in capOS
capOS already has several graph-shaped systems:
initConfig.servicesis an init-owned service graph.ProcessSpawnerandProcessHandleprovide process lifecycle edges.capos-serviceneeds readiness, shutdown, drain, background work, resource reservations, and handoff hooks.- IX-on-capOS needs dependency-ordered fetch, extract, build, Store commit, and realm publish.
- agent and shell workflows need durable state when work crosses sessions, reviews, restarts, or context compaction.
Without a shared state model, each subsystem will grow its own partial orchestrator: init will have a service table, IX will have a build executor, agents will have task memory, operators will have ad-hoc todo state, and notebook-like demos will have their own cell/run records. That is duplication in the wrong layer.
With too much sharing, the substrate becomes a god object. The right answer is a shared run-state and dependency model with domain-specific executors.
Prior Art Baseline
Sources checked for this proposal:
- capOS IX research note: IX-on-capOS Hosting
- Upstream IX repository and executor source: https://github.com/pg83/ix, https://raw.githubusercontent.com/pg83/ix/master/core/execute.py
- Apache Airflow 3.2 DAG docs: https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html
- Dagster docs on software-defined assets, ops/graphs, jobs, schedules, and sensors: https://docs.dagster.io/, https://docs.dagster.io/guides/build/assets, https://docs.dagster.io/guides/build/ops, https://docs.dagster.io/guides/build/jobs, https://docs.dagster.io/guides/automate/schedules, https://docs.dagster.io/guides/automate/sensors
- Jupyter
nbformatdocs: https://nbformat.readthedocs.io/en/latest/format_description.html - LangGraph persistence docs: https://docs.langchain.com/oss/javascript/langgraph/persistence
The useful lessons are separable.
Airflow: a workflow run has task instances, dependencies, scheduling, retries, timeouts, documentation, and operational state. Airflow’s DAG object intentionally does not care what happens inside a task; it cares about order, retry, timeout, and execution conditions. capOS should copy that separation, but not the Python-file import model, global scheduler database, or operator/plugin surface.
Dagster: asset-first thinking fits capOS better than task-first thinking when the output is durable state. A Store object, package output, Namespace snapshot, boot manifest, built binary, benchmark report, or service export is closer to a Dagster asset than to an Airflow task. Dagster’s ops/graphs remain useful when work is not naturally an asset. capOS should adopt the split: assets are durable products; ops are execution steps; jobs are selections of work to materialize or run. Dagster itself is data-platform-shaped, so it is inspiration, not the implementation target for init.
Jupyter: notebook structure is a user story, not the kernel or init abstraction. Cells, prose, outputs, and metadata are excellent for reviewing a run, explaining why it happened, and rerunning a chosen step. They should be a projection over graph state. Cell order must not become the source of truth for service lifecycle or package builds.
LangGraph: checkpointed graph execution, threads, super-step boundaries, interrupts, and time travel are useful for agent-like and human-in-the-loop work. capOS should borrow the checkpoint boundary idea for resumability, but avoid binding the substrate to LLM message state.
IX: the package graph research is the strongest local precedent. IX’s current executor traverses a dependency graph by node outputs, applies pools, creates output directories, runs shell commands, touches sentinel files, and kills the process group on failure. That proves IX already has a real build graph. It also shows where capOS must stop: graph scheduling must not be fused to subprocess, Unix process groups, filesystem sentinels, hardlinks, symlinks, fetchers, archive extraction, or Store mutation. Those belong behind typed capOS services.
Core Model
The minimal model is:
struct WorkGraph {
graphId @0 :Text;
version @1 :UInt64;
nodes @2 :List(CommonNodeSpec);
edges @3 :List(EdgeSpec);
defaults @4 :GraphPolicy;
domainSchema @5 :UInt64;
}
struct CommonNodeSpec {
nodeId @0 :Text;
title @1 :Text;
inputs @2 :List(ArtifactSelector);
outputs @3 :List(ArtifactSpec);
requiredCaps @4 :List(CapRequirement);
policy @5 :NodePolicy;
assignmentDefault @6 :Assignment;
}
struct WorkRun {
runId @0 :Text;
graphId @1 :Text;
graphVersion @2 :UInt64;
state @3 :RunState;
nodes @4 :List(NodeRun);
events @5 :List(EventRef);
}
struct NodeRun {
nodeId @0 :Text;
state @1 :NodeState;
attempt @2 :UInt32;
assignment @3 :Assignment;
artifacts @4 :List(ArtifactRef);
checkpoint @5 :CheckpointRef;
}
This is a shape, not final schema. The stable part is the split between definition, run, node-run state, artifacts, and assignments.
Domain node meanings are not a shared NodeKind enum in the common schema.
Init may define InitServiceNode; IX may define FetchNode, ExtractNode,
BuildNode, StoreCommitNode, and PublishNode; a story projection may
define NotebookCellNode or ManualNoteNode. Those domain structs live in
domain-owned schemas or config sections and are validated by the domain
coordinator that holds the relevant authority. The common graph library may
hash, store, and index their association with nodeId, but it must not
interpret every domain’s node kinds.
Node State
Node state should be explicit enough for init, package builds, and operators:
planned: validated but not yet eligible.blocked: waiting on upstream nodes, an unavailable capability, resource budget, or manual input.runnable: dependencies are satisfied and a worker may lease it.leased: a worker or assignee owns the next attempt for a bounded time.running: execution has begun.waiting: running but blocked on a child process, readiness export, external event, manual approval, timer, or checkpoint resume.succeeded: produced the declared outputs or accepted terminal result.failed: terminal failure under current policy.retryPending: failed attempt will be retried under policy.skipped: intentionally not run because branch/condition policy selected a different path.canceled: canceled by caller, shutdown, superseding run, or authority revocation.paused: durable operator or policy pause.stale: graph version, cap epoch, input artifact, or session binding no longer matches the run’s assumptions.
State transitions should be append-only events. Services may compact state into snapshots, but audit and replay need a durable event boundary.
Edges
A plain DAG edge is not enough. capOS needs typed edge reasons:
dependsOnSuccess: downstream may run after upstream succeeds.dependsOnArtifact: downstream consumes a named artifact or Store ref.dependsOnReady: downstream waits on a service readiness export.dependsOnLease: downstream may run only while a lease/session is live.cancelsWith: cancellation propagates across the edge.shutdownBefore: shutdown order edge, usually reverse of startup.approvalFor: manual approval gates a node or subgraph.observes: node only observes another node’s state and does not block it.
The graph remains acyclic within one run. Loops are modeled by new runs, periodic schedules, sensors, retries, or explicit child graphs. This is a critical stop line: hidden cycles create service-manager behavior inside the graph engine.
Workload Assignment
Assignment is optional metadata, not authority:
struct Assignment {
principal @0 :Text;
role @1 :Text;
queue @2 :Text;
priority @3 :Int32;
budget @4 :ResourceProfileRef;
deadline @5 :TimeRef;
lease @6 :LeaseRef;
}
An assigned operator or worker may receive a lease to attempt a node. The lease does not grant broad system authority. It only grants the ability to claim or update that node-run through the coordinator, and any executable work still needs domain caps supplied by init, a build coordinator, a package worker, an agent runner, or another supervisor.
This makes the same graph usable as:
- a todo list where a human owns a manual node;
- a build queue where a worker owns a build step;
- an init run where PID 1 owns service lifecycle nodes;
- an agent plan where a worker owns a bounded workspace task.
Init As A Consumer
The user direction is important: this may be used for workload orchestration by init.
The current init path validates initConfig.services, spawns children through
ProcessSpawner, records exports, and waits. The first graph use should only
observe and structure that existing behavior:
- Compile
initConfig.servicesinto a graph definition. - Create a volatile boot
WorkRunin init memory. - Treat each service as a lifecycle node with the states current init can actually observe: planned, spawned, running/waiting, exited, or failed.
- Use typed edges for declared cap imports and manifest-order dependencies.
- Persist selected run events later through a Store-backed journal when storage is available.
Init does not need to become a general-purpose Airflow. It needs a durable or inspectable lifecycle table with graph semantics:
- what services were planned;
- what caps and exports they depend on;
- which services are spawned, running, waiting, exited, failed, or blocked under the current primitives;
- later, which services are restarting, draining, terminating, or ordered for shutdown once those lifecycle primitives exist;
- what operator-visible work remains.
Restart, drain, termination, readiness-export waiting, and shutdown-order control are later phases. They require primitives that are still future in the service and broker proposals:
- process termination or kill-tree semantics narrower than raw process-table authority;
- an explicit readiness/export contract for services;
- service drain or lifecycle caps for graceful shutdown;
- restart policy state that is disabled or narrowed during shutdown mode;
- stale export and stale process-handle behavior for restarted services;
- audit events that distinguish crash, restart, operator stop, shutdown, timeout, and stale-authority denial.
The generic graph code can be an init-internal library at first. If a separate
run-state service appears later, init should delegate only narrow read or
update capabilities to it. The separate service must not receive
ProcessSpawner, raw process handles, or service-owner caps merely because it
stores graph state.
IX Package Graph Consumer
IX should use the same run-state model with a different executor:
- package templates and descriptors produce graph definitions;
- fetch/extract/build/store/publish become typed nodes;
- inputs and outputs are Store or Namespace refs;
- build logs and output hashes are artifacts;
- package build workers lease executable nodes;
BuildCoordinatorowns scheduling, cancellation, queues, and job state;Fetcher,Archive,BuildSandbox,Store, andNamespacehold the real authority.
The graph substrate should not know how to fetch a URL, unpack a tarball, run
sh, or commit a Store object. It records that those typed steps exist, which
worker owns the attempt, what artifacts were produced, and whether the run can
resume or retry.
This preserves the IX research recommendation: use IX’s package corpus and content-addressed model without importing a CPython/POSIX executor boundary. It does not move IX job ownership into a global graph coordinator.
Notebook User Story
Jupyter is best treated as a user story:
- A notebook cell can map to a
note,manualTask,notebookCell,agentStep, orbuildnode. - Cell output is an artifact: text, table, image, log excerpt, benchmark summary, Store ref, or Namespace snapshot.
- Markdown/prose explains why the graph exists and how to interpret its state.
- Rerun means “create a new run or retry selected node(s) under policy”, not “mutate hidden cell global state”.
- Checkpoints let a user resume from a durable boundary.
The notebook layer may be CLI text, mdBook, a future web shell, or a rich UI. The core model should not depend on any of those.
Dagster Fit
Dagster is closer than Airflow for durable capOS work when outputs matter. For capOS, a software-defined asset maps naturally to:
- content-addressed package output;
- boot image or manifest;
- Namespace snapshot;
- benchmark report;
- generated code artifact;
- service export that becomes available after readiness;
- notebook output captured as a reproducible artifact.
Dagster’s ops and graphs map to executable steps. Its jobs map to selections of assets or ops to run. Its sensors and schedules map to run creation policies.
The mismatch is domain and authority. Dagster assumes a data-platform runtime, Python definitions, and external resources. capOS needs capability grants, typed service exports, process handles, sessions, Store/Namespace refs, resource ledgers, and boot-time constraints. The right move is not “run Dagster in init”; it is “use Dagster’s asset/ops/jobs distinction to keep the capOS graph model honest.”
Where To Stop
The main risk is building a god object. The graph substrate must not absorb every adjacent concept.
Stop at these boundaries:
- No kernel
WorkGraphcapability. The kernel provides primitive caps: process, memory, IPC, timers, devices, and storage plumbing. Graph state is userspace. - No global service discovery. A graph may reference capabilities granted into its runner or produced by its own nodes. It must not look up arbitrary services by global name.
- No ambient executor. Run-state code cannot execute arbitrary strings, scripts, Cap’n Proto calls, or binaries. A domain executor must hold the exact capabilities needed.
- No universal plugin ABI. Domain node kinds are typed in domain schemas. Unsupported node kinds fail domain validation rather than becoming untyped byte blobs.
- No authority laundering. Assignment, tags, labels, notebook cells, and graph edges do not grant authority. Only capabilities do.
- No UI state in the core. Notebook cells, DAG visual positions, comments, and todo-list grouping are projections or metadata.
- No package-manager logic in the core. Fetch, archive, build, Store, and Namespace operations stay in IX/build services.
- No init-specific policy in the core. Restart policy, shutdown order, and process termination are init or supervisor policy. The graph can record and drive them only through explicit runner methods.
- No hidden loops. Periodic work, sensors, retries, and agent iteration create new attempts or runs. One run’s execution graph stays acyclic.
- No unbounded event retention by default. Retention and compaction are policy fields, not accidental database growth.
If a feature requires any graph coordinator to hold broad ProcessSpawner,
DeviceManager, NetworkManager, Store, Namespace, Fetcher, shell,
or session authority for all domains, the design has crossed the line.
Service Split
The target split is:
flowchart TD
Lib[Shared graph schema and state library]
Log[Optional Store-backed event log]
Lib --> InitCoord[init-local lifecycle graph]
Lib --> BuildCoord[IX BuildCoordinator graph]
Lib --> TaskCoord[operator task graph]
Lib --> StoryCoord[notebook/story projection]
Lib --> AgentCoord[agent-run graph]
InitCoord --> InitLog[volatile boot run first]
BuildCoord --> Log
TaskCoord --> Log
StoryCoord --> Log
AgentCoord --> Log
InitCoord --> InitExec[init lifecycle executor]
BuildCoord --> BuildExec[build workers]
TaskCoord --> Human[operator/manual assignee]
AgentCoord --> AgentExec[agent worker]
InitExec --> Spawner[ProcessSpawner]
BuildExec --> Sandbox[BuildSandbox]
BuildExec --> Store[Store/Namespace]
AgentExec --> Workspace[Task workspace caps]
Only domain coordinators and executors hold domain authority. The shared code owns no authority beyond manipulating in-memory or Store-backed graph records through whatever narrow capability its caller already holds.
Persistence
Persistence should be incremental:
- Early init boot runs can be volatile.
- Build runs should persist event logs, logs, artifacts, and Store refs as soon as Store exists.
- Operator tasks and notebook stories should persist once user storage exists.
- Agent runs should persist checkpoints and review state, not raw hidden prompt state.
Store integration should use content-addressed objects for immutable outputs and an append-only or generation-checked log for mutable run state. Namespace snapshots can publish human-facing names for completed runs, package realms, or notebook reports.
Boot must not depend on a separate Store-backed graph service being available. If durable graph logging is unavailable during boot, init falls back to its volatile lifecycle table and emits diagnostics through its existing console/log path. Durable replay and post-boot inspection are degraded in that mode, but service startup must not fail solely because the graph log is unavailable.
Security Rules
- Node claims are lease-based and expire.
- Every state update is authorized by the current lease, graph owner, or a delegated control cap.
- Node output publication validates expected artifact type and size.
- Retrying a node must not reuse stale capabilities, stale sessions, or stale object epochs.
- Cancellation must release leases and ask domain executors to drain or kill work through typed lifecycle caps.
- Audit logs distinguish failure, cancellation, stale authority, denied authority, timeout, manual rejection, and superseded run.
- Resource budgets are reserved before execution and released on all terminal paths.
Staged Plan
Stage A: Init-Local Run Model
Add a pure capos-config or init-local graph/run-state library that can model
the existing initConfig.services startup order, service exports, and child
waits. Keep it volatile. Add host tests for graph validation and state
transitions.
Stage B: Init Lifecycle Projection
Teach init to expose or print an inspectable service run summary: planned, spawned, running or waiting, exited, and failed. Later summaries can add readiness, restart, drain, termination, and shutdown ordering after those primitives exist. This can remain a text proof before adding any new capability interface.
Stage C: Store-Backed Run Log
Once Store/Namespace is credible, persist run events and compact snapshots. This unlocks post-boot inspection, operator task state, and notebook stories.
Stage D: IX BuildCoordinator
Represent IX package builds as graph runs. Keep execution in
BuildCoordinator, BuildSandbox, Fetcher, Archive, Store, and
Namespace services.
Stage E: Operator Task Surface
Expose a shell or structured command surface for graph runs: list, inspect, assign, pause, resume, retry, cancel, approve, and show artifacts. This is the DAG-organized todo-list layer.
Stage F: Notebook Story Projection
Generate notebook-like reports from graph runs: prose, cells, commands, logs, artifacts, and checkpoints. Treat notebooks as reproducible run narratives, not as the owner of execution semantics.
Stage G: Agent Workflows
Use graph runs for long-lived agent tasks, review gates, workspace leases, memory checkpoints, and human approval nodes.
Validation
Each stage should have focused checks:
- pure host tests for state transitions and invalid graph rejection;
- init QEMU proof that existing service startup still works;
- later lifecycle-control proof that shutdown dependency order is obeyed, once terminate/drain/shutdown primitives exist;
- stale lease and stale cap epoch tests;
- IX differential tests against host-side IX planning where applicable;
- docs build to refresh topics and catch Mermaid/front matter errors.
Open Questions
- Should init embed the graph library permanently, or should it eventually delegate run-state persistence to a child service once storage is available?
- What is the smallest schema for
ArtifactRefthat covers service exports, Store refs, logs, notebooks, and package outputs without becomingAny? - Does
domainSchemaidentify only a domain schema version, or also the domain payload location and content hash for node-specific config? - How should schedules and sensors be represented without creating hidden cyclic runs?
- Which graph events deserve permanent audit retention versus compacted operational state?
- Should notebook projections use Jupyter
nbformatdirectly, or a smaller capOS-native story format that can export to notebooks later?
Recommendation
Build a small stateful graph substrate, but make it a run-state service or library, not a universal orchestrator.
For init, use it to make service lifecycle visible and eventually durable. For IX, use it to track package build graphs while execution remains in build services. For operators, project it as an assigned DAG todo list. For Jupyter, project it as a notebook-style user story. For agents, project it as durable task state with checkpoints and review gates.
The stop line is authority: shared graph code records state, domain coordinators schedule work, and typed domain services execute it.
Proposal: OOM Handling and Swap
How capOS should behave under memory pressure, what “out of memory” means at different boundaries, and how optional swap support fits the capability model.
Related
- Memory Management documents the current
implemented frame, page-table,
VirtualMemory, andMemoryObjectbehavior. - Go VirtualMemory Contract defines the near-term distinction between virtual reservation and physical commit that this proposal builds on.
- Resource Accounting and Quotas defines the ledger vocabulary used for memory-pressure policy.
- Memory Authority Model defines who may create memory commitments and how authority composes with the budgets this proposal charges against.
- Scheduler Evolution is the parallel
design for CPU-time authority:
SchedulingContextbudget/period/donation is the CPU-side analogue of the spawn-time memory budgets and per-process commitment ledger this proposal requires, and reclaim/swap-in fault paths must respect bound dispatcher budgets and depletion notifications rather than busy-waiting under pressure. - Design Risks Register R4 – Resource accounting is fragmented tracks the cross-proposal gap this design closes for the memory axis and routes related fragmentation (per-service fairness beyond thread weights, unified resource bundles, scratch-bytes/outstanding-calls/in-flight-call quotas) to its owning trackers.
Problem
capOS already has several local out-of-memory paths:
- boot-time allocation failures that are still fatal,
- service-facing operations that return a controlled error,
- rollback paths that free partially allocated state, and
- hostile-path tests that prove some frame-exhaustion cases.
What the tree does not have yet is a coherent memory-pressure policy. There is no system-wide answer to these questions:
- When should an allocation fail immediately vs. trigger reclaim?
- Which memory is reclaimable, swappable, or permanently pinned?
- What outcome should a process observe when a page fault cannot be satisfied?
- Who is allowed to decide that another process should die under memory pressure?
Without that policy, the codebase will drift into a mix of local conventions:
some paths return Overloaded, some return interface-specific failure text,
some remain boot-fatal, and future swap support would have no clear ownership
or threat model.
Design Goals
- No ambient OOM killer. The kernel must not scan the system for an arbitrary victim and kill it Linux-style.
- Explicit accounting. Memory exhaustion must be understood in terms of
budgets, commitments, and reclaimability, not just “the allocator returned
None.” - Typed failure semantics. Callers must be able to distinguish invalid requests, local budget exhaustion, transient pressure, and fatal page-fault failure.
- Fail closed. Memory-pressure code must not corrupt capability state, silently drop dirty data, or leave half-constructed kernel objects behind.
- Swap is optional. capOS must work without swap. Swap is a policy and deployment choice, not a baseline requirement.
- Security first. Swap must not become a secret-leak side channel or an integrity hole.
Non-Goals
- Transparent global persistence in the EROS sense.
- General-purpose overcommit as the default memory model.
- Swapping kernel metadata, capability rings, CapSet pages, or DMA-pinned memory.
- A userspace pager dependency in the first swap implementation.
Design Grounding
This proposal deliberately borrows from three existing design directions in the research set:
- Genode: strict memory accounting and quota donation are the right default because they avoid an ambient OOM killer and make responsibility obvious.
- seL4: explicit memory authority is preferable to a kernel that can create new backing objects out of thin air when under pressure.
- EROS / CapROS / Coyotos: do not make implicit persistent backing store the baseline. capOS already chose explicit persistence and should not back into a single-level-store design through swap.
The result is not a copy of any of those systems. capOS keeps explicit capability-granted memory objects and ordinary page tables, but adopts the accounting discipline that makes OOM behavior reviewable.
Core Policy
1. No Overcommit by Default
The default rule is simple: a process may only create anonymous memory if the system can charge that commitment to a real budget.
That means:
- anonymous
VirtualMemory.commitand compatibilityVirtualMemory.mapconsume committed-page budget, - anonymous
VirtualMemory.reserveconsumes virtual address-space quota only and does not promise physical backing, - resident pages consume real frame availability when they are instantiated,
- swap, when enabled, extends commitment capacity only for memory classes that explicitly allow it,
- and no interface may assume that a later background OOM killer will clean up a bad admission decision.
This follows the same principle as capability authority in general: if a child needs more memory, some parent or broker must have chosen to give it that room.
2. The Kernel Never Picks a Random Victim
When memory is tight, the kernel may:
- reclaim kernel-known clean caches,
- free resources from already-dead processes,
- swap out eligible anonymous pages,
- reject a new allocation,
- or terminate the faulting process when its own page cannot be restored.
What it must not do is kill an unrelated process just because it happens to be large. Cross-process eviction is a supervisor policy decision, not a kernel allocator side effect.
Supervisors remain free to implement their own policy. A shell/session broker or future service manager can decide to stop a child, reduce its budget, or restart it. That decision is explicit and auditable rather than hidden inside the low-level frame allocator.
3. Distinguish Four Memory Outcomes
capOS should treat these as different cases, not variants of one string:
| Situation | Required behavior |
|---|---|
Invalid request (size=0, misaligned range, quota metadata malformed) | Deterministic failed / request validation error |
| Caller exhausted its allowed budget | Deterministic overloaded or typed outOfMemory result |
| Global pressure, but reclaim/swap may succeed | Reclaim first, then retry locally |
| Faulting page cannot be restored or committed | Terminate the faulting process with an explicit OOM exit reason |
The important distinction is between synchronous API failure and asynchronous execution failure. If a capability call asks for more memory, it should get an error back. If a process touches a swapped-out page and the system cannot bring it back, there is no capability return value to encode. That must be a process-lifecycle event.
Memory Classes
The reclaim policy depends on what kind of memory is being discussed.
| Class | Examples | Reclaim policy |
|---|---|---|
| Kernel-reserved, unswappable | kernel heap, page tables, scheduler/process metadata, cap-table backing, ring scratch | Never swap; pressure here is a kernel-capacity problem |
| User pinned, unswappable | capability ring page, CapSet page, DMA buffers, wired mappings, key material, future mlock-style regions | Never swap; allocation fails if unavailable |
| Reclaimable clean cache | boot-package cache, future filesystem cache, executable pages that can be reloaded, clean read-only object pages | Drop and refetch rather than swap |
| Anonymous private swappable | ordinary heap/stack/anonymous VM pages that opt into swap | Swap-eligible if policy allows it |
| Shared/persistent object pages | MemoryObject, mapped content-addressed store pages, future file-backed shared memory | Not part of phase-1 swap; treat as reclaim/drop or keep resident based on object semantics |
Two rules matter here:
- Clean cache is not swap. If a page can be reconstructed from a trusted backing object without preserving dirty state, reclaim it by dropping it.
- Pinned means pinned. If a page participates in DMA, capability transport, bootstrap identity, or secret handling, treat it as unswappable unless a later design proves otherwise.
DMA pages are a pinned residency class with additional lifecycle constraints:
they must be committed before exposure to the device, resident for the entire
device-visible lifetime, unswappable while mapped by a DMAPool or IOMMU
domain, and scrubbed before release to another owner. Reclaim is not allowed to
make progress on a DMA page; pressure must surface as admission failure or
device-manager teardown.
Device-written DMA pages are untrusted input until validated by the owning
driver or network/storage stack. Pinning and residency prevent reclaim races;
they do not make device bytes trustworthy, nor do they grant ordinary
MemoryObject authority over the backing frames.
Failure Semantics by Boundary
Capability Calls
For explicit allocation requests, return a structured failure rather than panicking:
VirtualMemory.mapshould returnoverloadedor a typed OOM result when the request cannot be satisfied.ProcessSpawner.spawnshould continue the current direction: bounded parsing, fallible allocation,Overloadedon resource exhaustion.- Future interfaces where OOM is a normal domain outcome should prefer a typed union result rather than an exception string.
This is consistent with the existing error-handling proposal: temporary resource exhaustion is not the same thing as malformed input.
Page Faults
Page faults are different. A faulting instruction does not have a natural request/response channel. The policy should therefore be:
- attempt reclaim,
- attempt swap-out of another eligible page if that creates room,
- attempt swap-in or zero-fill for the requested page,
- if that still fails, terminate the faulting process with a typed exit
reason such as
outOfMemory.
That is not an ambient OOM killer. It is the equivalent of delivering an unrecoverable execution fault to the process whose own memory access could not be satisfied.
Boot
Boot remains a special case. If the kernel cannot allocate its own core heap, page tables, or init process, the system cannot proceed. Those failures remain boot-fatal until the architecture moves more kernel object memory under explicit authority.
This proposal does not pretend otherwise. It narrows runtime behavior first and only then pushes on the deeper architectural question of who funds kernel objects.
Budget Model
The long-term model should separate commitment from residency.
- Reserved virtual pages: address-space ranges the process owns but that do not yet promise physical backing. The Go allocator contract charges these to a separate virtual-reservation quota.
- Committed pages: memory the system has promised can exist for a process.
This is what
VirtualMemory.commit, compatibilityVirtualMemory.map, and future runtime heap growth should charge. - Resident pages: memory currently backed by a physical frame.
- Pinned pages: resident pages that reclaim and swap may not touch.
- Swapped pages: committed but non-resident anonymous pages with an encrypted slot on a swap area.
The detailed Go/runtime ABI for splitting virtual reservation from physical commitment is Go VirtualMemory Contract. This proposal’s no-overcommit rule applies at commit time, not at pure reservation time.
At spawn time, a parent or broker should be able to set a memory budget for the child. A minimal future shape is:
struct MemoryBudget {
committedPages @0 :UInt32;
pinnedPagesMax @1 :UInt32;
allowSwap @2 :Bool;
swapPagesMax @3 :UInt32;
virtualReservationPagesMax @4 :UInt64;
}
This budget does not require capOS to expose Linux-style cgroups. It is a capability-native admission contract between parent and child.
Swap Support
Position
Swap is useful, but only as a constrained extension of the non-overcommit model.
Swap must not mean:
- “pretend RAM is infinite,”
- “the kernel can now kill random processes later,”
- or “all memory classes are equivalent.”
Instead, swap means: some anonymous pages may be evicted to an encrypted backing area, subject to explicit budgets and page-class rules.
Phase-1 Swap Scope
The first swap implementation should be intentionally narrow:
- only anonymous private pages created through
VirtualMemory, - only for mappings that are explicitly swappable,
- no swapfiles,
- no filesystem dependency,
- no userspace pager in the fault path,
- no swapping of
MemoryObjectresult caps, shared IPC pages, or device/DMA memory.
That scope is small on purpose. Once the first swap implementation exists, expanding eligibility is easy; debugging a too-clever pager in the page-fault path is not.
Backing Store
Phase 1 should use a dedicated swap extent, not a regular file.
Reasons:
- a file-backed swap path drags in namespace, filesystem, metadata writeback, and deadlock questions too early,
- a dedicated extent is easier to bound and reason about,
- and encryption/integrity policy is cleaner when the medium is dedicated to swap slots.
Provisioning should happen through init or a future storage broker that discovers a block extent and passes it into a kernel configuration path.
Compression
Compressed swap caches are a reasonable later optimization, but not the first one to build.
Linux’s zswap design is a useful warning here: it keeps a dynamically sized
compressed pool in RAM and evicts from that pool to a backing swap device when
the pool reaches its limit. That can improve I/O behavior, but it also creates
another reclaim tier with its own sizing, hysteresis, and writeback policy.
capOS should not start there. Phase 1 should write eligible pages directly to the encrypted swap extent. A compressed in-RAM layer can be added later only after the basic swap accounting, eviction, integrity, and observability rules are stable.
Encryption and Integrity
Swap must be encrypted by default.
The crypto policy should match the existing key-management and volume-encryption direction:
- use a fresh per-boot ephemeral symmetric key that lives only in RAM,
- never persist that key,
- invalidate all prior swap contents on boot,
- authenticate every swapped page so stale-slot replay and random corruption do not silently produce attacker-controlled plaintext.
This has one deliberate consequence: hibernation is out of scope for the first design. Per-boot keys make resume-across-reboot impossible, which is the correct tradeoff for an early capability OS that does not yet have a full trusted suspend/resume story.
Page Eligibility
A mapping should carry an explicit policy bit or enum rather than forcing all anonymous pages into one bucket.
A future VirtualMemory.map shape should move from bare protection flags to
options that express residency policy:
enum MemoryResidency {
normal @0; # reclaimable, swap if allowed by budget
pinned @1; # must stay resident
secret @2; # resident only; zero aggressively; never swap
}
This is a better fit than inventing ad hoc “don’t swap this one page” special cases later for crypto heaps, broker secrets, or device buffers.
Fault Path Semantics
On a page fault to a swapped-out page:
- the kernel locates the slot metadata,
- allocates or frees a frame through reclaim,
- reads and authenticates the page,
- remaps the page,
- resumes the process.
If the slot cannot be restored because no frame can be made available, or the page fails integrity validation, the kernel terminates the faulting process with a distinct exit reason. It must not inject zeros, fabricate stale data, or retry indefinitely.
Why Not a Userspace Pager First
A pure userspace pager is attractive in theory but wrong as the initial step. The current kernel does not have the scheduler, storage, and fault-notification machinery needed to make page-fault RPC safe and bounded under memory pressure.
The first swap design should therefore keep the fault mechanism and slot metadata in kernel while keeping the provisioning and high-level policy outside the kernel where possible.
An external pager can remain a later phase once capOS has:
- notifications,
- richer process/thread lifecycle control,
- deadlock-resistant fault upcalls,
- and a storage stack that can be driven safely during memory pressure.
Interface and Lifecycle Changes
This proposal implies a few interface changes, even if the exact schema names change later.
Process Exit Reporting
Supervisors need to know whether a child:
- exited normally,
- hit a capability exception,
- faulted on memory corruption,
- or died because memory pressure could not be satisfied.
That argues for a typed exit record rather than flattening everything into one numeric code.
Spawn-Time Memory Budgets
ProcessSpawner should eventually accept resource limits, including a memory
budget, rather than assuming every child competes in one shared frame pool.
Monitoring
A future monitoring/status surface should expose at least:
- committed pages,
- resident pages,
- pinned pages,
- swapped pages,
- swap I/O failures,
- reclaim counts,
- and per-process OOM termination counts.
Without that, operators will not be able to distinguish “the child leaked heap” from “the kernel pinned too much unswappable state.”
Security Requirements
Memory-pressure code is security-sensitive, not just performance-sensitive.
Required properties:
- reclaim and swap metadata operations are bounded and fail closed,
- swap ciphertext is authenticated, not just encrypted,
- freed swap slots cannot be read by another process,
- secret/pinned mappings never spill to swap,
- swap enable/disable transitions do not expose stale plaintext,
- and pressure paths avoid allocation where possible.
The last point matters because allocating heap memory while handling OOM is how systems spiral into recursive failure and panic surfaces.
Relationship to Existing Proposals
- Error Handling: resource exhaustion should map to
overloadedor typed OOM results at explicit call boundaries, not generic panic text. - Service Architecture: parents and supervisors should own memory budgets just as they own capability grants.
- Storage and Naming: swap should use explicit backing extents, not ambient filesystem paths.
- Volume Encryption / Key Management: swap encryption uses a per-boot ephemeral symmetric key; persistent encryption keys are unnecessary for the first design.
Phases
Phase 0: Normalize Runtime OOM Semantics
- Remove remaining runtime panic surfaces on untrusted allocation paths.
- Distinguish boot-fatal OOM from service-facing
overloaded. - Add typed process-exit reporting for OOM and faulted swap-in.
Phase 1: Budgeted Anonymous Memory
- Add spawn-time memory budgets.
- Charge anonymous
VirtualMemory.commitand compatibilityVirtualMemory.mapagainst committed-page budget. - Charge anonymous
VirtualMemory.reserveagainst virtual address-space quota. - Mark pinned vs. swappable vs. secret mappings explicitly.
Phase 2: Reclaim Without Swap
- Add clean-cache reclaim and dead-process cleanup accounting.
- Expose pressure metrics and events.
- Keep allocation failure deterministic when reclaim cannot help.
Phase 3: Encrypted Kernel-Managed Swap
- Add dedicated swap extent provisioning.
- Add encrypted/authenticated page slots with per-boot ephemeral keying.
- Support swap for anonymous private pages only.
- Terminate the faulting process cleanly when swap-in cannot succeed.
Phase 4: Optional External Pager
- Revisit pager upcalls only after notifications, richer lifecycle control, and storage-stack maturity exist.
- Keep the kernel fault path bounded even if policy moves outward.
Open Questions
- Should capOS ever add demand commit on first access after the explicit
reserve/commitcontract, or should runtime allocators keep making commitment visible through capability calls? - Should executable anonymous pages be swappable in phase 1, or should swap be limited to writable anonymous pages until code-loading semantics mature?
- When
MemoryObjectgrows richer sharing semantics, should some subclasses be reclaimable-from-backing rather than unswappable? - Does a future
secretmapping need stronger guarantees than “never swap,” such as forced zero-on-fork, no-core-dump, and cache-flush hooks? - How much kernel memory should remain permanently reserved before the system starts admitting user commitments?
Bottom Line
capOS should treat OOM as an authority and lifecycle problem, not as a last-gap allocator surprise. The default system should use explicit budgets and no overcommit, return typed exhaustion at API boundaries, reserve process death only for unsatisfied execution faults, and add encrypted swap later as a narrow extension for anonymous private pages.
Proposal: Capability-Native System Monitoring
How capOS should expose logs, metrics, health, traces, crash records, and
service status without introducing global /proc, ambient log access, or a
privileged monitoring daemon that bypasses the capability model.
Problem
The current system is observable mostly through serial output, QEMU exit status, smoke-test lines, CQE error codes, and a small measurement-only build feature. That is enough for early kernel work, but it is not enough for a system whose claims depend on service decomposition, explicit authority, restart policy, auditability, and later cloud operation.
Monitoring is also not harmless. A monitoring service can reveal capability
topology, service names, scoped subject references, transport metadata, timing,
crash context, request payloads, and security decisions. If capOS imports a
Unix-style “read everything under
/proc” or “global syslog” model, monitoring becomes an ambient authority
escape hatch. If it imports a kernel-programmable tracing model too early, it
adds a large privileged execution surface before the basic service graph is
stable.
The design target is narrower: make operational state visible through typed, attenuable capabilities. A process should observe only the services, logs, and signals it was granted authority to inspect.
Current State
Implemented signal sources:
- Kernel diagnostics are printed through COM1 serial via
kprintln!, timestamped with the PIT tick counter. Panic and fault paths use a mutex-free emergency serial writer. - Userspace logging currently goes through the kernel
Consolecapability, backed directly by serial and bounded per call. - A Phase 1 capability log surface has landed:
LogSink/LogReaderover a bounded drop-oldest kernel ring (kernel/src/cap/log.rs), withSystemConfig.logLeveldrop enforcement at the sink, serial forwarding of accepted records, and scoped sink/reader caps granted at spawn (proof:make run-monitoring-log-smoke). Metrics, status, health, traces, crash records, the narrow kernel stats caps, and persistent retention remain future phases. - Runtime panics can use an emergency console path, then exit with a fixed code.
- Capability-ring CQEs carry structured transport results, including negative
CAP_ERR_*values and serializedCapExceptionpayloads. - The ring tracks
cq_overflow, corrupted SQ/CQ recovery, and bounded SQE dispatch, but these facts are not exported as normal metrics. ProcessSpawnerandProcessHandle.waitexpose basic child lifecycle observation, but restart policy, health checks, and exported-cap lifecycle are future work.capos-lib::ResourceLedgertracks cap slots, outstanding calls, scratch bytes, and frame grants, but only as local accounting state.- The
measurefeature adds benchmark-only counters and TSC helpers for controlledmake run-measureboots. SystemConfig.logLevelexists in the schema and is printed at boot, but there is no filtering, routing, or retention policy behind it.- An
AuditLogcapability exists in the schema and kernel (kernel/src/cap/audit_log.rs), used byAuthorityBrokerto record auth, setup, session, broker, and shell-launch events. Currently writes to serial viakprintln!; no ring-buffer reader cap or persistent retention yet. - A
HardwareAuditLogcapability with a bounded volatile ring buffer and drain/snapshot readers exists for DMA/MMIO/Interrupt cap lifecycle events (kernel/src/cap/hardware_audit.rs), including sequence numbers and dropped-record counts. A userspacehardware-audit-servicedrains it into a Store/Namespace-backed hash-chained segment ring and exposes scopedHardwareAuditReadersnapshots; the current backingStoreCapis RAM-backed, so post-reboot retention is still a storage-backend concern. hardware_release_logmodule (kernel/src/cap/hardware_release_log.rs) emits DMA pool, DMA buffer, DeviceMmio, and Interrupt release outcomes to serial; no reader cap or retention yet.
That means the system has useful raw signals and partial audit infrastructure but lacks a unified capability-shaped monitoring architecture with log routing, metrics export, and reader caps for most signal classes.
Design Principles
- Observation is authority. Reading logs, status, metrics, traces, crash records, or audit entries requires a capability.
- No global monitoring root.
SystemStatus(all),LogReader(all), andServiceSupervisor(all)are powerful caps. Normal sessions receive scoped wrappers. - Kernel facts, userspace policy. The kernel may expose bounded facts about processes, rings, resources, and faults. Retention, filtering, aggregation, health semantics, restart policy, and user-facing views belong in userspace.
- Separate signal classes. Logs, metrics, lifecycle events, traces, health, crash records, and audit logs have different readers, retention rules, and security properties.
- Bounded by construction. Every producer path has a byte, entry, or time budget. Loss is explicit and summarized.
- Payload capture is exceptional. Default tracing records headers, interface IDs, method IDs, sizes, result codes, and scoped transport identifiers only when authorized. Capturing method payloads needs a stronger cap because payloads may contain secrets.
- Serial remains emergency plumbing. Early boot, panic, and recovery still
need direct serial output. Normal services should receive log caps rather
than broad
Console. - Audit is not debug logging. Audit records security-relevant decisions and capability lifecycle events. It is append-only from producers and exposed through scoped readers.
- Pull by default, push when justified. Status and metrics are pull-shaped (reader polls a snapshot cap). Logs, lifecycle events, crash records, and audit entries are push-shaped (producer calls into a sink). Traces are pull with an explicit arm/drain lifecycle because capture is expensive. Each direction has its own cap surface; do not generalize one shape to cover all signals.
- Narrow kernel stats caps over one god-cap. The kernel exposes bounded
facts through several small read-only caps (ring, scheduler, resource
ledger, frames, endpoints, caps, crash) rather than one
KernelDiagnosticsthat grants everything. Narrow caps let an init-owned status service be assembled by composition, and let a broker lease a subset to an operator without handing over the rest.
Signal Taxonomy
Logs
Human-oriented diagnostic records:
- severity, component, service name, pid, optional subject/service reference, monotonic timestamp, message text;
- rate-limited at producer and log service boundaries;
- suitable for serial forwarding, ring-buffer retention, and later storage;
- not a source of truth for security decisions.
Metrics
Low-cardinality numeric state:
- per-process ring SQ/CQ occupancy,
cq_overflow, invalid SQE counts, opcode counts, transport error counts; - scheduler runnable/blocked counts, direct IPC handoffs, cap-enter timeouts, process exits;
- resource ledger usage: cap slots, outstanding calls, scratch bytes, frame grants, endpoint queue occupancy, VM mapped pages;
- heap/frame allocator pressure;
- later device, network, storage, and CPU-time counters.
Metric shape is fixed to three forms:
- Counter — monotonic
u64, reset only by reboot. Cumulative semantics make aggregation composable. - Gauge —
i64that moves both ways. Used for queue depths, free-frame counts, mapped-page counts. - Histogram — fixed bucket layout carried in the descriptor,
u64per bucket. Used for ring-dispatch duration, context-switch latency, IPC RTT.
Richer shapes (top-k tables, exponential histograms) are emitted as opaque typed payloads through the producer-scoped envelope described under “Core Interfaces”; the generic reader treats them as data, and a schema-aware viewer decodes them. Metrics should be snapshots or monotonic counters, not unbounded label streams.
Events
Discrete lifecycle facts:
- process spawned, started, exited, waited, killed, or failed to load;
- service declared healthy, unhealthy, restarting, quiescing, or upgraded;
- endpoint queue overflow, cancellation, disconnected holder, transfer rollback;
- resource quota rejection;
- device reset, interrupt storm, link up/down, block I/O error once devices exist.
Events are useful for supervisors and status views. They may also feed logs.
Traces
Bounded high-detail capture for debugging:
- SQE/CQE records around one pid, service subtree, endpoint, cap id, or error class;
- optional capnp payload capture only with explicit authority;
- offline schema-aware viewer for reproducing and explaining a failure;
- short retention by default.
This is the Ring as Black Box milestone from docs/tasks/README.md, not full replay.
Health
Declared service state:
- ready, starting, degraded, draining, failed, stopped;
- last successful health check and last failure reason;
- dependency health summaries;
- supervisor-owned restart intent and backoff state.
Health is not inferred only from process liveness. A process can be alive and unhealthy, or intentionally draining and still useful.
Crash Records
Panic, exception, and fatal userspace runtime records:
- boot stage, current pid if known, fault vector, RIP/CR2/error code where applicable, recent SQE context when safe, and last serial line cursor;
- bounded, redacted, and readable through a crash/debug capability;
- serial fallback remains mandatory when no reader exists.
Audit
Security and policy records:
- session creation, approval request, policy decision, cap grant, cap transfer, cap release/revocation, denial, declassification/relabel operation;
- no raw authentication proofs, private keys, bearer tokens, or full environment dumps;
- query access is scoped by session, service subtree, or operator role.
ITU-T X.700 Series Alignment
The ITU-T X.700 Systems Management framework (OSI management) predates modern observability stacks by two decades but still offers a cleaner decomposition than ad-hoc log/metric/trace categorization. capOS is not implementing CMIS/CMIP (X.710/X.711 assume ASN.1 BER over an OSI stack capOS will never speak); the value is the signal taxonomy and field model, not the transport.
| capOS signal class | Closest ITU-T | What we take from it |
|---|---|---|
| Logs | X.735 Log control function | Log record identity (moRef analog = component+pid+service_ref), severity mapping, scoped reader model. |
| Metrics | X.739 Metric objects and attributes | Fixed metric shapes (counter / gauge / histogram) as opposed to open-ended label streams. |
| Events | X.734 Event report management function | Discriminator-driven filtering, event-type taxonomy, producer/consumer separation. |
| Alarms (events) | X.733 Alarm reporting function | Perceived severity (cleared/indeterminate/warning/minor/major/critical), probable cause, specific problem, trend indication, proposed repair action. |
| Health | X.731 State management function | Operational / administrative / usage state model (enabled/disabled, unlocked/locked, idle/active/busy) feeding HealthState. |
| Audit | X.740 Security audit trail function | Audit record field model: event type, time, initiator, target, outcome, evidence chain. |
| Crash records | X.733 + X.736 Security alarm reporting function | Structured cause + severity for fatal/integrity events; security-relevant crashes flow through both the crash cap and the audit cap. |
FCAPS coverage. X.700/X.701 defines the five management functional areas: Fault, Configuration, Accounting, Performance, Security. This proposal covers Fault (crash records, alarms), Performance (metrics), and Security (audit). Configuration and Accounting are deliberately out of scope here:
- Configuration management (X.700 “C”) — versioned, signed
configuration deltas applied to running services. Partially covered
by
cloud-metadata-proposal.md(ManifestDelta) but capOS has no general configuration-management proposal yet. Candidate for a separate proposal once the manifest-executor and live-upgrade work stabilize. - Accounting management (X.700 “A”) — per-principal, per-session,
per-service resource-usage ledgers with retention and export. The
kernel’s
ResourceLedgeris the lowest layer; aggregation, persistence, and audit-grade usage records are undesigned. Candidate for a separate proposal; would compose with the audit cap and the user-identity session model.
Updated Field Mappings
LogRecord maps roughly onto X.735 logRecord:
X.735 logRecord capOS LogRecord
--------------- ---------------
logRecordId (cursor + pid + tick)
managedObjectClass component + service name
managedObjectInstance pid + service_ref
eventType Severity (lossy; add explicit
eventType once alarm/security
records share the pipe)
eventTime tick (monotonic; wall-clock when
available)
notificationIdentifier not modeled; add when events need
correlation IDs
Audit records should adopt X.740 fields explicitly. Proposed schema extension once the audit service ships:
enum AuditEventType {
# X.740 §6.1 event categories, pruned to what capOS actually records.
authentication @0; # login, logout, auth failure
accessControl @1; # grant, deny, revoke, transfer
policyDecision @2; # broker decision with plan + constraints
objectLifecycle @3; # capability create/destroy, object reap
securityAlarm @4; # X.736-shaped: integrity/confidentiality violation
serviceControl @5; # restart, upgrade, quiesce, resume
administrative @6; # manifest update, role change
}
enum AuditOutcome {
success @0;
failure @1;
denied @2;
pending @3; # multi-party approval outstanding
}
struct AuditRecord {
tick @0 :UInt64;
eventType @1 :AuditEventType;
initiator @2 :Data; # opaque principal/session ID
target @3 :Text; # interface + service identity
outcome @4 :AuditOutcome;
reason @5 :Text;
evidence @6 :Data; # opaque, bounded; no secrets
}
Alarms (X.733) are a structured subset of Events, not a new signal
class. The ServiceStatus / Health path emits alarms when degraded,
failed, or security-relevant thresholds trip:
enum PerceivedSeverity {
cleared @0;
indeterminate @1;
warning @2;
minor @3;
major @4;
critical @5;
}
enum ProbableCause {
# X.733 Annex A lists ~50 values; capOS starts with the handful that
# match known failure modes and extends as needed.
communicationsError @0;
integrityViolation @1;
operationalViolation @2;
softwareError @3;
underlyingResourceUnavailable @4;
qualityOfServiceAlarm @5;
securityAlarmIntegrity @6;
securityAlarmAccess @7;
}
struct Alarm {
tick @0 :UInt64;
managedObject @1 :Text; # service or cap identity
severity @2 :PerceivedSeverity;
probableCause @3 :ProbableCause;
specificProblem @4 :Text;
trend @5 :AlarmTrend;
proposedRepair @6 :Text;
}
The taxonomy buys two things the Unix-style “syslog + Prometheus + Jaeger” tower does not: (1) alarms as a first-class signal with a defined severity lattice and probable-cause field, which is how operators actually triage, and (2) audit as a distinct record type with fixed fields rather than a convention-layer over free-form log messages.
ITU-T references
- ITU-T Rec. X.700 (09/92) — Management framework
- ITU-T Rec. X.701 (08/97) — Systems management overview
- ITU-T Rec. X.733 (02/92) — Alarm reporting function
- ITU-T Rec. X.734 (09/92) — Event report management function
- ITU-T Rec. X.735 (09/92) — Log control function
- ITU-T Rec. X.736 (01/92) — Security alarm reporting function
- ITU-T Rec. X.740 (01/92) — Security audit trail function
- ITU-T Rec. X.731 (01/92) — State management function
- ITU-T Rec. X.739 (11/93) — Metric objects and attributes
Proposed Architecture
flowchart TD
Kernel[Kernel primitives] --> KD[KernelDiagnostics cap]
Kernel --> Serial[Emergency serial]
Init[init / root supervisor] --> LogSvc[Log service]
Init --> MetricsSvc[Metrics service]
Init --> StatusSvc[Status service]
Init --> AuditSvc[Audit log]
Init --> TraceSvc[Trace capture service]
KD --> MetricsSvc
KD --> StatusSvc
KD --> TraceSvc
Services[Services and drivers] --> LogSink[Scoped LogSink caps]
Services --> Health[Health caps]
Services --> AuditWriter[Scoped AuditWriter caps]
LogSink --> LogSvc
Health --> StatusSvc
AuditWriter --> AuditSvc
Broker[AuthorityBroker] --> Readers[Scoped readers]
Readers --> Shell[Shell / agent / operator tools]
StatusSvc --> Readers
LogSvc --> Readers
MetricsSvc --> Readers
TraceSvc --> Readers
AuditSvc --> Readers
The important property is that there is no ambient monitoring namespace. The graph is assembled by init and supervisors. Readers are capabilities, not paths.
Core Interfaces
These are conceptual interfaces. They should not be added to
schema/capos.capnp until the current manifest-executor work is complete and a
specific implementation slice needs them.
enum Severity {
debug @0;
info @1;
warn @2;
error @3;
critical @4;
}
struct LogRecord {
tick @0 :UInt64;
severity @1 :Severity;
component @2 :Text;
pid @3 :UInt32;
subjectRef @4 :Data; # privacy-preserving subject/session correlation
sessionRef @5 :Data; # optional scoped session correlation
serviceRef @6 :Data; # optional authorized service/component correlation
transportId @7 :Data; # debug-only ring/endpoint metadata, not identity
message @8 :Text;
}
struct LogFilter {
minSeverity @0 :Severity;
componentPrefix @1 :Text;
pid @2 :UInt32;
includeDebug @3 :Bool;
}
interface LogSink {
write @0 (record :LogRecord) -> ();
}
interface LogReader {
read @0 (cursor :UInt64, maxRecords :UInt32, filter :LogFilter)
-> (records :List(LogRecord), nextCursor :UInt64, dropped :UInt64);
}
LogSink is what ordinary services receive. LogReader is what shells,
operators, supervisors, and diagnostic tools receive. A scoped reader can filter
to one service subtree or session before the caller ever sees the record.
Monitoring terminology should use snake-case names in prose and map them to schema-style fields only at the Cap’n Proto boundary:
subject_ref / session_ref:
privacy-preserving identity or session correlation fields.
service_ref:
service instance or component correlation where the reader is authorized.
transport_id:
debug-only ring, endpoint, SQE/CQE, or waiter metadata; never subject
identity.
Legacy endpoint badge terminology must not leak into user-facing monitoring
identity. If a low-level transport path still stores a badge-shaped selector,
monitoring may expose it only as debug transport_id under an appropriate
diagnostic cap, not as subject_ref, session_ref, or service_ref.
struct ProcessStatus {
pid @0 :UInt32;
serviceName @1 :Text;
state @2 :Text;
capSlotsUsed @3 :UInt32;
capSlotsMax @4 :UInt32;
outstandingCalls @5 :UInt32;
cqReady @6 :UInt32;
cqOverflow @7 :UInt64;
lastExitCode @8 :Int64;
}
struct ServiceStatus {
name @0 :Text;
health @1 :Text;
pid @2 :UInt32;
restartCount @3 :UInt32;
lastError @4 :Text;
}
interface SystemStatus {
listProcesses @0 () -> (processes :List(ProcessStatus));
listServices @1 () -> (services :List(ServiceStatus));
service @2 (name :Text) -> (status :ServiceStatus);
}
SystemStatus is read-only. A broad instance can see the system; wrappers can
expose one service, one supervision subtree, or one session.
enum MetricKind {
counter @0;
gauge @1;
histogram @2;
}
struct MetricSample {
# Well-known fixed-name slot for counters and gauges the aggregator
# understands without additional schema lookup. Use this for stable
# kernel counters to keep the hot path allocation-free.
name @0 :Text;
kind @1 :MetricKind;
value @2 :Int64;
tick @3 :UInt64;
# Producer-scoped typed envelope for richer samples (histograms,
# top-k tables, per-subsystem structs). Payload is a capnp message;
# the schema is identified by `schemaHash` (capnp node id) and keyed
# per producer. Opaque to the generic reader; a schema-aware viewer
# decodes it.
producerId @4 :UInt64;
schemaHash @5 :UInt64;
payload @6 :Data;
}
struct MetricFilter {
prefix @0 :Text;
service @1 :Text;
}
interface MetricsReader {
snapshot @0 (filter :MetricFilter, maxSamples :UInt32)
-> (samples :List(MetricSample), truncated :Bool);
}
Early metrics should be fixed-name counters and gauges in the name/value
slot. Avoid arbitrary labels until there is a concrete memory and cardinality
policy. The producer-scoped envelope exists so richer samples do not force the
generic reader to learn a string-key taxonomy — if a producer needs per-queue
or per-device detail, it ships a typed capnp struct keyed by schemaHash
rather than synthesizing name strings.
struct TraceSelector {
pid @0 :UInt32;
serviceName @1 :Text;
errorCode @2 :Int32;
includePayloadBytes @3 :Bool;
}
struct TraceRecord {
tick @0 :UInt64;
pid @1 :UInt32;
opcode @2 :UInt16;
capId @3 :UInt32;
methodId @4 :UInt16;
interfaceId @5 :UInt64;
result @6 :Int32;
flags @7 :UInt16;
payload @8 :Data;
}
interface TraceCapture {
arm @0 (selector :TraceSelector, maxRecords :UInt32, maxBytes :UInt32)
-> (captureId :UInt64);
drain @1 (captureId :UInt64, maxRecords :UInt32)
-> (records :List(TraceRecord), complete :Bool, dropped :UInt64);
}
Payload capture should default off. A capture cap that can read payload bytes is closer to a debug privilege than a normal status cap.
enum HealthState {
starting @0;
ready @1;
degraded @2;
draining @3;
failed @4;
stopped @5;
}
interface Health {
check @0 () -> (state :HealthState, reason :Text);
}
interface ServiceSupervisor {
status @0 () -> (status :ServiceStatus);
restart @1 () -> ();
}
ServiceSupervisor is authority-changing. Normal monitoring readers should not
receive it. A broker can mint a leased ServiceSupervisor(net-stack) for one
operator action.
Kernel Diagnostics Contract
The kernel should expose a small read-only diagnostics surface for facts only the kernel can know:
- process table snapshot: pid, state, service name if known, wait state, exit code, ring physical identity hidden or omitted;
- ring snapshot: SQ/CQ head/tail-derived occupancy, overflow count, corrupted head/tail recovery counts, opcode/error counters;
- resource snapshot: cap slot usage, outstanding calls, scratch reservation, frame grant pages, mapped VM pages, free frame count, heap pressure;
- scheduler snapshot: tick count, current pid, run queue length, blocked count, direct IPC handoff count, timeout wake count;
- crash record: last panic/fault metadata and early boot stage.
The kernel should not implement log routing, alerting, dashboards, retention policy, restart decisions, RBAC, ABAC, or text search. Those are userspace service responsibilities.
Implementation shape:
- Maintain fixed-size counters in existing kernel structures where the source event already occurs.
- Prefer snapshots computed from existing state over duplicate counters when the cost is bounded.
- Expose snapshots through a small set of narrow read-only capabilities,
not one
KernelDiagnosticsgod-cap. The initial decomposition:SchedStats— tick count, current pid, run queue length, blocked count, direct IPC handoff count,cap_entertimeout/wake counts.FrameStats— free/used frame counts, frame-grant pages, allocator pressure histogram.RingStats— per-process SQ/CQ occupancy,cq_overflow, corrupted-head recovery counts, opcode counters, transport-error counters.CapTableStats— per-process slot occupancy, generation-rollover counts, insertion/remove rates.EndpointStats— per-endpoint waiter depth, RECV/RETURN match rate, abort/cancellation counts.CrashSnapshot— last panic/fault metadata, early boot stage, recent SQE context when safe.
- Each narrow cap exposes
snapshot() -> (sample :MetricSample)or a typed struct; none of them enumerates processes or reads cap tables beyond what the subsystem owns. A trusted status service composes the ones it needs; a broker leases a subset for operator sessions without the rest. ProcessInspector(pid-scoped process table, cap-table enumeration, VM map) is a distinct, stronger cap and lives with process-management authority, not with monitoring.- Convert broad diagnostics into scoped userspace wrappers before handing them to shells or applications.
- Keep panic/fault serial writes independent of any diagnostics service.
Promotion from the measure feature: the benchmark counters in
kernel/src/measure.rs graduate to always-on in RingStats / SchedStats
when the per-event cost is provably a single relaxed atomic add. Cycle-counter
instrumentation (rdtsc/rdtscp) stays behind cfg(feature = "measure")
because it is serializing and benchmark-only. The promotion threshold keeps
normal dispatch builds free of instrumentation cost without forcing monitoring
into a second build configuration.
Logging Model
Early boot has only serial. After init starts the log service, ordinary services
should receive LogSink rather than raw Console unless they need emergency
console access.
Recommended path:
- Kernel serial remains for boot, panic, and fault records.
- Init starts a userspace log service and passes scoped
LogSinkcaps to children. - The log service forwards selected records to
Consoleuntil persistent storage exists. SystemConfig.logLevelbecomes an initial policy input for which records the log service forwards and retains.- Session and operator tools receive scoped
LogReadercaps from a broker.
Services should not put secrets, raw capability payloads, full auth proofs, or arbitrary user input into logs without explicit redaction. Log records are data, not commands.
Metrics and Status
Status answers “what is alive and what state is it in.” Metrics answer “what is the numeric behavior over time.” Keeping them separate avoids a common failure mode where a human-readable status API grows into an unbounded metrics store.
Initial status fields should cover:
- pid, service name, binary name, process state, exit code;
- process handle wait state;
- supervisor health and restart policy once supervision exists;
- cap table occupancy and outstanding call count;
- ring CQ availability and overflow;
- endpoint queue occupancy where authorized.
Initial metrics should cover:
- ring dispatches, SQEs processed, per-op counts, transport error counts;
- cap-enter wait count, timeout count, wake count;
- scheduler context switches and direct IPC handoffs;
- frame free/used counts, frame grant pages, VM mapped pages;
- log records accepted, suppressed, dropped, and forwarded;
- trace records captured and dropped.
Timer/nohz/realtime metrics should be owned by monitoring rather than left as one-off debug prints once those features exist:
scheduler_tick_count{cpu};ticks_suppressed{cpu,mode};nohz_enter_count{cpu,kind};nohz_exit_count{cpu,reason};oneshot_deadline_miss_count;sqpoll_busy_ns;sqpoll_sleep_count;deadline_expired_count;budget_exhausted_count;realtime_overrun_count;donation_depth_max;housekeeping_offload_count.
These are correctness signals for nohz/realtime admission, not only performance counters. A scoped monitoring reader may observe them only under the same authority rules as other scheduler and service telemetry.
Current state alignment. Scheduler Phase D WFQ and Phase E
SchedulingContext have landed per docs/changelog.md (Phase D closed
2026-05-10), and Phase F is delivering one-SQ-consumer, nohz telemetry
counters, and housekeeping/deferred-work placement; automatic nohz
activation’s first increment is now closed via
docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md
(per the scheduler bullet in docs/tasks/README.md), and SQPOLL-driven auto-nohz
activation is also closed via
docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md: a
ring-coupled kernelSqpoll lease whose bound ring is in SQPOLL
running/sleeping mode with a live owner is admitted for tick suppression,
with the SQPOLL ring-state re-check as the decisive rollback gate; the
CpuIsolationLease preflight performs real per-CPU periodic-tick suppression
for the narrow single-runnable-entity window with fail-closed rollback;
timeout-based auto-revoke and generic full-nohz for ordinary budgeted compute
leases are also landed. The nohz/realtime counter
families above describe the target monitoring surface for those
signals — the kernel may already maintain some counters internally as
Phase F lands them, but until the narrow read-only stats caps
(SchedStats / RingStats and friends) and a userspace metrics service
ship, those counters are scheduler-internal facts and not yet exported
through a monitoring cap. The metrics service is not authority to
trigger nohz mode changes; it observes counters under the authority
rules in this proposal.
Metric labels such as mode, kind, and reason must be fixed enums, not
free-form strings:
#![allow(unused)]
fn main() {
enum NoHzKind {
Idle,
KernelSqpoll,
AutoCompute,
AutoUserspacePoller,
RealtimeIsland,
}
enum TickSuppressionMode {
Idle,
SqpollNoHz,
AutoNoHz,
RealtimeIsland,
}
enum NoHzExitReason {
TimerDeadline,
Ipi,
DeviceIrq,
SecondRunnable,
NetworkForcedPeriodic,
DeferredWork,
LeaseRevoked,
ClocksourceUnsafe,
DebugWatchdog,
}
}
Future metric schemas should add enum variants through reviewed ABI changes rather than accepting arbitrary labels.
Avoid per-method, per-cap-id, per-transport-id, or per-user high-cardinality metrics by default. Those belong in short-lived traces or scoped logs.
Benchmark outputs follow the same cardinality rule. A completed, validated
benchmark run may import a small summary such as latest median, p95, sample
count, and pass/fail status for a named benchmark profile. Raw samples,
transcripts, host/QEMU configuration, correctness evidence, and comparison
tables are benchmark artifacts, not always-on monitoring metrics. Running a
profile that needs measure, debug taps, broad status readers, or other
diagnostic authority should emit an audit record because the act of measuring
can expose timing and topology data that ordinary services should not see.
Ring as Black Box
The first concrete monitoring milestone is the completed docs/tasks/README.md
Ring-as-Black-Box item. The visible milestone was achieved by commit da5f5e9
at 2026-04-24 03:13 UTC:
- define a bounded capture format for SQE/CQE records;
- export capture through a QEMU-only debug path;
- build a host-side viewer that decodes records and capnp payloads when payload capture is authorized;
- add one failing-call smoke whose captured log can be inspected offline.
This buys immediate debugging value without committing to durable audit, network export, service restart policy, or replay semantics.
This is inspection, not record/replay. Replay requires stronger determinism, payload retention, timer/input modeling, and capability-state checkpoints.
Capture path cost. The capture cap (working name RingTap) is
feature-gated (cfg(feature = "debug_tap") analogous to measure). Every
armed tap imposes a serializing fan-out on dispatch; keeping it out of the
default kernel feature set prevents always-on cost. Arming a tap is itself
an auditable event — the tapped process and the audit log observe it —
and tap grants respect move-semantics so a tap cannot be silently cloned
past its intended holder. Payload-capturing taps require a separately
leased cap distinct from metadata-only capture because payloads may
contain secrets.
Health and Supervision
Health and restart policy should live with supervisors, not in a central kernel daemon.
Each supervisor owns:
- a narrowed
ProcessSpawner; - child
ProcessHandlecaps; - the cap bundle needed to restart its subtree;
- optional
Healthcaps exported by children; - a
LogSinkandAuditWriterfor its own decisions.
Status services aggregate supervisor-reported health. They should distinguish:
- no process exists;
- process exists but never reported ready;
- process is alive and ready;
- process is alive but degraded;
- process exited normally;
- process failed and supervisor is backing off;
- process was intentionally stopped or draining.
Restart authority should be a separate ServiceSupervisor cap. A read-only
SystemStatus cap must not be able to restart anything.
Audit Integration
Audit should share infrastructure with logging only at the storage or transport layer. Its semantics are different.
Audit producers:
AuthorityBrokerfor policy decisions and leased grants;- supervisors for restarts and service lifecycle actions;
- session manager for session creation and logout;
- kernel or status service for cap transfer/release/revocation summaries when those events become part of the exported authority graph;
- recovery tools for repair actions.
Audit readers are scoped:
- a user can read records for its own session;
- an operator can read a service subtree;
- a recovery or security role can read broader streams after policy approval.
Audit entries must avoid secrets and payload dumps. They should record object identity, service identity, policy decision summaries, and capability interface classes rather than raw data.
Security and Backpressure
Monitoring must not become the easiest denial-of-service path.
Required controls:
- Per-process log token buckets, matching the Security Verification Track S.9 diagnostic aggregation design.
- Suppression summaries for repeated invalid submissions.
- Fixed-size ring buffers with explicit dropped counts.
- Maximum record size for logs, events, crash records, and traces.
- Bounded formatting outside interrupt context.
- No heap allocation in timer or panic paths.
- No unbounded metric label creation from user-controlled strings.
- Payload tracing disabled by default.
- Redaction rules at producer boundaries and at reader wrappers.
- Capability-scoped readers; no unauthenticated “debug all” endpoint.
When pressure forces dropping, preserve first-observation diagnostics and later summaries. Losing detailed logs is acceptable; corrupting scheduler progress or blocking the kernel on log I/O is not.
Relationship to Existing Proposals
- Service Architecture: monitoring services are ordinary userspace services spawned by init or supervisors. Logging policy and service topology stay out of the pre-init kernel path.
- Shell: the native and agent shell should receive scoped
SystemStatusandLogReadercaps in daily profiles, not global supervisor authority. - User Identity and Policy:
AuthorityBrokermints scoped readers and leased supervisor caps based on session policy;AuditLogrecords the decisions. - Error Handling: transport errors and
CapExceptionpayloads are monitoring signals, but retry policy remains userspace. - Authority Accounting: resource ledgers provide the first metrics substrate and define quota/backpressure boundaries.
- Security and Verification: hostile-input tests should cover log flood
aggregation and bounded diagnostic paths. Each new monitoring boundary
(kernel stats caps, log/metrics/trace/audit services, scheduler nohz
telemetry exports) must be carried into the
docs/proposals/security-and-verification-proposal.mdTrack S.7 trust-boundary inventory before downstream services rely on it; the inventory is the canonical record that a boundary has been reviewed, not this proposal. - Live Upgrade: health, audit, and service status become prerequisites for credible upgrade orchestration.
- System Performance Benchmarks: benchmark runners may read scoped status and metrics before and after a run, but benchmark artifacts and OS-comparison reports live outside the always-on metrics service. Only low-cardinality, validated summaries should be imported into monitoring.
Implementation Plan
-
Document the model. Keep monitoring as a future architecture proposal and do not disturb the current manifest-executor milestone.
-
Ring as Black Box. Completed by commit
da5f5e9at2026-04-24 03:13 UTC: bounded SQE/CQE capture, host-side decoding, and one failing-call smoke form the first useful monitoring artifact. -
Userspace log service. (Phase 1 landed.)
LogSink/LogReaderschemas plusLogRecord/LogFilterexist (additive ordinals, reusingLogLevelas the severity type). A bounded drop-oldest kernel ring (kernel/src/cap/log.rs) backs both caps: the sink stamps the monotonic tick, drops records below the boot-seededSystemConfig.logLevelthreshold (accepted = false), bounds record size, and forwards accepted records to serial; the reader returns cursor/filtered records withnextCursorand adroppedoverflow count. ScopedLogSink/LogReadercaps are granted to children at spawn;make run-monitoring-log-smokeproves the drop, the read-back, and the reader-sideminLevelfilter. Remaining: the widerSeverity(withcritical), the correlation fields (subjectRef/sessionRef/serviceRef/transportId), per-process token buckets / suppression summaries, and persistent retention. -
Narrow kernel stats caps and SystemStatus. Add the narrow read-only caps (
SchedStats,FrameStats,RingStats,CapTableStats,EndpointStats,CrashSnapshot) as bounded snapshot surfaces. A userspaceSystemStatusservice composes the ones it needs and exposes scoped wrappers to shells and operator tools. LeaveProcessInspectorout of this step — it belongs with process-management authority, not monitoring. -
Metrics snapshots. Add fixed counters and gauges for ring, scheduler, resource, log, and trace state. Keep labels static until a cardinality policy exists.
-
Health and supervisor status. Add
Healthand read-only supervisor status once restart policy and exported service caps are concrete. Keep restart authority in separateServiceSupervisorcaps. -
Audit path. Add append-only audit records for broker decisions, cap grants, releases, revocations, restarts, and recovery actions. Start serial or memory backed; move to storage once the storage substrate exists.
-
Crash records. Preserve bounded panic/fault metadata across the current boot where possible; later store records durably.
-
Device, network, and storage metrics. Add driver metrics only after those drivers exist: interrupts, DMA/bounce usage, queue depth, RX/TX/drop/error counts, block latency, and reset events.
Non-Goals
- No global
/procor/sysequivalent with ambient read access. - No kernel-resident dashboard, alert manager, text search, or policy engine.
- No programmable kernel tracing language in the first monitoring design.
- No promise of durable log retention before storage exists.
- No default payload tracing.
- No service restart authority bundled into ordinary read-only status caps.
- No network export path until networking and policy can constrain it.
Open Questions
- Should
KernelDiagnosticsexpose snapshots only, or also a bounded event cursor? - What is the minimum timestamp model before wall-clock time exists?
- Should log records carry local cap IDs, stable object IDs, or only interface and service metadata by default?
- How should schema-aware trace decoding find schemas before a full
SchemaRegistryexists? - Which crash fields are safe to expose to non-recovery sessions?
- What retention policy is acceptable before persistent storage?
- Should
MetricsReaderuse typed structs for each subsystem instead of generic name/value samples? - Where should remote monitoring export fit once network transparency exists: a dedicated exporter service, capnp-rpc forwarding, or storage replication?
Cross-References
This proposal is reader-facing target design. The canonical trackers for the observability-adjacent risks and verification obligations it depends on live elsewhere:
docs/proposals/security-and-verification-proposal.mdTrack S.7 – Stage-6-aware refresh owns the trust-boundary inventory that any new monitoring boundary (kernel stats caps, log/metrics/trace/audit services, scheduler nohz telemetry exports, payload-capturing taps) must be carried into before downstream services rely on it. Track S.7 already lists the active scheduler-evolution surfaces (Phase D WFQ, Phase ESchedulingContext, Phase F one-SQ-consumer and nohz telemetry) plus the WASI host-adapter Phase W.4 entropy/argv boundary as inventory items to carry forward.docs/design-risks-register.mdR12 – Verification coverage is partial, not full proof is the canonical caveat for any monitoring claim that could be read as a verified property. Bounded Kani/Loom/Miri/proptest coverage plus the panic-surface inventory are not whole-system functional refinement; monitoring records and audit entries describing security- relevant decisions must respect that distinction in their wording.docs/design-risks-register.mdQ9 – CPU accounting and scheduling contexts is the canonical answer for the CPU-time, weighted-vruntime, andSchedulingContextbudget/donation/depletion semantics that monitoring metrics should observe rather than redefine. The nohz/realtime counter families in this proposal target the same surfaces; cross-service donation policy, full nohz activation, isolation leases, and fairness across principals remain proposal-shaped per Q9 and are tracked indocs/proposals/scheduler-evolution-proposal.mdanddocs/backlog/scheduler-evolution.md.
Adjacent risk-register entries observed by monitoring but owned elsewhere
include R4 (Resource accounting fragmentation, source of the
ResourceLedger metrics substrate), R8 (Networking lives inside the kernel
TCB, gating exporter-service placement), and R11 (Pre-auth and post-auth
share a shell process, gating who may receive scoped LogReader /
SystemStatus / AuditLog readers).
Proposal: Time and Clock Capability Authority
How capOS should expose wall-clock time, clock discipline, and trusted timestamps without introducing ambient real time, allowing a service to forge timestamps, or creating a covert timing channel between processes.
Problem
Today capOS has one time-related capability: Timer, which exposes
now() -> (monotonicNs, tick) and sleep(). The monotonic counter is useful
for scheduling and rate limiting, but it carries no provenance, has no
relationship to wall-clock time, and is not a trusted source for security
decisions.
Several upcoming capability surfaces implicitly need trustworthy wall-clock time:
- TLS certificate validation (
certificates-and-tls-proposal.md) must comparenotBefore/notAfterfields against a wall-clock source whose provenance the validator trusts. - OIDC token expiry (
oidc-and-oauth2-proposal.md) must compareexpandiatclaims against wall-clock time. - Audit records must carry a timestamp that a security reviewer can trust. A service must not be able to backdate its own audit entries.
- WASI
clock_time_get(CLOCKID_REALTIME)currently returnsNOSYS. Any WASM payload that needs the current time, including TLS libraries compiled to WASM, hits this gap. - Cloud metadata bootstrap (
cloud-metadata-proposal.md) supplies instance-launch time; any cloud image verification that checks a timestamp needs a root-of-trust for time.
None of these can be satisfied by handing callers a monotonic tick offset and asking them to add a boot-time offset they supply themselves: the capability model requires that time provenance be part of the granted interface, not an ambient convention.
User Stories
- A TLS handshake service holds a
WallClockcap labeledntp-synced. It callswallTime()to get the current UTC time and validates a certificate’s validity window. If the provenance wereuntrusted, it would refuse validation or surface a warning. - An audit service receives timestamped records from
AuthorityBrokerand session services. It does not trust the caller-supplied timestamp; it reads its own grantedWallClockand stamps records at ingestion time. - A WASM payload loaded by
capos-wasmcallsclock_time_get(CLOCKID_REALTIME). The WASI host adapter reads theWallClockcap that was granted to thewasm-hostprocess at launch, returns the wall-clock seconds, and sets the provenance flag in the host’s internal WASI state so that WASM callers cannot assume sync quality beyond what was granted. - An init operator grants
clockDisciplineto a userspace NTP service. The NTP service callsstep()orslew()to advance or discipline the system clock. No other process may call these methods. - A process running in an environment with no NTP synchronization receives a
WallClocklabeledmeasured-boot-monotonic. It can compute elapsed time accurately but knows that absolute wall time is only as accurate as the firmware real-time clock at boot.
Design
Existing Timer Interface
interface Timer {
now @0 () -> (monotonicNs :UInt64, tick :UInt64);
sleep @1 (durationNs :UInt64) -> ();
}
Timer remains the canonical interface for deadlines, sleep, and monotonic
elapsed time. It does not change. WallClock is a separate, orthogonal
capability whose provenance tracks the quality of the absolute time signal.
WallClock Interface
enum ClockProvenance {
# Zero-value is fail-closed: an unset, default, or unrecognized provenance
# decodes as untrusted, so a caller that skips the check never treats an
# unknown source as trusted. No reliable source known; callers must fail
# closed on sensitive decisions.
untrusted @0;
# Synchronized to a trusted NTP source within the last sync window.
ntpSynced @1;
# PTP hardware clock; higher precision, same trust level as ntpSynced.
ptpSynced @2;
# Firmware RTC at boot; advanced monotonically since; no network sync.
measuredBootMonotonic @3;
# Manual set by an operator with clockDiscipline authority.
manualSet @4;
}
interface WallClock {
# Returns UTC seconds since Unix epoch, nanoseconds within the second,
# the current monotonic offset from the same Timer.now() base, and
# the provenance label for this clock source.
wallTime @0 () -> (
utcSeconds :Int64,
utcNanos :UInt32,
monotonicNs :UInt64,
provenance :ClockProvenance
);
}
Key properties:
- No ambient access. A process must hold a granted
WallClockcap to read wall time. Init-owned processes receive it via the manifest bundle; ordinary services receive it only if their supervisor grants it. - Provenance is part of the response, not a separate call. A validator that
requires
ntpSyncedcan check the provenance field on every read without a separate round-trip. - Monotonic offset is included. The returned
monotonicNsties the wall-clock sample to theTimer.now()timeline so callers can compute elapsed time without a secondTimercall. The kernel ensures both fields are read from a consistent snapshot within the same tick. - Single method.
WallClockis read-only and has no state. Its simplicity makes attenuation straightforward: a wrapper that downgrades provenance tountrustedor truncates resolution is trivially composable.
ClockDiscipline Interface
Clock setting and NTP/PTP synchronization require a separate, stronger capability. No userspace process can discipline the clock without holding it.
interface ClockDiscipline {
# Atomically step the wall-clock by the given signed delta in nanoseconds.
# Used for large corrections (initial set from RTC, NTP step).
step @0 (deltaNs :Int64) -> ();
# Gradually slew the clock toward the target offset, bounded to
# `maxRateNsPerS` nanoseconds per second. Used for NTP drift correction.
slew @1 (offsetNs :Int64, maxRateNsPerS :UInt32) -> ();
# Declare the current source and its estimated error bound.
setProvenance @2 (
provenance :ClockProvenance,
errorBoundNs :UInt64
) -> ();
# Read the current sync state.
syncState @3 () -> (
provenance :ClockProvenance,
lastSyncMonotonicNs :UInt64,
lastStepMonotonicNs :UInt64,
errorBoundNs :UInt64,
slewRateNsPerS :Int32
);
}
ClockDiscipline is init-owned at boot. The manifest may grant it to a
dedicated NTP service process. No service other than the designated NTP/PTP
daemon should hold this cap.
step() adjusts only the UTC offset, never the monotonic base. Per the
prior-art note’s clock-step/leap-second lesson (a monotonic timeline must never
jump backwards), a step retargets the wall-clock offset layered on
Timer.now(); it does not rewind the monotonic timeline that scheduler
deadlines, ring timeouts, and slew() rate-limiting depend on. Large
discontinuities use step() (initial set / NTP step), small drift uses
slew(), and leap seconds are absorbed by slewing (smear) rather than a
backwards step so ordered timestamps never regress. The lastStepMonotonicNs
field lets a WallClock consumer detect that a step happened since a cached
observation and re-read.
Timezone and Locale Data
Timezone and locale data are not ambient. They are delivered as named entries
in a Directory-backed data store (per storage-and-naming-proposal.md).
A process that needs timezone conversion receives a scoped read-only Directory
cap pointing at the relevant tzdata namespace entry, not an environment variable
or a path under a global filesystem.
Rationale: environment variables are not capability-scoped, and a process should not observe the host’s timezone as a side channel. Explicit directory delivery makes timezone data just another granted resource.
Manifest Seeding
The boot manifest may include a seedUtcSeconds field in SystemConfig
(or an extension struct). At first kernel tick, the kernel initializes the
wall-clock state from this seed with measuredBootMonotonic provenance. If no
seed is present, the firmware RTC is read during early boot; if no RTC is
available, provenance is untrusted.
After init starts the NTP service and that service disciplines the clock, it
calls ClockDiscipline.setProvenance(ntpSynced, ...) to upgrade the provenance
label. From that point, all WallClock.wallTime() calls return ntpSynced.
Audit Timestamps
Audit records must carry a server-stamped timestamp, not a caller-supplied one.
The audit service holds a WallClock cap. When it ingests a record from
AuthorityBroker, SessionManager, or any other producer, it stamps the record
with the time returned by its own WallClock call at ingestion. The producer
may supply a monotonic offset for correlation, but the wall-clock stamp is
always the audit service’s own read.
Audit record timestamps carry the same ClockProvenance enum value that was
returned by WallClock.wallTime() at ingestion time. A security reviewer can
verify that audit entries were timestamped with a synchronized source and reject
or flag entries timestamped under untrusted.
WASI Integration
capos-wasm Phase W.3+ adds WallClock as a grantable cap in the per-instance
CapSet launched by wasm-host. The WASI Preview 1 host function
clock_time_get(CLOCKID_REALTIME, ...) reads from the granted WallClock,
returns the UTC second/nanosecond pair, and records the provenance in the host
state so that the wasm-host audit trail can assert what time quality the WASM
instance saw. If no WallClock cap was granted, clock_time_get(REALTIME)
returns NOSYS as it does today.
No Cross-Process Skew Side Channel
WallClock exposes only the current time from the kernel’s single
wall-clock state. It does not expose skew history, NTP offset measurements,
or raw clock-adjustment rates. ClockDiscipline.syncState() is the only
path to sync state and is held by at most one NTP service.
A process cannot learn another process’s read pattern from WallClock because
there is no shared counter or read-cursor that leaks observer timing. The
monotonic offset in the wallTime() response is derived from the same TSC
baseline as Timer.now() and does not introduce new covert-channel surface.
Fail-Closed Policy
Services that receive a WallClock cap and make security decisions on its
output must treat untrusted provenance as a failure condition, not a
degraded-but-functional mode. The recommended pattern:
let (utc, _, _, prov) = wall_clock.wallTime()?;
if prov == ClockProvenance::Untrusted {
return Err(CapError::ClockProvenanceInsufficient);
}
validate_cert_notafter(utc, cert)?;
Callers that accept measuredBootMonotonic for non-security uses (e.g., log
timestamps, cache TTLs) should document the provenance they accept. Callers
that accept only ntpSynced or ptpSynced for security decisions should
reject all other values.
Phasing
Phase 1 — WallClock Read and Provenance
Status: landed (2026-05-24 09:31 UTC), fixed-boot-base variant. The
WallClock read cap and ClockProvenance enum exist end-to-end: schema +
generated bindings, kernel/src/cap/wall_clock.rs, the capos-config
wall_clock kernel source, the capos-rt WallClockClient, and a shell date
command proven by make run-shell. The follow-up bullets below (manifest seed,
stateful WallClockState, init audit/TLS grants, WASM realtime clock) remain
Phase 1.x / Phase 2.
- Add
WallClockinterface andClockProvenanceenum toschema/capos.capnp. Landed. - Landed (fixed-boot-base variant): the kernel cap derives UTC from a fixed
compile-time base over the existing monotonic timebase and reports the
fail-closed
untrustedprovenance (theClockProvenancezero value). It is not read from firmware RTC and is not network-synchronized, sountrustedis the honest label; this also proves the zero-value fail-closed enum semantics end-to-end. A statefulWallClockState(UTC offset, provenance, last-sync tick, error bound) and a manifestseedUtcSecondsseed withmeasuredBootMonotonicprovenance are deferred to Phase 1.x / Phase 2 whereClockDisciplinecan upgrade the label. cap/wall_clock.rsimplements the cap;capos-rtadds a typed client. Landed (WallClockClient, with a fail-closedClockProvenance::from_schemaunknown-variant decode).- Init grants
WallClockto audit service and TLS service in the manifest bundle. (Deferred; the landed proof grantswall_clockdirectly to the shell-as-init insystem-shell.cue.) - WASM host adapter:
clock_time_get(CLOCKID_REALTIME)reads the instance’s grantedWallClock; if absent, returnsNOSYSas before. (Deferred.) - Smoke: a shell
datecommand inmake run-shellboots, readsWallClock, prints UTC seconds/nanos/monotonic plus the provenance label, and exits cleanly. Landed (asserted intools/qemu-shell-smoke.sh).
Phase 2 — Clock Discipline and NTP Service
- Add
ClockDisciplineinterface to schema. - Kernel implements
step(),slew(),setProvenance(), andsyncState(). - A userspace NTP client process receives
ClockDisciplinefrom init and synchronizes to a configured NTP server (requiresUdpSocketfrom the networking capability). - After first successful sync, calls
setProvenance(ntpSynced, errorBoundNs). All subsequentWallClock.wallTime()calls returnntpSynced. - Audit entries timestamped post-sync carry
ntpSyncedprovenance.
Phase 3 — PTP, Leap Second, and Suspend Recovery
- PTP hardware clock support for environments that have it.
- Leap-second policy: step vs. smear, configurable per
ClockDiscipline. - Suspend/resume:
WallClockprovenance downgrades tomeasuredBootMonotonicafter a suspend event until NTP re-syncs. (Cross-links to the future power/suspend proposal; no dependency today.) - Timezone delivery: a
Directorynamespace entry backed by tzdata is seeded from the manifest and delivered as a cap to timezone-aware services.
Hazards and Invariants
Monotonic vs. wall-clock relationship. The wall-clock state is an offset
applied to the Timer monotonic base. step() changes the offset; the
underlying monotonic timeline never goes backward. Callers that need monotonic
guarantees must use Timer.now(); callers that need calendar time use
WallClock.wallTime(). This separation prevents a clock step from violating
monotonicity promises made to schedulers or ring timeouts.
ABI stability. ClockProvenance enum variants must only be added, never
removed or reordered. Binaries compiled against an older schema that see an
unrecognized provenance value should treat it as untrusted (fail-closed).
This requires the capnp generated decode to default unknown enum values to zero,
which is ntpSynced — so the schema field ordering above must put untrusted
at zero or the generated bindings must use an explicit unknown-variant path.
Ordering note: when adding to schema, put untrusted @0 first so that the
zero default is fail-closed, not the most-trusted value.
DMA and IRQ neutrality. WallClock and ClockDiscipline do not touch
device memory, DMA pools, or interrupt grant paths. They are pure kernel-state
caps. No DMA/MMIO/IRQ hazard applies.
No capability-transfer amplification. WallClock is a read-only snapshot
surface. Transferring it to another process does not grant clock-setting
authority. ClockDiscipline must not be transferable through normal cap-grant
paths; it should be restricted to init-owned grant at boot and explicit
manifest-operator grants.
Relevant Research and Prior Art
In-Tree Grounding
- NO_HZ, SQPOLL, and Realtime Scheduling
records the Linux timer-stack split between clock sources (monotonic
timeline counters) and clock events (hardware devices that interrupt at
selected future times), and concludes capOS should “introduce a monotonic
now_nsclocksource layer” distinct from the scheduler tick. This proposal builds directly on that separation:Timer.now()/WallClock.wallTime()expose the clocksource timeline, while clock-event programming stays a scheduler concern. The wall-clock offset rides on the same monotonic base so a clock step never rewinds the timeline the scheduler and ring timeouts depend on — the monotonicity invariant called out in that note. - Future Scheduler Architecture
reinforces the same clocksource/clockevent boundary and the lesson that
absolute-deadline waiters should be stored by expiry time, not periodic tick
count. That confirms
WallClockmust not become the deadline substrate: deadlines remain monotonic, and wall-clock time is a separate, disciplinable view layered on top.
External Precedent and Lessons
- Linux
clock_gettime/adjtimex. Linux exposes distinct clock IDs (CLOCK_MONOTONICvsCLOCK_REALTIME) and gates clock discipline behind a privileged interface:adjtimex/clock_adjtimeand stepping the realtime clock requireCAP_SYS_TIME. Lesson: reading time and disciplining time are different authorities. capOS encodes this as a read-onlyWallClockcap held by ordinary services and a separate, strongerClockDisciplinecap held only by a designated sync service — the capability-native analog of the read/CAP_SYS_TIMEsplit. - Linux time namespaces.
CLOCK_MONOTONIC/CLOCK_BOOTTIMEoffsets can be virtualized per namespace so a container observes a different boot/monotonic origin. Lesson: time can be a per-context value rather than a single global ambient fact, which supports delivering wall-clock as a granted, attenuable cap (and timezone data as a scopedDirectory) instead of a process-wide environment. - Fuchsia/Zircon UTC clock objects. Fuchsia models UTC as a kernel clock
object distributed to processes as read-only handles, with a separate
privileged maintainer service holding the write handle that disciplines the
clock; clock reads carry an error bound and a “started/synced” signal so a
reader can tell whether the clock is yet trustworthy. Lesson: this is the
closest capability-native precedent for the design here. capOS’s
read-only
WallClockwith aClockProvenancelabel maps to Fuchsia’s read-only UTC handle plus its synced/error-bound signal, andClockDisciplinemaps to the single write-handle maintainer. (The in-tree Zircon report covers handles, rights, and VMOs but not the UTC clock object specifically; the UTC-clock mapping is external precedent, not yet captured as an in-tree research note.) - NTP step vs. slew. NTP daemons step the clock for large offsets and slew
(bounded rate adjustment) for small drift, precisely because abruptly
rewinding wall time breaks timestamp ordering and timeouts. Lesson: capOS
exposes
step()andslew()as distinctClockDisciplinemethods rather than a single “set time”, so the discipline policy is explicit at the cap boundary. - IEEE-1588 PTP. Precision Time Protocol provides sub-microsecond hardware
timestamping via a dedicated hardware clock, distinct from software NTP.
Lesson: provenance is not binary. The
ptpSyncedvsntpSynceddistinction inClockProvenancelets a validator that needs high-precision time distinguish the two without conflating accuracy with mere network sync.
Dedicated Research Note
- Time and Clock Authority is the
focused prior-art survey for this proposal: verified Linux
CAP_SYS_TIMEread/discipline split, time namespaces as per-context clock offsets, chrony/NTP step/slew/smear discipline, PTP/IEEE-1588 hardware timestamping, Fuchsia’sZX_RIGHT_READ/ZX_RIGHT_WRITEUTC clock object, and leap-second smearing vs stepping, each with its capOS lesson and real sources. It is the primary external grounding forWallClock,ClockDiscipline, andClockProvenance.
Residual research still owed before Phase 2/3 implementation: the servo / loop-filter behavior, holdover and error-bound estimation, and suspend/resume clock recovery are the highest-risk underspecified areas and should be deepened in that note (or a follow-on) rather than fixed by this proposal’s sketch.
Relevant Proposals
- Certificates and TLS — TLS validation
delegates certificate validity-window checks to a granted
WallClock. - OIDC and OAuth2 — Token expiry checks
(
exp,iat,nbf) use a grantedWallClockwith at leastmeasuredBootMonotonicprovenance. - WASI Host Adapter — Phase W.3+
clock_time_get(CLOCKID_REALTIME)backed by a per-instanceWallClockcap. - Cloud Metadata — Cloud instance launch time
delivered through the metadata capability; the
WallClockseed path integrates with this bootstrap. - System Monitoring — Audit records carry
ClockProvenance-labeled timestamps from the audit service’s ownWallClockread at ingestion. - Storage and Naming — Timezone and locale
data delivered as a read-only
Directorycap, not an ambient environment.
Proposal: Crash Recovery and Supervision
How capOS handles unplanned process failure: propagating the death to capability holders, recording a structured crash event, and restarting the service within a bounded policy — all without resurrecting stale authority.
Problem
Live upgrade covers the planned case: a supervisor
quiesces a running service, transfers state, retargets caps, and exits the old
process in a controlled sequence. Unplanned failure is different. A process
that panics, faults, or is killed by the kernel OOM path leaves no quiesce
call, no state handoff, and no ordered exit. The kernel marks the process
dead and epoch-bumps its caps, but nothing in the current model tells callers
what happened or gives the supervisor a policy-bounded path to respawn it.
The gaps are:
- Stale-cap observability. Callers holding a cap to the dead process
receive
disconnectederrors at the transport level (the epoch-revocation path from Stage 6 is in place), but there is no structured CQE event that carries crash context or lets the caller distinguish a crash from a planned termination. - Crash metadata capture. Panic location, fault address, and last SQE opcode are useful for operators but must not leak raw cap-table contents, local cap IDs, or buffer bytes, which would break the no-ambient-authority invariant.
- Bounded restart policy. Re-spawning a crashing service without a budget produces crash-loop amplification; re-spawning must use the same broker and manifest authority that the original spawn used, not an escalated path.
- Watchdog liveness. A process that hangs without crashing is not detected by crash handling alone.
- Degraded boot. If a critical service fails to start, the system needs a safe fallback rather than a silent hang.
This proposal fills these gaps without touching the live-upgrade protocol and without adding a god-object supervisor.
User Stories
- An operator running
make run-smokesees a structured crash record in the audit log when a demo service panics, not a silent stale-cap error. - A client process calling a crashed server receives a
disconnected-classCapExceptionpromptly; the process does not block indefinitely. - Init restarts a failed service up to the configured failure budget, then stops and declares the service permanently failed rather than looping forever.
- A watchdog-registered service that hangs (no panic, no exit) is detected within its timeout and restarted under the same policy.
- If the network stack fails before a shell connects, the manifest-declared emergency shell starts instead of leaving the system unresponsive.
Design
Stale-Cap Propagation
When the kernel marks a process dead (panic, fault, or explicit terminate
without a prior clean exit), it performs the same epoch-bump it already does
for released caps. The existing disconnected value in ExceptionType
covers the transport error. The new addition is a death CQE: a
CapException { type: disconnected, message: "server-death" } delivered to
any process with an outstanding CALL SQE whose target belongs to the dead
process.
From the caller’s perspective an unplanned crash looks identical to a
force-mode live upgrade that did not reattach: the in-flight CALL returns
disconnected, epoch is bumped, and any subsequent CALL on that cap also
returns disconnected until the supervisor retargets the cap to a fresh
instance. No new CQE opcode is needed; the existing two-level error model
from Error Handling is sufficient.
Invariants:
- A
disconnectedCQE on an outstanding CALL must be delivered before the kernel recycles any frame that belonged to the dead process. Frame reuse ordering is the same constraint that applies to the force-mode live-upgrade path. - A cap whose epoch has been bumped must never route a new CALL to the dead process’s address space, even transiently. The epoch check is a load fence on the per-cap generation counter before any ring dispatch.
- Endpoint client facets held by the dead process are revoked at the same epoch bump. Other processes’ client facets to the same endpoint are not affected — they route to the endpoint owner, not to the crashed client.
Crash Record Capture
When a process dies unplanned, the kernel appends a crash record to the
AuditLog cap held by the supervisor that spawned the process (not to a
global log visible to all processes). The record is structured to support
operator debugging without leaking internal kernel state:
# Proposed addition to schema/capos.capnp (Phase 1)
enum CrashKind {
panic @0; # Rust panic! path
pageFault @1; # unmapped or protection fault
generalProtection @2;
stackOverflow @3;
illegalInstruction @4;
kernelKill @5; # explicit ProcessHandle.terminate
}
struct CrashRecord {
processName @0 :Text;
kind @1 :CrashKind;
# Instruction pointer at death, relative to ELF load base.
# Absolute virtual address is NOT included to avoid leaking
# kernel-side layout or userspace ASLR seeds.
faultOffsetInBinary @2 :UInt64;
# Last SQE opcode dispatched for this process (0 = none in flight).
lastSqeOpcode @3 :UInt8;
# Session context ID of the process (opaque; matches AuditLog sessionId
# for attribution without carrying cap-table or buffer contents).
sessionContextId @4 :Data;
# Monotonic kernel timestamp at death.
timestampNs @5 :UInt64;
}
Fields explicitly not included: raw cap IDs, cap-table slot contents,
userspace buffer bytes, kernel heap pointers, or any data from the process’s
address space beyond the fault offset. The crash record is attributed to the
process’s session context ID so it can be correlated with prior AuditLog
records without exposing the full cap graph.
The crash record is delivered through the same AuditLog.record path the
hardware-audit service already uses: the supervisor holds the AuditLog cap;
the kernel invokes it on the supervisor’s ring (via a kernel-initiated RECV)
rather than on a shared global ring.
Bounded Restart Policy
The supervisor that spawned a failed process owns the restart decision. The
restart budget is declared in the manifest’s initConfig.services entry and
interpreted by init (or a delegated supervisor):
# CUE representation (illustrative)
restart: {
policy: "on-failure" # never | on-failure | always
maxRestarts: 5 # total budget over the window
windowSecs: 60 # sliding window for the budget
backoffBase: "1s" # initial delay before first restart
backoffMax: "30s" # ceiling on exponential backoff
emergencyFallback: "shell" # service name to promote if budget exhausted
}
Backoff is bounded and service-class aware. The exponential
backoffBase→backoffMax schedule suits user-facing services that should
self-heal without spinning (the Kubernetes CrashLoopBackOff lesson). For
always-available system services, the prior-art note’s systemd lesson favors a
short flat delay so a transient fault recovers fast; such services set
backoffBase == backoffMax for flat RestartSec-style behavior. In both cases
maxRestarts/windowSecs is the hard give-up budget (the OTP
max-restart-intensity lesson), so neither model spins forever.
Crash-loop detection. If maxRestarts attempts exhaust within
windowSecs, the supervisor stops restarting and records a budget-exhausted
event. The service is marked permanently failed until an operator issues an
explicit override through the ProcessHandle or re-spawns via a fresh
manifest reload.
Authority preservation. Each restart uses the original ProcessSpawner
call with the same CapGrant list that was used at initial spawn. The
supervisor does not invent new grants or escalate authority. If a grant source
was a SpawnGrantSource::Kernel DDF handle that is now invalidated (for
example, a DMA buffer whose owner quiesce failed), the restart fails closed
with a spawn-grant-invalid error rather than falling back to an ambient
grant.
No resurrection of stale caps. The restarted process receives a fresh cap
table. The supervisor must call CapRetarget (from
Live Upgrade) to re-point existing
client caps to the new process. If CapRetarget is not yet implemented,
clients observing disconnected must reconnect through the supervisor’s
exported endpoint, which the supervisor re-registers after restart.
Watchdog Capability
A service that can hang without crashing (blocked ring, infinite loop, deadlock on a kernel-held lock it does not own) is not detected by exit-path crash handling. The watchdog provides periodic liveness proof:
# Proposed future addition to schema/capos.capnp (Phase 3)
interface Watchdog {
# Service calls this on every iteration of its main loop to
# reset the deadline. If not called within `timeoutNs` of the
# last kick (or of registration), the supervisor is notified.
kick @0 () -> ();
# Unregister. Safe to call during planned shutdown.
cancel @1 () -> ();
}
interface WatchdogSource {
# Register this process with the given timeout.
# Returns a Watchdog the service holds and kicks.
register @0 (processName :Text, timeoutNs :UInt64) -> (watchdogIndex :UInt16);
}
The supervisor grants a Watchdog cap (minted from a WatchdogSource it
holds) to each service it considers watchdog-registered. If the kernel timer
fires without a kick, the supervisor receives a liveness-failure notification
and treats it identically to an unplanned crash: crash record, restart budget
check, backoff.
The watchdog is an opt-in service-level contract, not a mandatory kernel
mechanism. Services that are inherently event-driven (blocked on cap_enter
waiting for an SQE) do not need a watchdog; they will return disconnected to
callers if they stop processing. Watchdog is primarily useful for services
with internal polling loops or external I/O not driven by the capOS ring.
Degraded Boot
The manifest may declare an emergency fallback service that is promoted when a critical service exhausts its restart budget before the system reaches a usable state:
# CUE (illustrative)
degradedBoot: {
trigger: "net-stack" # if this service fails permanently...
fallback: "shell" # ...promote this service to interactive
timeoutSecs: 30 # deadline from kernel handoff to readiness
}
The init process monitors service readiness. If a declared critical service fails to reach readiness within the timeout and has exhausted its restart budget, init spawns the fallback service with a console cap and an audit cap so an operator can inspect what failed. The fallback service is not granted the failed service’s caps; it is a scoped interactive shell, not a repair agent with escalated authority.
Relevant Research and Prior Art
In-Tree Research Notes
- Crash Recovery and Supervision
is the dedicated prior-art survey for this proposal: supervision trees,
restart budgets (OTP intensity/period, systemd
StartLimit, KubernetesCrashLoopBackOff), dead-server notification (FuchsiaZX_CHANNEL_PEER_CLOSED, seL4 silence, GenodeIpc_error), and coredump-redaction concerns, each verified against primary sources. - OS Error Handling grounds
the stale-cap surface. It records what callers observe when a server dies in
comparable systems: Zircon channels close and the peer observes
ZX_ERR_PEER_CLOSED; Genode capabilities to a dead server become invalid and subsequent invocations produceIpc_error; seL4 routes faults to a per-thread fault endpoint; KeyKOS/EROS routes them to the domain keeper. The shared lesson is that a dead server must surface as a typed transport-level signal, not a hung invocation — which is exactly thedisconnecteddeath CQE this proposal specifies. - Cap’n Proto Error Handling
fixes the meaning of
disconnectedin the four-kind capnp model: “connection to a necessary capability was lost,” with the client response being re-establish-and-retry. This proposal reuses that existing classification rather than minting a new exception kind; the only addition is when the kernel emits it (unplanned death) and the pairedCrashRecord. - EROS, CapROS, Coyotos
documents the EROS/KeyKOS keeper mechanism: a capability to a separate
domain that the kernel invokes on fault, which can inspect, terminate, or
restart the faulting domain (process supervision is an explicit listed use).
capOS’s supervisor-owns-
ProcessHandlemodel is the same shape with capnp typed methods instead of a keeper key, and the kernel never initiates the restart itself. - Genode and
seL4 ground the no-resurrected-authority
invariant: Genode’s parent-supervised component tree with revocable
capabilities, and seL4’s hierarchical delegation plus
Revokeover the capability derivation tree, both establish that a restarted child gets fresh authority and that revocation of the dead instance’s caps is the supervisor’s (parent’s) responsibility, not an ambient lookup.
External Precedent and Lessons
- Erlang/OTP supervision trees. The “let it crash” philosophy plus
supervisor restart strategies (
one_for_one,one_for_all,rest_for_one) and max-restart-intensity (MaxRrestarts withinMaxTseconds, after which the supervisor itself terminates) are the direct precedent for this proposal’s per-service failure budget and crash-loop detection. The lesson: bound restarts over a sliding window and escalate (here: stop and mark permanently failed, optionally promote degraded boot) rather than loop forever. - systemd unit restart policy.
Restart=on-failure|always,RestartSecbackoff, andStartLimitIntervalSec/StartLimitBurstare the precedent for thepolicy/backoffBase/maxRestarts/windowSecsfields. The lesson: separate the whether (policy) from the pacing (backoff) from the give-up threshold (burst limit). - Kubernetes liveness/readiness probes and CrashLoopBackoff. Liveness
probes (kubelet restarts a container that fails its probe) are the precedent
for the
Watchdog.kick/timeout design; readiness gating before promotion is the precedent for the degraded-boot readiness deadline;CrashLoopBackOffwith exponential backoff is the precedent for capped exponential restart delay. The lesson: liveness is opt-in and orthogonal to crash detection — a hung-but-not-dead process needs an explicit liveness signal. - Fuchsia component lifecycle. Component-manager-driven start/stop and
rebinding in the routing graph parallel capOS’s supervisor +
CapRetargetreconnection. The dedicated research note above grounds Fuchsia’s death-observation behavior (ZX_CHANNEL_PEER_CLOSED, no implicit reconnect); a deeper write-up of Fuchsia component-manager restart and escrow semantics remains research-needed (per thedocs/backlog/research-design-gaps.mdconvention) before this proposal cites specific escrow behavior as grounding.
Phasing
Phase 1 — Stale-cap DISCONNECTED propagation and crash record (most
model-critical). The death CQE for in-flight CALLs is the highest-priority
item because it closes the model gap: callers can observe server death as a
typed transport error rather than a hung ring. Crash record delivery to the
supervisor’s AuditLog is paired here because it uses the same kernel death
path. Requires: epoch-revocation from Stage 6 (done), AuditLog cap
(done), CrashRecord schema addition.
Phase 2 — Bounded restart policy and crash-loop detection. Init reads
restart budget fields from initConfig.services, applies exponential backoff,
and stops at budget exhaustion. Requires: Phase 1 crash record so init knows
whether a death was planned or unplanned; CapRetarget from live-upgrade Phase
1 to reconnect client caps after restart.
Phase 3 — Watchdog capability. WatchdogSource and Watchdog schema,
kernel timer integration, supervisor-side timeout detection, and liveness-failure
events fed into the same restart budget path as Phase 2.
Phase 4 — Degraded boot. Manifest parser reads degradedBoot fields;
init promotes the fallback service on budget exhaustion during the boot window.
Requires: Phase 2 budget tracking.
Hazards and Invariants
Frame reuse ordering. The kernel must not return frames from a dead
process’s address space to the frame allocator until all outstanding
disconnected CQEs for that process’s caps have been delivered. Violating
this could allow a concurrent FrameAlloc to map recycled memory into a new
process before the old process’s CQEs complete, creating a window where a
stale disconnected CQE arrives after the frame holds new data. The existing
DMA quiesce/scrub ordering in the DMA pool grant path is the model for this
constraint.
No stale authority after restart. A restarted process receives only the
grants declared in the original ProcessSpawner.spawn call. The supervisor
must not silently re-grant caps that were revoked as part of the death epoch
bump. In particular, any DMAPool-derived handle that was in active use at
crash time must be explicitly re-acquired through the grant-source path, not
recycled from the dead process’s cap table.
Restart does not bypass the authority broker. If the original spawn was
gated on an AuthorityBroker-selected session context, the restart uses the
same broker path. The supervisor cannot substitute a broader session context
or an anonymous context to make the restart succeed.
Capability revocation precedes any dump. The death epoch bump that invalidates the crashed process’s caps must complete before any crash record or future coredump is produced. A record produced post-revocation sees only dead cap indices, never live authority; a pre-revocation memory snapshot could otherwise capture live cap indices or ring-buffer contents (the race class behind recent coredump CVEs). Any future coredump extension must run only after revocation and must not be readable by unprivileged dump readers.
Crash record isolation. The crash record must not carry raw cap IDs, cap table slot numbers, or any data read from the process’s address space (stack contents, heap contents, message buffers). The fault offset is relative to the binary load base, not an absolute virtual address, to avoid leaking kernel layout or userspace address randomization.
Watchdog authority is narrow. A Watchdog cap proves liveness for exactly
one registered process. It does not grant the holder any access to the
supervisor, the process’s caps, or any other service. It is a pure liveness
signal, not an authority surface.
Relationship to Adjacent Proposals
- Live Upgrade — covers the planned
case. The
CapRetargetprimitive defined there is consumed by Phase 2 of this proposal to reconnect client caps after an unplanned restart. The force-modedisconnecteddelivery and epoch-revocation paths are shared; this proposal adds the death CQE and crash record on top. - Service Architecture —
defines the supervisor tree and the
RestartPolicytype currently parsed by init. This proposal extends that policy with the budget, backoff, and budget-exhaustion fields, and binds crash handling to the supervisor that owns theProcessHandle, not to a global daemon. - capos-service — defines the
userspace service framework above
capos-rt. The watchdogkickcall and readiness notification in Phase 3 are natural additions to the service lifecycle hooks thatcapos-serviceabstracts. - Error Handling — the
disconnectedclass in the two-level error model is the transport surface for stale-cap delivery. This proposal does not add new error types; it specifies when and howdisconnectedis delivered for an unplanned death. - System Monitoring — crash records, restart events, budget-exhaustion notifications, and watchdog timeouts are all audit-worthy. The monitoring proposal owns the operator visibility surface; this proposal defines the structured events that feed it.
- Resource Accounting and Quotas — the failure budget is a quota: a count consumed by crash events and refilled by the sliding window. The accounting model for this quota follows the same ledger-of-record pattern as memory and scheduling quotas.
Proposal: Debug and Trace Authority
How capOS should expose process-attach, capability-table inspection, ring-trace capture, and sampler/profiler authority to debuggers and maintenance tools without granting kernel privilege, ambient inspection rights, or a covert channel for authority transfer.
Problem
A capability OS whose security claim is “you can only access what you
were explicitly granted” breaks silently if a debugger can attach to
any process without authority. Unix ptrace is the canonical example:
any process with sufficient Unix privilege can stop, inspect, and
modify another process’s address space and register state, bypassing
all higher-level access controls. capOS must not import that model.
At the same time, debugging real failures requires more than serial
output. The existing debug_tap facility (kernel/src/debug_tap.rs)
emits bounded SQE/CQE records to the emergency serial path at
QEMU-only build time, but it has no userspace-facing capability, no
consent protocol, no audit trail, and no scoping to a specific target.
The measure feature adds benchmark-only TSC counters, also
build-gated and operator-facing only. There is currently no
capability-shaped debug/trace/profile surface at all.
This is a capability-model gap. Until it is filled, the only debugging tool is serial output and offline log inspection — useful for early kernel work, but insufficient once real service decomposition and cross-process interactions exist.
User Stories
- An operator maintenance session needs to inspect which capabilities a stuck service holds, without being able to invoke any of them.
- A developer investigating a failing smoke test wants a bounded record of the SQEs and CQEs the target process issued around the failure, decoded against the current schema.
- A profiler tool needs sampled PC/stack snapshots of a running service at a configured frequency without stopping the service or holding a live breakpoint.
- An agent-shell maintenance workflow needs to attach to a service granted to it by the authority broker, with that attach action recorded in the audit log.
- A supervisor needs to assert that a debugged process cannot escalate its authority into other processes by virtue of being debugged.
Design Principles
- Attach is authority. Connecting a debug session to a process
requires an explicit
DebugSessioncapability. No ambient ptrace analog. The kernel does not hand out debug access on the basis of Unix UID or any implicit privilege. - Consent is required. A
DebugSessionfor a live process is obtained either by explicit owner consent (the process or its supervisor grants one), or through a broker-mediated maintenance session policy decision. Neither path is self-minted. - Attach is audited. Every
DebugSessioncreation and every inspection operation through it is an auditable event. The target process and the audit log both observe it. - Snapshots are read-only. Cap-table and VM inspection through a
debug session produce read-only snapshots. No capability in the
snapshot is transferable to or activatable by the inspector. A
debug session must not become a covert authority-transfer channel.
The GDB-RSP prior art is a reminder that a full debugger is
read/write authority over its target; in this design the read-only
snapshot/trace surface (Phases 1-3) and any future read/write control
(breakpoints, register writes, Phase 4) are distinct authorities.
Write authority is a separately leased, stronger cap and never rides
implicitly on the read-only
DebugSession. - Secrets and payload bytes are redacted by default. Cap-table snapshots expose names, interface IDs, and slot indices — not raw capability payloads, bearer tokens, or memory-mapped buffer contents. Payload capture requires a separately leased and stronger cap.
- A debugged process cannot escalate. A process being debugged must not thereby gain the ability to inspect or affect other processes. The debug session is scoped to one target; no cross-process read or call is admitted through it.
- Symbol resolution is bounded. Resolving a PC address to a symbol name requires access to a symbol table file or binary, not filesystem authority. Symbol resolution is a separate, explicitly scoped cap — not bundled into the basic debug session.
- Build gates are graduated. The
debug_tapkernel facility stays behindcfg(feature = "debug_tap")for its current always-emit emergency-serial behavior. The userspace-facingDebugSessionandRingTracecaps are not build-gated but are absent from production bootstrap CapSets; a broker may mint them only under an explicitly authorized maintenance session policy.
Authority and Consent
A DebugSession is created through one of two paths:
Owner consent. The target process’s supervisor or owner holds a
ProcessHandle and can call a createDebugSession method on it to
mint a DebugSession for the target. This is the normal developer
workflow: the supervisor that spawned a service grants a debug session
to a maintenance tool.
Broker-mediated maintenance session. The authority broker holds a
restricted ability to mint DebugSession caps for processes within a
maintenance session scope — for example, for an operator who has
authenticated and whose session policy permits debugging named
services. The broker records the grant as an audit event. Normal
shells and user sessions do not receive this authority.
Neither path is self-minted. A process cannot mint a DebugSession
for itself or for peers from ambient state. The kernel does not expose
a DebugAll cap at bootstrap.
Attaching a DebugSession produces an audit record covering: the
initiator session, the target pid and service name, the authority
source (owner consent or broker grant), and the timestamp. The target
process receives a notification at attach time if it has an active
ring — not as a blocking gate, but as an observable event.
Proposed Interfaces
These are conceptual interfaces. They should not be added to
schema/capos.capnp until a Phase 1 implementation slice needs them.
# Read-only snapshot of one capability slot in the target's cap table.
# Does not transfer or activate any authority.
struct CapSlotSnapshot {
slotIndex @0 :UInt32;
interfaceId @1 :UInt64; # capnp type ID; 0 if untyped or unknown
methodCount @2 :UInt16;
label @3 :Text; # kernel-assigned or schema-derived name
state @4 :Text; # e.g. "live", "released", "pending-return"
}
# Read-only snapshot of the target's capability table.
# None of these slots are transferable to or callable by the inspector.
struct CapTableSnapshot {
targetPid @0 :UInt32;
tick @1 :UInt64;
slots @2 :List(CapSlotSnapshot);
slotTotal @3 :UInt32;
slotUsed @4 :UInt32;
snapshotDrop @5 :UInt32; # slots omitted due to budget/redaction
}
# A scoped debug session attached to one process.
interface DebugSession {
# Read-only snapshot of the target's current capability table.
capTableSnapshot @0 () -> (snapshot :CapTableSnapshot);
# Arm a bounded ring-trace capture on the target.
# Returns a RingTrace cap scoped to this session and target.
armRingTrace @1 (maxRecords :UInt32, maxBytes :UInt32)
-> (trace :RingTrace);
# Read a bounded sampler record set for the target.
# Returns PC/stack samples at the configured frequency without
# stopping the target.
armSampler @2 (intervalNs :UInt32, maxSamples :UInt32)
-> (sampler :Sampler);
# Detach. Further calls on this session are rejected.
detach @3 () -> ();
}
# Bounded ring-trace cap, scoped to one DebugSession target.
interface RingTrace {
# Drain buffered SQE/CQE records for the attached target.
drain @0 (maxRecords :UInt32)
-> (records :List(TraceRecord), complete :Bool, dropped :UInt64);
# Disarm and release the capture buffer.
release @1 () -> ();
}
# Sampler cap for sampled PC/stack snapshots.
interface Sampler {
# Read the next available sample batch.
read @0 (maxSamples :UInt32)
-> (samples :List(SamplerRecord), dropped :UInt64);
# Stop sampling and release the reservation.
stop @1 () -> ();
}
struct SamplerRecord {
tick @0 :UInt64;
pid @1 :UInt32;
pc @2 :UInt64;
# Shallow inline frames; bounded to avoid variable-length allocation
# on the capture hot path.
frames @3 :List(UInt64);
framesDrop @4 :UInt8; # frames omitted due to depth cap
}
TraceRecord is the same shape defined in
docs/proposals/system-monitoring-proposal.md: tick, pid, opcode,
cap_id, method_id, interface_id, result, flags, and an optional
payload blob gated by a separately leased stronger cap.
Symbol and Source Boundary
Resolving a sampled PC address or a ring-trace cap_id to a human-readable symbol requires access to symbol tables and debug info, not filesystem authority. The design uses an explicit, scoped symbol-resolver cap:
- A
SymbolTablecap holds a read-only ELF DWARF/symbol section for one binary, loaded from a trusted source (boot package or signed artifact store). - The inspector passes a
SymbolTablecap and a list of addresses; the resolver returns bounded name strings. - No arbitrary filesystem path traversal is admitted through this path.
SymbolTableis separately minted fromDebugSession; holding a debug session does not imply symbol resolution authority, and holding a symbol table does not imply attach authority.
Symbol resolution is Phase 3+ work. Phase 1 produces raw addresses;
offline host-side tools (e.g., addr2line on the kernel ELF) handle
symbol lookup during the research phase.
Phasing
Phase 1 — DebugSession Attach and Cap-Table Snapshot (model-critical)
- Define
DebugSession,CapSlotSnapshot, andCapTableSnapshotinschema/capos.capnp. - Implement
ProcessHandle.createDebugSessionin the kernel, guarded by the existingProcessHandleauthority boundary. capOS uses process-level debug authority here because most current services are single-threaded; the seL4 per-TCB-cap prior art argues for deriving per-thread sessions fromThreadControl, the intended finer-grained follow-up once multi-threaded targets need it. capTableSnapshotreturns a bounded, redacted read-only snapshot of the target’s current cap table. No cap in the snapshot is transferable or callable.- Audit record emitted to
AuditLogat attach and at each snapshot call. - No payload capture, no ring trace, no sampler in this phase.
- Proof: a smoke test where a supervisor attaches a debug session to a
child, calls
capTableSnapshot, and verifies the snapshot fields against what the child was granted at spawn time. The audit log must contain the attach record.
Phase 2 — Ring Trace via DebugSession
- Add
armRingTraceandRingTraceto the schema and kernel. - Build on the existing
debug_tapring-capture record format (RingCaptureRecordincapos_config::ring), but route capture through theDebugSessionauthority rather than the always-emit emergency-serial path. - The
RingTracecap is scoped to the attached target; it cannot observe other processes. - Payload capture (
includePayloadBytes) requires a separately presented stronger cap (not yet defined in Phase 2). - Disarming the
RingTracereleases the capture buffer and emits an audit record. - Proof: extend the failing-call smoke from the Ring-as-Black-Box
milestone (commit
da5f5e9) to route capture through aDebugSessioninstead of the emergency serial path, and verify the drained records match the expected SQE/CQE sequence.
Phase 3 — Sampler Authority
- Add
armSamplerandSamplerto the schema and kernel. - The sampler fires at a configured interval, captures PC and a bounded inline call frame, and buffers records for drain.
- The target process is not stopped; sampler overhead is bounded by sample interval and buffer depth.
- Relates to the System Performance Benchmarks proposal: a benchmark runner may arm a sampler before a workload and drain it after to produce a flamegraph, subject to the same audit and consent rules.
- Symbol resolution is offline in this phase (host-side
addr2line).
Phase 4 — Breakpoint, Single-Step, and Payload Capture (deferred)
Breakpoint and single-step authority has a much larger kernel surface than read-only snapshot and sampling. Payload capture risks exposing secrets. Both are deferred until the Phase 1–3 model is stable and the audit/consent infrastructure is proven.
When payload capture is added, it must:
- require a separately leased
PayloadCapturecap distinct from the baseRingTracecap; - be a separately audited grant;
- carry a per-call byte budget enforced by the kernel.
Hazard Preflight
paging/MMIO: cap-table snapshots and ring-trace records read kernel state under existing locks. No new user-mapping or MMIO surface is introduced in Phase 1–3.
ABI: DebugSession, CapTableSnapshot, and RingTrace are new
schema interfaces. Generated bindings must be refreshed via
make generated-code-check before merging any Phase 1 branch.
authority transfer via snapshot: the critical invariant is that no
CapSlotSnapshot entry can be used by the inspector to call or transfer
a capability. The kernel must enforce that the snapshot data path does
not return live cap references — only metadata fields (interface ID,
label, state). This must be verified in the Phase 1 implementation review.
audit bypass: an inspector must not be able to suppress or delay audit records for its own actions. Audit writes must occur synchronously within the debug session dispatch path, not deferred.
covert timing channel: a sampler that returns precise timestamps could be used to extract timing side-channel information about a target service. The sampler tick field is clamped to PIT-resolution granularity in Phase 3 to reduce precision; finer clock access for profiling remains deferred.
Security Boundaries
- A
DebugSessionholder can read snapshots of one target. It cannot call, transfer, or activate any capability belonging to the target. - A
RingTraceholder can read ring metadata for one target. Payload bytes require a separate stronger cap. - A
Samplerholder receives PC and bounded stack frames for one target. No memory-mapped content, no register state beyond PC. - None of these caps admit cross-process inspection. A
DebugSessionfor process A cannot observe process B. - A debugged process remains subject to normal scheduler and capability enforcement. Being debugged does not grant the target any additional capability slots or authority.
- Redaction applies at snapshot construction time, not at read time. The kernel constructs the redacted view; the inspector never sees the raw kernel state.
Non-Goals
- No ambient
ptrace-style process attach without authority. - No kernel debugger (GDB stub, JTAG) exposed as a userspace capability surface — those are operator boot-time tools, not capability-model components.
- No replay semantics. Ring trace is inspection, not record/replay. Replay requires payload retention, timer modeling, and capability checkpoints; that is out of scope.
- No cross-process or system-wide trace aggregation in this proposal.
Aggregate trace is a monitoring concern covered by
docs/proposals/system-monitoring-proposal.md. - No memory read/write through a debug session. Address-space inspection is a separate and stronger authority not proposed here.
- No
DebugSessionself-grant. A process cannot debug itself through this interface. - No crash/exception observation here. A read-only
ExceptionObservercap (the Zircontask_create_exception_channelanalog) for receiving crash notifications without debug-write authority is a separate, weaker authority owned by Crash Recovery and Supervision, not bundled intoDebugSession.
Relevant Research and Prior Art
In-Tree Notes
- Debug, Trace, and Profiling Authority
is the dedicated prior-art survey for this proposal: GDB remote serial
protocol, Linux
ptrace/Yama,perf/CAP_PERFMON, Fuchsia handle-scopeddebug_agent/zxdb, seL4 TCB-cap hardware debug, and Genode CPU-session GDB monitor, grounding theDebugSession/Sampler/exception-observer authority split against real sources. docs/research/zircon.mddocuments Fuchsia’s handle model: handles are process-local references with a rights bitmask, there is no ambient authority, and a process can only interact with kernel objects through handles it holds. capOS draws the directly applicable lesson here — aDebugSessionis a held capability, not an ambient privilege, and inspection of a target’s cap table is itself a distinct grantable authority rather than a side effect of holding a generic “debug” right. The note covers handle rights and transfer but not Fuchsia’sdebug_agent/zxdbdebugging service specifically; that service is now surveyed in the dedicated research note above (and summarized below).docs/research/sel4.mdrecords that seL4 has no in-kernel debug traps or thread-introspection mechanism in the verified configuration; debugging is pushed to userspace and the design constraints (typed authority, no ambient inspection) matter more than any debugger feature. capOS follows the same posture: keep the kernel surface to read-only snapshot and bounded capture, and route policy (who may attach, to what) to userspace consent and the broker.docs/research/genode.mddocuments Genode’s session-and-label model, where every cross-component request carries a label and is mediated by a parent component. The applicable lesson is that attach authority should flow through the same parent/supervisor relationship that already governs spawning — a supervisor that holds a child’sProcessHandleis the natural minter of aDebugSessionfor that child, mirroring Genode’s parent-mediated session routing rather than a global debugger service.docs/research/completion-ring-threading.mdgrounds the io_uring-style SQ/CQ ring transport that the Phase 2 ring trace observes. The trace records the same SQE/CQE structures already captured by the kerneldebug_tapfacility (RingCaptureRecordincapos_config::ring); this proposal adds the authority and consent layer that the existing build-gated emergency-serial capture lacks.
External Precedent
- GDB remote serial protocol (
gdbserver). GDB separates the debugger front-end from a target-side stub that exposes register, memory, and breakpoint operations over a serial/TCP channel. The lesson for capOS is that the inspection surface can be a narrow, well-defined protocol object rather than ambient access — but full register/memory read-write is exactly the strong authority capOS defers to Phase 4 and keeps out of the read-onlyDebugSession. - Linux
ptrace(2).ptraceis the canonical ambient-authority footgun: attach authority derives from Unix UID and the Yamaptrace_scopesysctl rather than from a held, transferable capability, and a successful attach grants register and full address-space read/write at once. This conflates “may observe” with “may control” and bypasses higher-level access controls. capOS rejects this directly —DebugSessionattach is owner-consented or broker-granted, audited, and read-only; observation and control are separate authorities. - Linux
perfand eBPF tracing. Sampled profiling and tracing on Linux sit behind privilege boundaries (perf_event_paranoid,CAP_PERFMON/CAP_BPF) precisely because PC/stack sampling and kernel-wide tracing leak timing and topology information across trust boundaries. capOS treats the same risk as a capability and an audit event: theSamplercap is scoped to one consented target, its timestamp resolution is clamped, and arming it is recorded. - Fuchsia
debug_agent/zxdb. Fuchsia’s debugger is a userspace service (debug_agent) that thezxdbfront-end drives; it operates on process and thread handles rather than ambient privilege, consistent with Zircon’s object-capability model. This is the closest external precedent for capOS’s intended shape — debugging as a handle/capability-mediated service, not a kernel-ambient right. A dedicated in-tree note on thedebug_agentdesign is research-needed per thedocs/backlog/research-design-gaps.mdconvention before the Phase 4 breakpoint/single-step surface is designed. - Object-capability systems generally. Capability systems avoid an
ambient
ptraceanalog because there is no global principal that implicitly dominates other processes; the authority to inspect must be granted like any other capability. This is the structural reason capOS can offer debugging without reintroducing ambient authority, and why the consent and audit requirements in this proposal are load-bearing rather than optional hardening.
Relevant Proposals
- System Monitoring (
system-monitoring-proposal.md): owns aggregate ring traces (TraceCapture), log/metric/audit signal taxonomy, and theRingTapmove-semantics note for payload-capturing taps. This proposal owns the per-process debug attach authority and consent model that monitoring’s trace surfaces do not cover.TraceRecordschema is shared; authority and consent model is separate. - Security and Verification (
security-and-verification-proposal.md): the trust-boundary inventory (Track S.7) must be updated to includeDebugSession,RingTrace,Sampler, andCapTableSnapshotas new boundaries before downstream services rely on them. - System Performance Benchmarks
(
system-performance-benchmarks-proposal.md): benchmark runners may arm aSamplerbefore a workload run; this proposal defines the authority and consent model for that use. - Task State and Agent Telemetry
(
task-state-and-agent-telemetry-proposal.md): agent maintenance sessions may useDebugSessionto inspect service state; telemetry records that fact.
Proposal: Durable Hardware Audit Log Persistence
How the HardwareAuditLog capability moves from a bounded volatile in-kernel
ring to durable, tamper-evident audit storage without claiming authority it
does not have.
Problem
HardwareAuditLog is the read-only observer over the four hardware authority
caps (DeviceMmio, Interrupt, DMAPool, DMABuffer). The kernel still emits
one cap-audit: line per lifecycle event and appends a copy into a fixed-size
volatile ring (capacity 64, drop-oldest). The userspace
hardware-audit-service now drains that ring into a Store-backed,
hash-chained segment ring recoverable through Store.list inventory, and
serves scoped HardwareAuditReader snapshots with self-describing persistence,
retention, subscriber-admission, keyed-seal, key-lifecycle,
physical-persistence, and runtime-admission metadata. The regular DDF audit
service smoke uses the RAM-backed StoreCap and keeps the IOMMU abort-held
DMAPool/DMABuffer evidence strict. The physical persistence proof manifest
grants persistent_store to the service and reuses one disk image across two
QEMU boots; pass 2 must recover and verify pass-1 audit segment blobs before
draining current-boot records. The smoke also stores and reads a separate
content-addressed marker as an independent Store-disk sanity check.
The current keyed mode uses a RAM-local RamSymmetricKey minted through the
development-only local DevelopmentSoftwareKeySource and seals each segment
header with HMAC-SHA256. The audit service never exports raw key material.
Snapshot
metadata reports the signing key identifier, generation, single-local-key
rotation status, and RAM-local revocation caveat so a verifier can distinguish
this local proof from external KeyVault custody.
The remaining gaps before a full production durability and audit-verifier claim are:
- External verifier key custody. The shipped keyed seal is local HMAC
evidence from a development-only deterministic key source. It is not yet a
production
KeyVault/KeySource-managed key with durable rotation and revocation enforcement. - Production media and rollback policy. The QEMU
persistent_storereboot proof demonstrates Store-backed survival across boot using theCAPOSST1disk format. Volume rollback resistance and cloud/hardware media assumptions remain the storage track’s responsibility. - Runtime subscribers are refused until a broker path exists. Manifest
scoped reader grants work.
HardwareAuditReaderruntime admission now fails closed with an explicit no-authority-broker status instead of silently implying support.
The local proof was implemented by
docs/tasks/done/2026-06-07/hardware-audit-physical-persistence-signing-local-proof.md.
This proposal selects the target design for those production extensions and records the boundaries of the Store-backed service that has landed.
Scope and Non-Claims
This proposal is deliberately narrow. It is observer-evidence design only.
- Audit persistence records authority events. It does not grant, gate, or imply authority. The authority checks stay in the device-manager and cap-object paths exactly where they are now.
- Durable audit is not IOMMU isolation. It does not bound DMA, validate MMIO ranges, or constrain interrupt routes. It records that those events happened.
- Durable audit is not provider-driver readiness. A persisted audit trail does not make a userspace driver production-ready; it makes the driver’s hardware-cap lifecycle reviewable.
- Tamper-evidence is detection, not prevention. A signed, hash-chained log proves history was not edited if verification passes; it cannot stop a privileged writer from refusing to append. Availability of the audit path is a separate concern.
- The durable path must not depend on volatile QEMU-only state, the
qemucargo feature proof rings, or local run telemetry. Those remain harness scaffolding.
Design Grounding
docs/tasks/done/2026-05-22/ddf-audit-cap-durable-persistence.md— acceptance criteria and hazard preflight this proposal answers.docs/proposals/cryptography-and-key-management-proposal.md—SymmetricKey(mac/verify),PrivateKey(sign),KeySource, andKeyVaultprimitives consumed for tamper-evidence and key lifecycle.docs/proposals/storage-and-naming-proposal.md— capability-nativeStore, append-onlyFile/ledger semantics, content hashing, previous-record hash chaining, and stale-write rules consumed for the durable ring.docs/proposals/system-monitoring-proposal.md— audit as a distinct append-only record type with its own readers and retention, X.740 audit field model, and “observation is authority” principle.docs/dma-isolation-design.mdanddocs/backlog/hardware-boot-storage.md— the device-driver foundation context the hardware authority caps live in.kernel/src/cap/hardware_audit.rs— the current volatile-ring behavior this design preserves and extends.
Design
1. Durable Audit-Record Ring
The durable audit path is a two-tier structure: the existing bounded
in-kernel volatile ring stays as a fast-path staging buffer, and a userspace
audit log service owns durable persistence behind the capability-native
Store interface.
flowchart LR
DM[Device manager and<br/>hardware cap objects] -->|emit_cap_audit| KR[Kernel volatile ring<br/>capacity 64, drop-oldest]
KR -->|drain cursor poll| ALS[Audit log service<br/>userspace]
ALS -->|append-only records| ST[(Store / append-only<br/>ledger segment)]
ALS -->|sealed segment digest| KV[KeyVault / KeySource]
ALS -->|scoped read window| SUB[Admitted subscribers]
Why a userspace service, not kernel-side disk I/O. Durable storage means a
block device, a filesystem-like layout, segment rotation, and signing. None of
that belongs in the kernel: the kernel’s job is dispatch and isolation. The
kernel keeps doing exactly what it does today — bounded, alloc-free,
lock-light ring emission — and a userspace audit log service drains it through
HardwareAuditLog.drain with a per-cap cursor. This also keeps the durable
path off QEMU-only
telemetry: the service persists through the Store interface. The current
bootstrap StoreCap is RAM-backed and therefore demonstrates the contract; a
real BlockDevice or cloud bridge adapter per the storage proposal is required
before this path claims post-reboot retention.
Drain protocol. The audit log service polls HardwareAuditLog.drain
with a monotonic expected_sequence cursor. Each successful drain returns the
window since
the last durably-committed sequence. The service:
- Reads the drained window and the
dropped_recordscounter. - Appends each record to the current segment (see rotation below).
- Advances its cursor to
next_sequenceonly after the segment write is durably committed (Storesync).
If the kernel ring drops records between polls (dropped_records advanced by
more than the records the service consumed), the service writes a gap
marker record into the durable log: { kind: gap, lost_count, observed_at }.
A gap is itself audit evidence — it is recorded, not hidden. The drop-oldest
behavior of the kernel ring is therefore preserved and made visible in the
durable log rather than silently lost.
Retention and rotation. The durable log is a sequence of fixed-size segments (proposed 1 MiB each; an implementation tuning parameter, not an ABI). When a segment fills:
- The service computes the segment digest (see tamper-evidence below).
- It seals the segment (digest + chain link recorded).
- It opens the next segment, whose first record carries the previous
segment’s digest as
prev_segment_digest.
Retention is count-bounded and age-bounded: keep at most N sealed
segments (proposed default 64) or segments newer than T (proposed default 30
days), whichever is smaller. The bound is a manifest-configurable policy on the
audit log service, not a kernel constant.
Overflow policy. Two distinct overflow points, two distinct policies:
- Kernel ring → service drain lag. Drop-oldest, as today, with a recorded gap marker. Rationale: the kernel ring must never block a hardware cap lifecycle path on a slow or absent consumer. Audit emission is best-effort by construction; the gap marker makes the loss auditable.
- Durable segment retention limit. Drop-oldest sealed segment, with a retention-eviction record appended to the active segment naming the evicted segment’s digest and sequence range. Rationale: an operator querying “what did we lose to retention” gets a definite answer, and the hash chain stays intact across the eviction (the eviction record links forward; the evicted segment’s digest is permanently recorded before deletion).
Backpressure is explicitly rejected for both points. Backpressuring a hardware authority cap on audit-storage latency would let a stalled disk wedge device lifecycle — an availability and correctness hazard far worse than a recorded gap. Audit is evidence over authority, never a gate on it.
Crash-recovery semantics. On audit log service restart:
- The service scans sealed segments oldest-to-newest, verifying each
segment digest and the
prev_segment_digestchain link. - It finds the last segment. If the last segment is unsealed, it replays its
records, recomputing the running digest; a torn final record (incomplete
write) is truncated at the last valid record boundary and a
recovery_truncationmarker is appended. - It re-derives the drain cursor from the highest durably-committed
sequenceand resumes polling the kernel ring from there.
Records lost in the window between the last durable commit and the crash are not recoverable — the kernel ring is volatile and a crash loses it. This is an explicit, accepted limitation: see Assumptions. The recovery markers make the boundary of trustworthy history explicit to any consumer.
2. Tamper-Evidence and Segment Seals
Tamper-evidence is a hash chain plus segment signing, consuming the cryptography/key-management proposal’s primitives. No new crypto is invented here.
Per-record chaining. Each durable audit record carries
prev_record_hash — a hash over the previous record’s canonical bytes. This is
exactly the append-only-ledger pattern the storage proposal already
prescribes (“append new records with previous-record hashes rather than
rewriting history”). Editing or reordering any record breaks every subsequent
prev_record_hash, so a verifier walking the chain detects the first
divergence.
Per-segment signing. The shipped service records per-segment digests and a
running chain head so retained-window tampering is detectable. The local keyed
proof seals each segment header with HMAC-SHA256 using a RAM-local symmetric key
cap minted by the development-only local key source. When a segment is sealed,
the audit log service computes the segment digest (a hash over the sealed record
range, anchored on the running chain hash) and produces a keyed seal over
{ segment_index, sequence_range, record_count, segment_digest, prev_segment_digest }. Production deployment should select one of these key
custody modes by manifest policy:
- MAC mode (default). A
SymmetricKeywithKeyPurpose.integrityproduces an HMAC tag over the segment header viaSymmetricKey.mac. Cheaper, no asymmetric key handling, sufficient when the verifier is trusted to hold the same key. Verification isSymmetricKey.verify. - Asymmetric mode. A sign-only
PrivateKeyproduces a signature viaPrivateKey.sign. Used when audit evidence must be verifiable by a consumer that should not be able to forge records (e.g. an external reviewer holding only the public key). Verification uses the correspondingPublicKey.verify.
The audit log service receives a signing-capable key cap (a SymmetricKey
restricted to mac, or a PrivateKey restricted to sign) at manifest grant
time. It never holds raw key material — the key is a capability object per the
key-management design. The current local proof follows the same no-raw-key
custody rule with a RamSymmetricKey minted by the development-only software
key source. That source deterministically remints the same non-extractable local
HMAC key from stable source metadata and an audit label for the reboot proof,
but it is still not production custody: there is no external root, rollback
resistance, rotation, or persistent revocation state.
What signs what. The chain hash protects record order and content within and across segments. The segment signature protects the segment header, binding the digest, sequence range, and previous-segment digest under a key. Together: a verifier with the verification key can confirm that the sealed segments form an unbroken, unedited chain back to the first segment, and that each seal was produced by the holder of the signing key.
Key lifecycle.
- Current local proof.
signing_key_id = "local-audit-hmac-v1"andsigning_key_generation = 1identify the development-key-source RAM-local HMAC key generation.key_rotation_status = "single-local-key-no-rotation"andkey_revocation_status = "ram-local-key-revocation-not-persistent"are explicit caveats, not production lifecycle controls. - Provenance. The signing key is produced by a
KeySourceand stored sealed in aKeyVault(per the key-management proposal). The manifest grants the audit log service a use capability for the key, not the vault. - Rotation. Keys rotate on a policy interval (proposed default 90 days) or
on demand. Rotation is segment-aligned: a segment is always signed by exactly
one key. The first segment after rotation records a
key_rotationmarker carrying the new key’s identifier (KeySource.infoidentifier — a label, not a secret) and the previous key’s identifier. A verifier follows the identifier sequence to know which key verifies which segment range. - Revocation. If a signing key is suspected compromised, it is revoked in
the
KeyVault. Revocation does not invalidate already-sealed segments — those remain verifiable against the (now-revoked) key, and the revocation itself is recorded as akey_revocationmarker. What revocation prevents is future seals with that key. A consumer treats segments signed by a revoked key as “authentic at seal time, key later revoked” — still evidence, with a documented caveat. - What is NOT protected. Tamper-evidence cannot protect records the kernel ring dropped before the service drained them, cannot protect the crash-window records, and cannot prevent an attacker who holds the live signing key from forging new well-formed history going forward. It detects edits to already-sealed history. These limits are stated in Assumptions.
3. Production Subscriber Admission Policy
Today exactly one manifest-granted reader gets a volatile snapshot. The production model keeps “observation is authority” but adds structure.
Reader caps are typed and scoped. The audit log service exposes readers as distinct capability objects, not a single shared snapshot method:
HardwareAuditReader— a read-only cap over a scoped window: a subscriber may be granted the full history, a single hardware-cap-tag slice (e.g.DMAPoolevents only), or a bounded recent window. Narrowing is structural — a narrower reader is a wrapper cap exposing less, per the capOS capability-model principle, not a rights bitmask.- The cap exposes
snapshot(cursor-based, preserving the existing field model) andverify(returns segment-chain verification status so a subscriber can confirm tamper-evidence without holding the signing key, when the deployment uses asymmetric mode and grants the public verification key).
Admission is manifest-declared, with a runtime broker path. Two tiers:
- Manifest-declared subscribers. The boot manifest declares which services receive which scoped reader caps, exactly like every other capability grant. This is the baseline and covers the monitoring/audit service itself.
- Runtime-admitted subscribers. A later phase may route audit-reader
requests through the userspace authority broker
(
docs/proposals/userspace-authority-broker-proposal.md), so an operator session can be granted a scoped, time-bounded reader without a reboot. This is explicitly future work, gated on the broker. The shipped reader endpoint exposes a runtime-admission method that refuses withInvalidArgumentand reportsruntime_admission_policy = "runtime-reader-admission-refused-no-authority-broker", so callers get a fail-closed status instead of an implied grant.
Revocation. Reader caps are ordinary caps and are revoked the ordinary way (cap-table teardown). Revoking a reader does not touch the durable log.
4. Preservation of Existing Volatile-Snapshot Behavior
The kernel-side volatile ring and its snapshot ABI are preserved unchanged as the staging tier:
- The bounded ring (capacity 64),
head/len/next_sequence/dropped_recordsbookkeeping, and drop-oldest admission stay exactly as inkernel/src/cap/hardware_audit.rs. - The snapshot cursor (
start_sequence), truncation labels (no-records-requested,request-limited,snapshot-limit-limited,available-records-exhausted), and thedropped_recordscounter stay available to directHardwareAuditLog.snapshotobservers. - The durable service path uses
HardwareAuditLog.drain(expected_sequence, max_records)as its per-cap cursor protocol. A cursor mismatch still fails closed; a cursor-verified overflow reanchors at the retained window and reports the advanceddropped_recordscounter so the service can record a visible gap. - The QEMU-only proof rings and
prove_qemu_snapshot_truncation_contractremain harness scaffolding and are not on the durable path. - The
HardwareAuditReader.snapshotresult’s self-describing status fields stay, and their values advance as the durable path lands. The Store-backed service reportspersistence_status = "store-backed-segment-ring",signature_status = "hash-chain-plus-local-hmac-segment-seals",keyed_seal_countgreater than or equal to the retained sealed segment count,signing_key_id = "local-audit-hmac-v1",key_rotation_status = "single-local-key-no-rotation",key_revocation_status = "ram-local-key-revocation-not-persistent",physical_persistence_status = "store-cap-backing-manifest-selected",subscriber_admission_status = "manifest-admission-active-runtime-broker-refused", andruntime_admission_policy = "runtime-reader-admission-refused-no-authority-broker". Changing those field values is an ABI-adjacent change and must land with schema, generated bindings, runtime decode, demos, and smoke assertions in one branch, per the task hazard preflight.
No focused hardware-audit smoke is invalidated by this design: the kernel-side behavior they assert is unchanged. New durable-path behavior gets new smokes (see Evidence Expectations in the task file).
5. Assumptions
The durable evidence is trustworthy only under stated assumptions. A consumer must know these before trusting the log.
- Crash window is lossy. Records in the kernel volatile ring that were not yet durably committed by the audit log service are lost on a crash or power loss. The durable log’s recovery markers bound trustworthy history; they do not recover the lost window. Audit is best-effort at the volatile staging tier by design — it must never block hardware cap lifecycle.
- Rollback below the audit log is out of scope. This design assumes the
Store/BlockDevicebeneath the audit log service does not silently roll back committed segments. If the underlying storage can roll back (e.g. a snapshot-restore of the whole volume), the hash chain detects the resulting gap on next verification, but the design does not prevent it. Volume-level rollback protection is the volume-encryption/storage proposals’ concern. - Rotation is segment-aligned and monotonic. A production segment is signed
by exactly one key. Key identifiers in
key_rotationmarkers are assumed monotonic and unique so a verifier can deterministically map segment ranges to keys. - Key lifecycle is delegated. Key generation, sealing, rotation scheduling,
and revocation are the
KeySource/KeyVaultservices’ responsibility. This proposal assumes those primitives behave as the key-management proposal specifies; it does not re-implement them. The landed local HMAC proof uses a development-only deterministic source and states its lack of production rotation/revocation in reader-visible metadata. - Signing key compromise forges the future, not the past. An attacker holding the live signing key can produce well-formed new records. The hash chain plus revocation marker make the compromise boundary detectable once revocation is recorded, but records sealed during the compromise window are only as trustworthy as the key was. Asymmetric mode narrows this: a verifier holding only the public key cannot itself forge, but a compromised private key still can until revoked.
- The audit log service is trusted to append. Tamper-evidence detects edits to sealed history. It does not prevent the audit log service from refusing to append, stalling, or being killed. Availability of the audit path — restart policy, health checks — is the service-architecture and monitoring proposals’ concern, not this one.
Relationship to Other Proposals
- Cryptography and Key Management — this proposal consumes
SymmetricKey.mac/verify,PrivateKey.sign,KeySource, andKeyVault. It adds no cryptographic primitive. - Storage and Naming — the durable ring is an append-only ledger on the
capability-native
Store, using the previous-record-hash chaining the storage proposal already prescribes. - System Monitoring — the audit log service is the hardware-cap-specific
producer feeding the broader audit-record model in the monitoring proposal;
scoped
HardwareAuditReadercaps follow the monitoring proposal’s “observation is authority” and per-record-type retention principles. - Device Driver Foundation — this design records hardware authority cap lifecycle events. It does not change where authority is checked, and does not claim provider-driver readiness or IOMMU isolation.
Open Questions
- Segment size, retention counts, and rotation interval are proposed defaults,
not ABI. The focused smoke currently retains eight sealed segments so
boot-time abort-held DMA records remain inside the proof window; production
defaults still need a tuning pass once a real
BlockDevicebackend exists. - Whether the
verifymethod onHardwareAuditReadershould return a full chain proof or a bounded status summary depends on the first real consumer’s needs and is deferred to implementation. - Cloud-bridge-backed
Storefor the durable log inherits the storage proposal’s stale-write and size-bound rules; whether audit segments should also be content-addressed objects in that backend is left to the storage track.
Proposal: System Performance Benchmarks
How capOS should benchmark system performance against other operating systems without producing misleading numbers, rewarding special-case optimizations, or treating speed as a substitute for correct capability behavior.
Problem
capOS already has smoke tests, QEMU boot proofs, ring-tap debugging, and a
measure feature for focused cycle measurements. Those are necessary, but they
do not answer the product-level question: can capOS remain effective on common
workloads while preserving its capability model?
Generic OS benchmark suites are useful but dangerous in this project. Most assume POSIX process, file, pipe, socket, and shell semantics. capOS should not fake broad ambient Unix authority just to run a familiar benchmark. It also should not compare a capability-native path against Linux, FreeBSD, or a microkernel by publishing a single blended score that hides unsupported semantics, incorrect outputs, or different isolation boundaries.
The benchmark system needs to produce three kinds of evidence:
- Primitive cost: capability calls, IPC, scheduling, park waits, VM changes, process creation, memory copy, and later device I/O.
- Common workload adequacy: database, compression, build, network, storage, shell/session, service graph, and runtime workloads that users recognize.
- Correctness under load: workload outputs, service boundaries, capability denial paths, and data integrity must remain correct while performance is measured.
Current State
Implemented measurement and comparison hooks:
make run-measurebuilds a separate measurement kernel feature and bootssystem-measure.cue.kernel/src/measure.rsrecords benchmark-only dispatch counters and cycle segments for ring processing, SQE validation, cap lookup, Cap’n Proto encode/decode, method body dispatch, CQE posting, and waiter wake checks.- The measurement manifest grants
ring-nopa measurement-onlyNullCapandParkBenchcapability throughProcessSpawner. demos/ring-nopmeasuresCAP_OP_NOP, empty and smallNullCapcalls, and compact-versus-generic park-shaped operations.demos/thread-lifecyclemeasures privateParkSpacefailed wait, empty wake, wait-to-block, wake-to-runnable, and wake-to-resume paths.make run-smp-process-scaleboots a focused SMP proof manifest under QEMU/KVM, runs 1/2/4 independent prime-counting worker-process cases, verifies aggregate prime count and checksum, records raw serial logs and CSV rows undertarget/smp-process-scale/, and enforces the completed milestone’s default five-run1.6xmedian 1-to-2 speedup threshold when KVM evidence is available.tools/linux-smp-process-scale-baseline.shbuilds a tiny Linux initramfs and runs the same forked prime-counting workload under the same QEMU/KVM CPU and memory envelope for reference-OS comparison. Its split defaults must stay in sync with the capOS SMP process-scale workload before a comparison table is published.make run-linux-thread-scale-baselineruns the in-process thread-scale fixed-size checksum workload as a native Linux pthread baseline, recording worker-window and total pthread timings plus compact-versus-padded result slot diagnostics undertarget/linux-thread-scale/.make run-smoke,make run-spawn,make run-net, and focused service smokes provide correctness and user-visible behavior proofs, but they do not yet emit structured performance results.
Planned CPU-scaling profiles should prefer uniform fixed-size chunk work, such as parallel hashing/checksum over disjoint buffers, when the claim is near-linear scheduler/runtime scaling. Prime counting is retained as historical multi-process evidence, but its trial-division cost requires tuned partitioning and is a weaker default for same-process thread scaling.
The next full-SMP CPU profile should not use nested QEMU as its primary performance source. QEMU/KVM remains useful for boot, CI, and virtualization comparisons, but a 16/32-core scheduler result needs direct capOS execution on a dedicated perf runner or bare-metal/cloud-bare-metal machine, with native Linux baselines on the same hardware. The report should include 1, 2, 4, 8, 16, and 32-worker rows where hardware exists, and separate SMT rows from physical-core rows.
That is enough for local dispatch decisions. It is not enough for comparing capOS with Linux, FreeBSD, seL4-based systems, Genode scenarios, or other OS baselines on common workloads.
Design Principles
- Correctness gates first. A benchmark result is publishable only when the workload’s output verifier passes and capOS-specific authority checks still hold.
- No semantic laundering. Unsupported POSIX features are reported as unsupported or not applicable, not silently emulated through broad authority.
- Benchmark artifacts are not normal metrics. Always-on monitoring may expose low-cost counters. Benchmark logs, raw samples, host configuration, and per-run outputs are retained as explicit benchmark artifacts.
- Compare like mechanisms where possible. Compare capOS capability IPC to Linux pipes, Unix domain sockets, io_uring, or futexes only when the semantic differences are declared in the result.
- Use common suites as references, not design masters. lmbench, UnixBench, fio, iperf3, SQLite speedtest, Phoronix/OpenBenchmarking profiles, and SPEC CPU are valuable precedent. capOS should adopt their methodology where it fits and reject assumptions that would distort capOS.
- Publish raw context. Results include kernel commit, manifest, QEMU command, CPU model, host OS, compiler, build flags, feature flags, warmup, run count, and raw logs.
- Separate hosted and native comparisons. Early capOS runs in QEMU. Compare against Linux/FreeBSD guests under the same QEMU/KVM envelope, and separately against native host OS runs when the question is absolute hardware performance.
- Regression gates are narrower than claims. CI gates should catch local regressions in stable paths. Public OS comparisons need controlled machines, repeated runs, and manual review.
- Security posture is part of the result. A fast result that requires a broader cap bundle, disabled validation, payload tracing, or a special kernel build must be labeled as such.
- No single score. capOS should publish a matrix of workload results and ratios, not an aggregate score that implies all workloads matter equally.
Benchmark Tiers
Tier 0: Existing Correctness Smokes
Tier 0 is not a performance suite. It is the mandatory correctness floor:
- default boot/login/shell smoke;
- focused spawn, shell, terminal, credential, login, chat, adventure, revocable-read, memory-object, ringtap, networking, and measurement smokes;
- host tests for config, ring Loom, capos-lib, mkmanifest, generated code, and runtime surface checks.
No performance result should be retained when the relevant Tier 0 proof fails.
Tier 1: capOS-Native Primitive Benchmarks
These benchmarks measure the cost of capOS mechanisms directly:
| Area | Initial measurements | Correctness condition |
|---|---|---|
| Ring transport | CAP_OP_NOP, empty NullCap, small payload NullCap, CQE post | expected CQE result, no overflow, bounded dropped count |
| Cap dispatch | cap lookup, generation rejection, revoked cap rejection, invalid method | correct CAP_ERR_* or CapException |
| IPC | endpoint CALL/RECV/RETURN round trip, direct handoff, transfer copy/move | reply payload and transferred-cap identity match oracle |
| Park/threading | failed wait, timeout, wake-one, wake-many, wake-to-resume | waiter count and join status match oracle |
| Scheduler | context switch latency, timer wake latency, direct IPC handoff latency | no runnable-thread loss or unexpected starvation |
| Process lifecycle | spawn, ELF load, wait, failed spawn rejection | child output and exit code match manifest oracle |
| VM/memory | map/protect/unmap, MemoryObject map, frame allocation/free | data visibility, W^X, quota, and cleanup checks pass |
| Terminal/session | readLine/write latency and throughput under foreground ownership | echo/cancellation/stale-input checks pass |
These are capOS results first. Linux or FreeBSD baselines can use matching
native mechanisms, but the report must describe the mapping. For example, a
capOS endpoint IPC round trip can be compared with Linux pipe, Unix-domain
socket, eventfd, or futex ping-pong results, but none is a perfect semantic
match.
Tier 2: Translated OS Microbenchmarks
lmbench and UnixBench are useful because they isolate OS primitives such as system-call overhead, process creation, context switching, pipes, networking, and filesystem reads. They are also Unix-shaped.
capOS should implement a capos-osbench harness that translates the benchmark
intent into capability-native operations:
fork/exec/waitintent becomesProcessSpawner.spawnplusProcessHandle.wait.pipethroughput/context switching becomes Endpoint or a future byte-stream or socket capability round trip, labeled by transport.getpidsyscall overhead becomes a minimal kernel fact cap orCAP_OP_NOP, labeled as “capOS ring entry” rather than “POSIX syscall”.- file reread and mmap benchmarks remain unsupported until Store/Namespace and file-backed mappings exist.
- networking tests map to
TcpSocket/TcpListeneronce the Telnet and socket capability work lands.
The translated suite must emit not_applicable for missing capability
subsystems instead of adding compatibility shims that change the OS being
measured.
Tier 3: Portable Common Workloads
These benchmarks answer whether capOS is useful on recognizable work:
| Workload | Candidate benchmark | capOS prerequisite | Result verifier |
|---|---|---|---|
| SQLite database | SQLite speedtest1, optionally via a Phoronix profile on reference OSes | C runtime or native port, Store/Namespace or RAM-backed DB | SQLite exit status, optional SQL result checksum |
| OLTP database | TPC-C/TPC-E-inspired profile, not an official TPC result until disclosure and durability rules are met | durable Store/block I/O, SQL/database stack, transaction integrity, terminal/client driver model | committed transaction counts, invariant checks, ACID/error-injection proof |
| Decision-support database | TPC-H/TPC-DS-inspired profile at declared scale factors, not an official TPC result until rules are met | SQL/query engine, bulk data load, durable or explicitly memory-backed storage, query result verifier | query answer hashes, load status, scale factor, refresh/query stream status |
| Key-value serving | YCSB-style read/update/scan/insert mixes | Store/Namespace, KV service, stable client driver | operation counts, latency distribution, value/hash verifier |
| Storage engine | RocksDB/LevelDB db_bench-style fill/read/overwrite/seek profiles | file/store semantics, fsync/sync policy, storage engine port | key/value integrity, database reopen, configured write durability |
| Compression | xz, zstd, or small native compressor corpus | C/Rust userspace runtime and file/store access | compressed output hash and decompression hash |
| Build/developer workload | small Rust/C package build, later IX package build | process spawning, Store/Namespace, toolchain support | output artifact hash and build log status |
| Network throughput | iperf3-equivalent TCP stream and request/response latency | TcpSocket, network harness | byte count, JSON/structured summary, peer checksum |
| Storage I/O | fio-equivalent sequential/random read/write, verify mode | block device, Store/Namespace, direct I/O policy | fio-style verify/checksum result |
| File service | SPECstorage-inspired workload profile | network filesystem or capOS file-service equivalent, durable storage, client load generation | throughput, response time, data integrity |
| Java/server runtime | SPECjbb 2015 or Renaissance-inspired profiles | JVM or Java compatibility profile, timers, threads, networking/storage as needed | benchmark verifier and SLA/throughput summary |
| HTTP service | wrk-style request load against a capOS HTTP service | TCP, HTTP service, stable response corpus | response checksum/status mix, latency distribution, error rate |
| Cloud services | CloudSuite-inspired data caching/serving/search/web profiles | multi-service graph, storage/network/runtime support | workload-specific answer checks and service SLOs |
| Microservices | DeathStarBench/TailBench-inspired tail-latency profiles | service graph, network or local RPC, load generator, tracing/status caps | request correctness, p95/p99 latency, no unauthorized cap exposure |
| ML storage | MLPerf Storage-inspired data feeding profile | high-throughput storage path, dataset loader, accelerator or simulated training reader | records/images delivered, latency/throughput, data checksum |
| ML inference/training | MLPerf-inspired inference/training profile | model runtime, accelerator/GPU capability or CPU baseline, dataset and accuracy harness | accuracy/quality target plus throughput or time-to-train |
| Shell/session | boot-to-shell, Telnet shell, command launch latency | current shell plus terminal/socket path | transcript oracle and authority denial checks |
| Service graph | chat/adventure/resident service load | shared-service demos | scripted transcript and service identity checks |
| Runtime/library | Go/Lua/Wasm micro and app kernels | relevant runtime proposal milestones | language-level test suite or checksum oracle |
Early capOS should start with RAM-backed variants where storage is not ready, but those results must be labeled as memory-backed. A RAM-backed database result does not compare to a Linux disk-backed SQLite result.
Industry benchmark families belong later than SQLite speedtest and simple compression/build profiles. TPC-C/TPC-E and TPC-H/TPC-DS are database-system references with strict workload, disclosure, pricing, and correctness expectations. SPEC, MLPerf, CloudSuite, TailBench, and DeathStarBench bring similar setup and disclosure obligations in their domains. capOS can use inspired profiles to exercise the same workload classes before it can make official or directly comparable claims, but reports must label them as such and state which upstream rules are not yet satisfied.
Tier 4: User-Story Benchmarks
User-story benchmarks measure complete workflows that a person, operator, or service owner would recognize. They are intentionally broader than a single primitive or portable benchmark profile, and they should be described by the user outcome they prove rather than by the current demo implementation.
Initial user stories:
| Story | Example capOS proof | Result verifier |
|---|---|---|
| Start a local session | boot to an interactive shell or terminal prompt | transcript reaches ready prompt with expected cap bundle |
| Authenticate and receive authority | anonymous session upgrades to an operator/session profile | wrong credential denied, right credential grants exact profile |
| Run a delegated task | launch a child process with a narrow cap bundle | child output, exit code, and denied extra authority match oracle |
| Use a remote terminal | host-local TCP terminal reaches the same shell/session model | connect, authenticate, run command, clean disconnect |
| Use a resident service | client talks to a long-running service through scoped authority | request/reply transcript and service-visible identity match oracle |
| Serve a network request | network-facing service handles requests while local work continues | response checksum, latency, and no unauthorized cap exposure |
| Complete a developer workflow | build or transform an artifact from declared inputs | output hash, logs, and resource profile match declared policy |
| Recover from expected failure | service fault, rejected grant, timeout, or restart path | failure is bounded, audited, and visible through status |
User-story results report latency distribution, success rate, resource usage, and authority outcome. They are the closest evidence for “effective on common workloads,” but they are not substitutes for primitive measurements when a regression appears.
Reference Operating Systems
Initial comparisons should use these environments:
| Reference | Why include it | Caveat |
|---|---|---|
| Linux guest under same QEMU/KVM flags | Stable baseline with broad benchmark support | Linux has mature drivers, filesystems, VM, scheduler, and libc |
| FreeBSD guest under same QEMU/KVM flags | Second mature Unix-like baseline, useful for POSIX-independent signal | Not every benchmark profile has equal FreeBSD support |
| Linux native host | Shows absolute host hardware ceiling | Not directly comparable to capOS-in-QEMU latency |
| seL4 or Genode reports/scenarios | Prior art for capability/microkernel IPC and service decomposition | Often not the same hardware, workload, or application stack |
The default published table should show capOS versus Linux guest first. Native host and external microkernel data belong in separate context columns, not the primary ratio.
Correctness Model
Every benchmark definition carries:
- expected input corpus hash;
- command or manifest used to run the workload;
- output verifier;
- allowed nondeterminism, such as timestamps or generated IDs;
- capOS authority profile;
- unsupported-feature policy;
- result parser version.
A result is invalid when:
- the output verifier fails;
- QEMU exits abnormally;
- the kernel panics or reports an unexpected fault;
- the benchmark had to grant broader authority than its declared profile;
- host logs show dropped records that invalidate the measurement;
- the run used a special fast path not available in the declared configuration;
- the reference OS result used a materially different workload size or dataset.
Correctness should be stored alongside the performance value. A fast failed run is not a slow successful run; it is no result.
Measurement Method
Controlled runs should use:
- fixed capOS commit, reference OS image hash, benchmark source hash, compiler version, and toolchain flags;
- fixed QEMU version, machine type, CPU model, memory size, SMP count, KVM/TCG mode, disk image type, and network backend;
- for direct-hardware SMP runs, fixed machine identity, firmware version, APIC mode, CPU topology, SMT state, frequency governor or fixed-frequency policy, isolation policy, memory size, storage/network devices when relevant, and bare-metal versus cloud-bare-metal provider details;
- warmup runs for workloads with caches, JITs, connection setup, or first-use allocation;
- at least 5 measured runs for primitive and user-story benchmarks, more when coefficient of variation is high;
- median, min, max, standard deviation, and p95/p99 for latency where sample count supports it;
- raw logs retained for the benchmark artifact;
- no performance claim from one isolated run unless explicitly labeled as a smoke measurement.
Kernel-internal cycle-counter measurements remain inside cfg(feature = "measure") and are used for relative path decisions. Focused benchmark demos
may use user-mode cycle counters when the result is explicitly labeled and the
workload remains correctness-gated; run-smp-process-scale uses a scaled
worker-side cycle count because the 100 Hz timer tick is too coarse for the
selected speedup gate. Wall-clock user-story and workload comparisons use
host-side timestamps around QEMU transcripts or in-guest monotonic timers when
the timer contract is adequate.
Result Schema
The benchmark harness should emit a structured artifact, not a free-form log:
enum BenchmarkStatus {
passed @0;
failed @1;
unsupported @2;
invalid @3;
}
struct BenchmarkResult {
runId @0 :Text;
benchmarkName @1 :Text;
tier @2 :UInt16;
status @3 :BenchmarkStatus;
correctnessId @4 :Text;
configHash @5 :Data;
artifactHash @6 :Data;
notes @7 :Text;
result :union {
measurement @8 :MeasurementSummary;
failure @9 :RunFailure;
unsupported @10 :RunFailure;
invalid @11 :RunFailure;
}
}
struct MeasurementSummary {
unit @0 :Text;
lowerIsBetter @1 :Bool;
median @2 :Float64;
p95 @3 :Float64;
samples @4 :List(Float64);
}
struct RunFailure {
reason @0 :Text;
detail @1 :Text;
}
This schema is conceptual. It should not be added to schema/capos.capnp until
a concrete benchmark-runner service exists. The important property is that
measurement values exist only in the passed/publishable branch; failed,
unsupported, and invalid runs carry reasons instead of zero-valued scalar
defaults. Before that, host scripts can emit JSON with the same shape.
Integration With System Monitoring
System Monitoring should expose operational state; the benchmark system should store explicit run artifacts. The overlap is narrow:
- benchmark runs may read scoped
MetricsReader,SystemStatus,RingStats,SchedStats, and later device stats before and after a run; - benchmark summaries may be imported into a metrics service as low-cardinality
gauges such as
benchmark.last_median_ms, keyed by benchmark name and profile, after validation; - raw samples, transcripts, QEMU logs, host environment, and correctness
evidence belong in a
BenchmarkStoreor CI artifact store, not in always-on metrics; - starting a privileged benchmark profile is an auditable event because it may require measurement-only caps, debug taps, or broad status readers;
- benchmark readers should receive scoped read-only caps, not global monitoring roots.
The existing system-monitoring-proposal.md boundary remains correct:
cycle-counter instrumentation stays behind measure, while cheap counters can
later graduate into narrow stats caps.
External Grounding
Relevant local design grounding:
docs/build-run-test.mddocs/status.mddocs/proposals/system-monitoring-proposal.mddocs/architecture/capability-ring.mddocs/architecture/park.mddocs/architecture/scheduling.mddocs/research/sel4.mddocs/research/zircon.mddocs/research/genode.mddocs/research/out-of-kernel-scheduling.md
External sources checked:
- USENIX lmbench paper page:
https://www.usenix.org/conference/usenix-1996-annual-technical-conference/lmbench-portable-tools-performance-analysis - fio documentation:
https://fio.readthedocs.io/en/master/fio_doc.html - iperf3 documentation:
https://software.es.net/iperf/ - SPEC CPU 2017 overview and run rules:
https://www.spec.org/osg/cpu2017/andhttps://www.spec.org/cpu2017/Docs/runrules.html - Byte UnixBench repository:
https://github.com/kdlucas/byte-unixbench - SQLite testing documentation and OpenBenchmarking SQLite speedtest profile:
https://www.sqlite.org/testing.htmlandhttps://openbenchmarking.org/test/pts/sqlite-speedtest - TPC benchmark overview, TPC-C, TPC-H, and TPC-DS descriptions:
https://www.tpc.org/information/benchmarks5.asp,https://www.tpc.org/tpcc/default5.asp,https://www.tpc.org/tpch/default5.asp, andhttps://www.tpc.org/tpcds/ - YCSB and storage-engine benchmark references:
https://hse-project.github.io/apps/ycsb/,https://github.com/facebook/rocksdb/wiki/Benchmarking-tools, andhttps://github.com/google/leveldb - SPECjbb 2015, Renaissance, and HTTP service benchmark references:
https://www.spec.org/jbb2015/,https://renaissance.dev/, andhttps://github.com/wg/wrk - Cloud/service benchmark references:
https://github.com/parsa-epfl/cloudsuite,https://github.com/delimitrou/DeathStarBench, andhttps://tailbench.csail.mit.edu/ - Storage and ML benchmark references:
https://www.spec.org/storage2020/,https://mlcommons.org/working-groups/benchmarks/storage/,https://mlcommons.org/benchmarks/training/, andhttps://docs.mlcommons.org/inference/index_gh/ - OpenBenchmarking test-suite/profile descriptions:
https://openbenchmarking.org/suites/andhttps://openbenchmarking.org/tests
The relevant lessons are straightforward:
- lmbench isolates OS primitives from larger application behavior and was explicitly used to compare system implementations.
- fio and iperf3 provide flexible, parameterized I/O and network workload models with machine-readable output and verification options.
- SPEC CPU’s run rules show why disclosure, correct output, and configuration control matter when publishing comparative results.
- UnixBench is useful as a historical system benchmark, but its own workload descriptions reveal Unix assumptions that capOS must translate carefully.
- SQLite speedtest is a recognizable application workload with broad public baseline data, but database benchmarking must distinguish RAM-backed and storage-backed results.
- TPC-C/TPC-E and TPC-H/TPC-DS are the right industry references for later OLTP and decision-support database claims, but capOS should treat early runs as TPC-inspired unless it can satisfy the relevant TPC rules and disclosure requirements.
- YCSB and
db_benchare useful earlier data-system pressure tests because they can exercise key-value, read/write mix, and storage-engine behavior before capOS has a full SQL system. - SPECjbb and Renaissance become relevant only when a Java profile exists; until then they are runtime targets, not near-term OS benchmarks.
- CloudSuite, DeathStarBench, and TailBench are good references for cloud, microservice, and tail-latency user stories, but they require a mature service graph, load generation, and workload-specific correctness checks.
- SPECstorage and MLPerf Storage are later storage references once capOS has durable storage and enough client/load infrastructure to avoid misleading fio-only claims.
- MLPerf inference/training is relevant only after model runtimes and accelerator or CPU-baseline execution are credible, and any result must carry the benchmark’s accuracy or quality target rather than only throughput.
- OpenBenchmarking/Phoronix-style test profiles are useful precedent for packaging benchmark definitions separately from result storage.
Implementation Plan
-
Structured parser for current
run-measure. Add a host parser that converts existingmeasure:and demo output lines into JSON artifacts with config hash, raw log path, and verifier status. -
Primitive benchmark manifest set. Split ring, park, IPC, process, VM, and scheduler benchmarks into focused manifests so each can be repeated independently without running unrelated demos.
-
Reference guest harness. Add Linux guest scripts that run equivalent primitive tests under the same QEMU/KVM settings. Keep these scripts outside the capOS boot image. Partially done for the SMP process-scale proof through
tools/linux-smp-process-scale-baseline.sh; future benchmark profiles need their own reference guest harnesses or explicit unsupported status. -
Translated OS microbench suite. Implement
capos-osbenchfor the subset of lmbench/UnixBench intents that capOS can represent honestly. Emit unsupported results for missing Store, file, mmap, and socket primitives until those subsystems exist. -
Common workload pilots. Start with workloads that can be made deterministic early: compression, SQLite speedtest against RAM-backed storage once Store exists, shell/session latency, and remote-terminal user-story latency after the current milestone.
-
Network and storage workloads. Add iperf3/fio-equivalent profiles only after socket and block/storage capabilities exist. Use verification modes for write workloads.
-
Benchmark store and monitoring bridge. Add a
BenchmarkStoreservice or CI artifact convention. Import only validated summary values into monitoring metrics, and audit privileged benchmark starts. -
Regression gates. Add narrow CI thresholds for stable primitive paths. Use review-only warnings for noisy or hardware-dependent workloads until enough history exists.
-
Cloud-VM rerun profile. After the first real cloud-VM boot path exists, rerun the benchmark profiles that are valid for the booted hardware surface. At minimum, retain separate cloud evidence for boot/session smokes and CPU-only profiles such as
run-smp-process-scaleand laterrun-thread-scale, recording provider, region, instance type, cloud image id, firmware/device model, CPU topology, SMT state, QEMU pinning/isolation policy, nested-KVM availability, and serial-console collection method. Cloud results are separate environments; they do not replace local QEMU/KVM proof gates unless a milestone explicitly changes that gate. A shape liken2-highcpu-8is credible for 1/2/4-vCPU CPU-only profiles if/dev/kvmis available to the benchmark user and the run records the exact CPU platform and topology. -
Full-SMP hardware profile. Add a profile for direct 16/32-core scheduler evidence. It should reuse the parallel-pattern plan rather than inventing one checksum-only workload: static map/reduce, dynamic task pool, barrier phase loop, independent processes, same-process threads, and one service/capability-call workload. The artifact should report work-window and total-time medians, variance, verifier output, speedup, efficiency, scheduler counters, and matching native Linux rows on the same hardware. QEMU rows may accompany the report only as separate virtualization or regression context.
Reporting Format
Published reports should include:
- executive table with benchmark, status, unit, capOS median, Linux guest median, ratio, and notes;
- separate sections for primitive, common workload, and user-story results;
- correctness summary with failed/unsupported/invalid runs;
- configuration appendix with hashes and QEMU commands;
- raw artifact links;
- explicit warning for benchmark-only builds, debug tap runs, or special caps.
Do not publish a capOS “system score.” The useful output is a workload matrix with enough context to explain the result.
Non-Goals
- No POSIX compatibility layer purely to run Unix benchmarks.
- No public comparison that treats unsupported workloads as zero performance.
- No single aggregate score.
- No benchmark-only fast paths in normal dispatch builds.
- No always-on cycle-counter tracing.
- No network result publication before the network path has correctness and authority proofs.
- No storage result publication before write verification and crash/error semantics are defined.
Open Questions
- Which Linux primitive baselines should be first-class: pipe, Unix socket, futex, eventfd, io_uring, or all of them?
- Should the benchmark store be a capOS service, a host CI artifact convention, or both?
- What variance threshold should turn a benchmark from a CI gate into a review-only signal?
- How should reference OS images be pinned and distributed without bloating the repository?
- Which cloud provider and instance shape should be the first benchmark rerun
target after capOS boots outside local QEMU/KVM? A GCE
n2-highcpu-8host is a plausible first nested-KVM target for CPU-only profiles, but the final choice should follow the first cloud boot path that can expose/dev/kvmand usable serial-console artifacts. - What is the earliest honest SQLite storage profile: RAM-only, MemoryObject backed, Store-backed, or block-backed?
- Should benchmark definitions be modeled as manifest fragments, host-side YAML/JSON, or capOS service objects?
Proposal: HPC Parallel Processing Patterns
capOS should grow from focused SMP/threading speedup proofs into a correctness-gated suite of generic parallel processing patterns. The suite should cover the single-node and multi-node algorithm shapes commonly used in HPC without pretending that capOS already supports MPI, POSIX files, shared memory libraries, or cluster networking.
This proposal extends System Performance Benchmarks. It also defines the workload matrix that the future full-SMP scalability milestone tracked in Scheduler Evolution Phase F.5 should use when capOS is ready for 16/32-core evidence on top of the SMP substrate and the in-process threading contract. The old single checksum workload remains useful as one static map/reduce row, but it is too narrow to stand in for broad multicore behavior.
Design Grounding
Local grounding:
- Benchmarks
- System Performance Benchmarks
- SMP Phase C
- SMP
- Ring v2 For Full SMP
- Scheduler Evolution, in particular the full-SMP scalability focus that names Phase F.5 as the 16/32-core milestone this suite feeds
- In-Process Threading
- HPC Parallel Patterns
External grounding summarized in the research note covers Berkeley dwarfs, NAS Parallel Benchmarks, HPL/LINPACK, HPCG, Graph500, MPI collectives, and OpenMP loop/task/reduction constructs.
Current capOS Benchmark Analysis
Current CPU-scaling evidence is useful but narrow:
make run-smp-process-scaleexercises independent worker processes under QEMU/KVM. Its prime-counting workload is static partition plus final verification. Current rows reach 1, 2, and 4 vCPUs, plus one 8-logical-CPU SMT row on a 4-core/8-thread host.make run-thread-scaleuses a fixed-size checksum workload with per-thread rings and guest phase counters. The strongest current row records capOS 1-to-4 work/total speedups3.088x/2.700xunder QEMU/KVM, while the matching Linux pthread baseline on the same host and pin set records3.974x/3.850x.- Native Linux pthread baselines show the checksum shape can scale on the benchmark host, but also expose coordinator and oversubscription sensitivity. Larger workloads help separate Amdahl effects from thread lifecycle overhead.
- Guest measurement now covers scheduler, serial, scheduler-lock, timer, TLB, and user-PC attribution, but the workload still represents only static partitioned CPU work.
So the current suite covers one pattern well: independent fixed chunks with a final checksum/reduction. It does not yet cover dynamic task scheduling, barriers, prefix/scan, all-to-all movement, stencils, sparse/dense kernels, graph frontiers, pipelines, or multi-node communication. It also does not yet produce direct-hardware 16/32-core rows, which is the bar that Scheduler Evolution Phase F.5 sets for full-SMP scalability evidence on top of the SMP bring-up substrate.
Goals
- Classify parallel benchmark coverage by algorithm pattern, not by a single score or one “HPC benchmark” label.
- Keep correctness and authority gates ahead of speed claims.
- Provide single-node pattern kernels before multi-node transport exists.
- Give scheduler, runtime, memory, IPC, storage, and networking work concrete future coverage targets.
- Allow Linux/FreeBSD/MPI/OpenMP comparisons only when the semantic mapping is declared.
Non-Goals
- Do not port MPI or OpenMP as a prerequisite for the first pattern kernels.
- Do not run full HPL, HPCG, NAS, or Graph500 before capOS has the required runtime, memory, file/store, and network substrate.
- Do not add POSIX compatibility or ambient filesystem authority only to run a familiar suite.
- Do not count SMT diagnostics as core-count scaling evidence.
- Do not add benchmark-only fast paths to normal kernel dispatch.
Pattern Coverage Matrix
Each pattern should have a capOS-native kernel, a result verifier, and a declared authority profile. Multi-node variants stay future until network transport and distributed-capability authority are explicit.
| Pattern | Single-node kernel | Multi-node shape | Verification |
|---|---|---|---|
| Static map/reduce | split fixed-size byte/block ranges across threads or processes | scatter chunks, local compute, reduce root | deterministic root hash or numeric reduction |
| Dynamic task pool | variable-cost tasks in a bounded deque or queue | work requests between nodes or delegated task shards | all task ids completed once, result hash, cancellation proof |
| Barrier phase loop | repeated phase computation with a barrier between phases | barrier across ranks or services | phase count, no early phase observation |
| Prefix/scan | per-thread prefix over numeric blocks | distributed scan over rank partitions | prefix checksum and boundary carry checks |
| Stencil/halo | 1D/2D/3D grid update with neighbor halo buffers | halo exchange between rank partitions | final grid checksum and boundary oracle |
| Dense tiled compute | tiled matrix multiply or small LU-like update | 2D block-cyclic tile distribution | matrix checksum/residual bound |
| Sparse iterative compute | CSR-like sparse matrix-vector plus dot products | partitioned sparse rows with global reductions | residual/checksum and iteration count |
| FFT/transpose | staged local FFT-like butterflies plus matrix transpose | all-to-all transpose between ranks | output checksum against reference |
| Sort/partition | integer bucket partition plus local sort | sample/splitter exchange and all-to-all buckets | sortedness, permutation checksum |
| Graph frontier | BFS-like frontier over synthetic graph | distributed frontier exchange | parent tree/level validation |
| Pipeline/stream | bounded producer/stage/consumer service graph | service pipeline across nodes | ordered records, backpressure, no dropped records |
| Collective-only | barrier, broadcast, gather, scatter, reduce, allreduce | same operations over networked ranks | collective-specific oracle and timeout behavior |
Proposed Stages
Stage 0: Keep Current CPU Rows Explainable
Keep make run-thread-scale as the fixed-size checksum workload so historical
rows remain comparable. Add new pattern targets alongside it rather than
changing the meaning of the existing table. The benchmark page should describe
the workload, timed region, verifier, environment, and limitations directly.
Stage 1: Single-Node Pattern Kernels
Add a parallel-patterns demo crate and host harness that can run small
single-node kernels under one process and multiple worker processes. Workers
should be expressible both as same-process threads via the
in-process threading contract and as
independent processes over the SMP substrate, so the rows
distinguish thread-local pick/wake costs from process/IPC boundaries. These
are the first rows needed for the future full-SMP hardware profile that
Scheduler Evolution Phase F.5
treats as its 16/32-core success bar:
static_reduce: successor to the checksum workload, reusable as the sanity baseline.dynamic_pool: uneven task sizes to force runtime scheduling and fairness.barrier_loop: repeated phases to expose barrier and wakeup overhead.scan: prefix computation to exercise ordered fan-in/fan-out.stencil_2d: shared-buffer or private-buffer halo copies inside one node.
Each kernel prints compact structured lines with pattern, workers,
cpus, input_class, verified, work, total, and relevant counters.
Host harness summaries must keep raw logs under target/parallel-patterns/.
The hardware profile should run these kernels at 1, 2, 4, 8, 16, and 32
workers when the machine has enough physical cores, with SMT rows separated.
Stage 2: Memory And IPC Intensive Kernels
After MemoryObject/shared-buffer and IPC paths mature, add:
sparse_spmv: CSR-style row partition with deterministic matrix generator;graph_bfs: synthetic graph frontier with visited-set validation;sort_bucket: bucket partition, prefix counts, local sort, and merge verification;pipeline_stream: bounded service stages with backpressure telemetry.
These kernels should run both thread-local and process/service forms so capOS can distinguish scheduler overhead from IPC, cap-table, shared-buffer, and service-boundary costs. The thread form follows the in-process threading contract; the process and service forms exercise the cross-CPU wake, migration, and stale-context paths that Scheduler Evolution Phase F.5 expects to harden on top of the SMP substrate.
Stage 3: Capability-Native Collectives
Introduce a small collective service or library abstraction before pretending to support MPI. The first operations are:
- barrier;
- broadcast;
- scatter/gather;
- reduce/allreduce;
- scan;
- all-to-all for fixed-size blocks.
Collectives are benchmark subjects and future runtime building blocks, not ambient cluster authority. A caller receives only the communicator/session cap for its benchmark group. Membership, timeout, cancellation, and stale-session behavior are part of the verifier.
Stage 4: Multi-Node Harness
After capOS has a network-capability path suitable for services, add a multi-node harness that can start N capOS guests or capOS plus Linux reference guests under a controlled topology. The first target is not full MPI; it is a capOS-native rank/session model:
- rank membership is represented by explicit capabilities;
- transport authority is scoped to the benchmark group;
- result collection includes per-node raw logs and topology metadata;
- failed, slow, or stale ranks produce controlled errors instead of hanging the harness indefinitely.
Only then should capOS attempt NAS-like, HPL-like, HPCG-like, or Graph500-like profiles with clearly labeled deviations from upstream rules.
Authority And Safety Rules
- A benchmark group cap grants participation, not ambient network or process authority.
- Distributed pattern kernels must separate control-plane capabilities from data-plane buffers or sockets.
- Every kernel must have bounded allocation, queue, and message sizes.
- Timeouts and cancellation are correctness paths, not harness afterthoughts.
- Result verification must fail closed before speed summaries are accepted.
- Measurement features may add counters, but normal dispatch must remain the code path being evaluated unless the result is labeled as a measure build.
Reporting Format
Pattern results should extend the existing benchmark artifact conventions:
- source commit, manifest, input class, worker/rank count, CPU count, run count;
- host, QEMU/KVM, pinning, SMT/core topology, and network topology;
- capOS authority profile and any benchmark-only feature flags;
- per-run raw logs and
results.csv; - median work and total windows, plus variance;
- verifier status and reason for any
not_applicableordiagnosticresult; - comparison-system mapping, such as OpenMP taskloop, pthreads, MPI collectives, or native Linux process/thread equivalents.
Near-Term Recommendation
Do not start with HPL, HPCG, or Graph500 ports. Start with a small
capOS-native parallel-patterns harness after the current thread-scale
milestone closes. The first five kernels should be static_reduce,
dynamic_pool, barrier_loop, scan, and stencil_2d. That set broadens
coverage from “static independent chunks” to synchronization, irregular
scheduling, ordered reductions, and neighbor exchange while staying within
single-node capOS mechanisms.
When networking and storage mature, extend the same pattern definitions to multi-node and data-intensive variants rather than creating a parallel, unrelated benchmark suite. Pattern adoption stays paced by the substrate it exercises: Scheduler Evolution Phase F.5 gates the 16/32-core single-node rows, the SMP proposal gates the per-CPU substrate those rows depend on, and the in-process threading contract gates the same-process worker forms each pattern kernel needs.
Proposal: Scientific Standard Package And Agent Lab Capabilities
capOS should eventually ship a curated scientific standard package: a capability-scoped service graph that gives agents and users high-level access to computer algebra, numerical computing, solvers, formal proof systems, notebooks, reproducible package environments, and experiment records.
This is not a request to turn the kernel into a scientific runtime. The kernel still provides capability tables, address spaces, scheduling, IPC, memory, device, and storage primitives. The scientific package lives in userspace, above package, workspace, job-graph, model, and broker services.
Design Grounding
Local grounding:
- Scientific Agent-Lab Software Stack
- Linux Sandboxes And Virtualization For Workloads
- NO_HZ, SQPOLL, and Realtime Scheduling
- Language Models and Agent Runtime
- capOS-Hosted Agent Swarms
- Userspace Binaries
- Stateful Task and Job Graphs
- Storage and Naming
- HPC Parallel Processing Patterns
- GPU Capability
- System Performance Benchmarks
External grounding is summarized in the research notes and covers PARI/GP, SageMath, GAP, Singular, OSCAR, SymPy, SciPy, R, Octave, JupyterLab, Z3, cvc5, HiGHS, SCIP, OR-Tools, JuMP, CVXPY, Lean/mathlib, Rocq, Isabelle, Agda, Spack, Guix-HPC, Nix, Apptainer, Linux namespaces/cgroups/seccomp/Landlock, User-Mode Linux, gVisor, QEMU/KVM, Firecracker, Kata Containers, and Linux CPU isolation/housekeeping.
Goals
- Give users, agent runners, and batch services high-level scientific capabilities without granting an unrestricted shell.
- Make exact computation, numerical computation, optimization, SMT solving, and proof checking ordinary capOS services with explicit authority.
- Preserve reproducibility: package closure, input data, seed, backend, version, timeout, quota, output, and audit metadata travel with every result.
- Reuse mature upstream tools wherever possible.
- Support both interactive research and unattended agent jobs.
- Keep tool authority separate from model inference. Models propose; trusted capOS runners execute through broker policy.
Non-Goals
- Do not invent a replacement for SageMath, PARI, GAP, Singular, OSCAR, SciPy, Jupyter, Lean, Rocq, Isabelle, or established solvers.
- Do not add POSIX, Docker, Conda, Nix, Guix, or Spack as ambient system authority.
- Do not make notebook execution equivalent to shell access.
- Do not treat SMT or CAS answers as formal proof unless a proof checker validates an artifact.
- Do not make this package part of the active in-process threading milestone.
Package Profiles
The standard package should be split into explicit profiles so capOS can ship or grant only what a session needs.
| Profile | Contents | Primary use |
|---|---|---|
scientific-base | PARI/GP or PARI C service, SymPy, Z3, cvc5, HiGHS, Lean checker, artifact store | Low-risk exact math, solver, and proof assistance |
scientific-research | SageMath, GAP, Singular, OSCAR/Julia, R, Octave, SciPy, JuMP, CVXPY, SCIP, OR-Tools | Full interactive research workflows |
scientific-notebook | Jupyter-compatible notebook/session service and language kernels | Literate experiments with replayable artifacts |
scientific-lab | Experiment registry, workspaces, job graphs, retrieval, review gates, GPU/model integration | Long-running research labs with users, agents, and review workflows |
scientific-commercial | Optional proprietary/commercial connectors such as Wolfram Engine or commercial solvers | Explicitly licensed site-local extensions |
Profiles grant service roots, not every concrete backend cap. A user, agent
runner, or batch service normally receives a ScientificSession facade that
advertises only the tools and methods permitted for the current session.
Capability Surface
Catalog And Environment
ScientificCatalog: lists installed profiles, backend identities, supported interfaces, licenses, package closures, and known reproducibility caveats.PackageCatalog: resolves named package environments to content-addressed closures.PackageClosure: immutable description of packages, build inputs, toolchain versions, hashes, license metadata, vulnerability metadata, and supported CPU/GPU features.Environment: starts a bounded interpreter, solver, proof, notebook, or job process with exactly the selected closure and granted caps.
Workspaces And Artifacts
ResearchWorkspace: branchable namespace for source, notebooks, data, generated files, proofs, and run records.ArtifactStore: immutable objects for solver inputs, proof logs, notebooks, datasets, plots, tables, binaries, and transcripts.ProvenanceLog: append-only record of who or which agent produced an artifact, with model/tool/package/session metadata.ExperimentRegistry: immutable run specifications plus mutable review status, labels, and publication decisions.
CAS And Mathematical Services
ComputerAlgebra: general symbolic manipulation facade for factorization, simplification, integration, exact linear algebra, polynomial operations, and expression normalization.NumberTheory: PARI-backed exact number theory, elliptic curves, modular forms, algebraic number fields, L-functions, and related computations.DiscreteAlgebra: GAP-backed group, representation, finite algebra, and combinatorics workflows.PolynomialAlgebra: Singular-backed ideals, modules, Groebner bases, quotient rings, and algebraic geometry computations.JuliaAlgebraKernel: OSCAR/Nemo/Hecke/AbstractAlgebra workflows for cases where a general Julia session is the correct backend.
Each method returns structured values when practical and always records the backend, package closure, input, output, elapsed time, and resource envelope.
Solvers
SmtSolver: typed SMT-LIB import/export, assertions, check-sat, model, unsat core, timeout, random seed, proof/certificate metadata, backend selection among Z3/cvc5 or future solvers.OptimizationSolver: LP, MIP, QP, conic, CP-SAT, routing, scheduling, and nonlinear solve jobs with declared model format, backend, objective, constraints, time limit, memory limit, gap/tolerance policy, and solution status.ModelingSession: JuMP/CVXPY/OR-Tools-style language session for models that need high-level construction rather than direct serialized input.
Solver calls must distinguish optimal, feasible, infeasible,
unbounded, unknown, timeout, resource_exceeded, and backend_error.
User-facing tools should not collapse these into a single textual answer.
Formal Proof
ProofCatalog: installed proof assistants, libraries, theorem indexes, and package closures.ProofSession: checkout, edit, build, query goals, run tactics, run tests, and produce checked proof artifacts.ProofChecker: batch verification of a named theorem or project under a pinned closure.LemmaSearch: retrieval over local proof libraries, declarations, docs, and prior accepted project artifacts.
The first implementation target should be Lean plus mathlib because it is the most useful default for current agent-assisted mathematics. Rocq, Isabelle, and Agda should remain first-class future backends with separate kernels and project layouts.
Notebook And Interactive Kernel Sessions
NotebookDocument: immutable or branchable notebook object with cells, outputs, attachments, environment id, and execution provenance.NotebookSession: starts kernels, executes cells, captures outputs, renders rich media, and gates side effects.KernelSession: Python, Sage, Julia, R, Octave, Lean, GAP, PARI, or other REPL-like process with explicit workspace and package environment caps.
Notebook execution is authority-bearing. Opening a notebook for reading should not execute it. Running a notebook should prompt or use session policy for network access, package installation, writes outside the workspace, long jobs, credential access, GPU use, and publication.
Agent Lab Architecture
An LLM agent research lab on capOS should be a service graph:
flowchart LR
User[User] --> Runner[Agent Runner]
Runner --> Broker[AuthorityBroker]
Runner --> Model[LanguageModel]
Runner --> Sci[ScientificSession]
Sci --> CAS[CAS Services]
Sci --> Solvers[Solver Services]
Sci --> Proof[Proof Services]
Sci --> Notebook[NotebookSession]
Sci --> Jobs[JobGraph]
Sci --> Workspace[ResearchWorkspace]
Workspace --> Artifacts[ArtifactStore]
Jobs --> Compute[CPU/GPU/Storage/Network Caps]
Runner --> Audit[ProvenanceLog]
The runner owns the user session and applies tool policy. The model service
does not hold scientific tool caps directly. Tool calls from the model become
typed proposals to the runner, and the runner invokes ScientificSession
methods only when broker policy allows it.
Authority And Safety Rules
- A tool cap grants only the named interface.
NumberTheorydoes not imply file, shell, network, package install, or proof-publication authority. - Package installation and environment resolution are separate authorities from executing an already-pinned environment.
- External network fetch is separate from local computation. Literature search, package download, model-provider calls, and dataset upload are different caps.
- Every long-running calculation must have a job id, quota, cancellation path, and durable status.
- GPU use requires a GPU/session cap and should record driver/runtime/kernel metadata.
- Proof acceptance must be checker-backed. Agent confidence, CAS evidence, or SMT success is advisory unless the proof kernel accepts the artifact.
- Published results must cite the artifact ids and package closure ids that produced them.
- Commercial or proprietary engines must be opt-in, labeled, and grantable only through site policy.
- Linux workload placement must distinguish ordinary resource-limited work
from capOS-native auto-nohz-eligible work. Linux
nohz_fullinside a guest may be useful compatibility or benchmark state, but capOS CPU isolation, auto full-nohz activation, housekeeping placement, IRQ routing, and exclusive CPU use are outer scheduler-authority decisions, not options an agent tool descriptor can set by itself.
Linux Workload And Virtualization Strategy
The first implementation is likely to consume the generic Linux workload sandbox substrate for large scientific stacks. Scientific jobs should be selected by trust and compatibility class:
| Backend | Use | Boundary claim |
|---|---|---|
| namespace/cgroup/seccomp/Landlock sandbox | trusted batch tools and fast command wrappers | shares host Linux kernel; useful policy layer, not strong multi-tenant isolation |
| bubblewrap/nsjail | early command-wrapper executor for gp, solvers, proof checkers, and scripts | structured process sandbox over Linux primitives |
| User-Mode Linux | developer/debug fallback when KVM is unavailable | Linux-as-host-process compatibility; not the main strong-isolation path |
| gVisor | container-compatible higher-risk workloads | per-sandbox application kernel reduces direct host-kernel exposure |
| QEMU/KVM Linux guest | broad compatibility, full distro roots, package builds, untrusted notebooks | hardware-backed guest kernel boundary |
| Firecracker or Kata-style microVM | repeated stateless solver/proof/notebook jobs with narrow device models | hardware-backed microVM boundary with smaller operational surface |
| dedicated host or single-tenant node | high-risk tenants, sensitive data, GPU/device passthrough, side-channel-sensitive jobs, long-lived browser/GUI workloads | reduces shared-host and VM-escape blast radius beyond ordinary VM tenancy |
The generic LinuxWorkloadSandbox service should record backend,
image/rootfs/package hashes,
sandbox policy or VM device model, kernel version, CPU affinity, cgroup quota,
deployment location, external-host placement metadata, capOS
NoHzEligibility/NoHzActivation state for capOS-scheduled proxies or VMMs,
guest tickless/nohz state, network policy, artifact inputs, artifact outputs,
and exit reason. A result from a namespace sandbox and a result from a KVM
guest may be functionally equivalent, but their security, scheduler, and
reproducibility claims are different.
For the native capOS auto full-nohz scheduler track, scientific jobs should use the generic workload placement classes:
- ordinary placement: cgroup v2 resource limits and optional affinity for normal solver, proof, CAS, package, and notebook jobs.
- auto-nohz-eligible placement: explicit capOS eligibility plus CPU-time authority for low-jitter benchmark, realtime, GPU-feed, SQPOLL-like, or latency-bound workload loops. The outer capOS scheduler must know the workload’s vCPU/helper/poller threads and must also account for housekeeping CPUs, IRQ placement, timers, and deferred kernel work. Guest Linux tickless state and external Linux-host isolation state are recorded separately and do not by themselves activate capOS nohz.
Existing Solutions To Adapt
| Area | Adapt first | Reasonable capOS adaptation |
|---|---|---|
| Number theory | PARI/GP | Wrap gp early; use PARI C library for stable service calls later. |
| Broad math | SageMath | Host as a Python/Sage kernel with pinned closure and notebook integration. |
| Discrete algebra | GAP | Wrap CLI and package loading; later expose common group-theory methods. |
| Polynomial algebra | Singular | Wrap command/batch mode; later expose polynomial/ideal operations. |
| Algebra research | OSCAR | Host Julia/OSCAR kernel; avoid flattening its object model prematurely. |
| Symbolic Python | SymPy | Embed in Python service for lightweight symbolic calls and code generation. |
| Scientific Python | NumPy/SciPy | Provide Python kernel and batch-job environments with BLAS/LAPACK metadata. |
| Statistics | R | Provide Rscript and R kernel sessions with package closure capture. |
| MATLAB-like workflows | GNU Octave | Provide batch and interactive kernel sessions. |
| SMT | Z3, cvc5 | Provide SmtSolver with backend identity, model, core, and timeout fields. |
| Optimization engines | HiGHS, SCIP, OR-Tools | Provide direct solve jobs and higher-level modeling sessions. |
| Modeling layers | JuMP, CVXPY | Host Julia/Python modeling kernels and export normalized model artifacts. |
| Formal proof | Lean/mathlib first; Rocq, Isabelle, Agda later | Provide proof sessions, build logs, theorem search, and checked artifacts. |
| Notebooks | JupyterLab model | Reuse .ipynb concepts and kernels but replace ambient authority with caps. |
| Package closure | Nix, Guix, Spack | Ingest closures and recipes; expose capOS package catalogs and Store objects. |
| HPC containers | Apptainer | Use as a Linux-sidecar compatibility bridge, not as native authority. |
Staged Implementation
Stage 0: Interface-Only Design
Define schemas for ScientificSession, ArtifactStore, PackageClosure,
SmtSolver, OptimizationSolver, ProofSession, and NotebookSession.
No backend porting is required. The goal is to make the authority and result
model reviewable.
Stage 1: Linux Sidecar Prototype
Run tools on a controlled Linux host or hardware-backed Linux guest and expose them to capOS through a capability proxy. Namespace/cgroup/seccomp/Landlock wrappers are acceptable for trusted batch tools, but untrusted notebooks, model-generated code, package builds, and multi-tenant jobs should use a QEMU/KVM guest first and Firecracker/Kata-style microVMs later. High-risk tenants, sensitive data, GPU/device passthrough, and side-channel-sensitive jobs may require single-tenant hosts instead of shared VM hosts. User-Mode Linux may remain a developer/debug fallback when KVM is unavailable, but it is not the default strong-isolation backend. This proves the API, audit, and reproducibility model before native userspace can run Python, Julia, R, and large C++ stacks.
Initial tools:
- PARI/GP;
- SymPy;
- Z3 and cvc5;
- HiGHS;
- Lean plus mathlib project build;
- immutable artifact store and provenance records.
Stage 2: Native Wrapper Services
When capOS userspace has the necessary binary/runtime support, add command
wrapper services for gp, lean/lake, z3, cvc5, highs, Rscript,
octave, gap, and Singular. Each wrapper runs with an explicit workspace,
environment, timeout, and resource ledger.
Stage 3: Notebook And Language Kernels
Add Jupyter-compatible document storage and kernel-launch policy. Python,
Sage, Julia, R, Octave, Lean, and GAP kernels can then run as KernelSession
services with capOS-owned artifact capture.
Stage 4: Package-Closure Store
Import or build Nix/Guix/Spack-style closures into capOS Store and
Namespace capabilities. Package resolution stays outside the kernel. The
important kernel-visible property is that executable environments are immutable
objects with explicit resource and authority grants.
Stage 5: Lab Workflow
Combine scientific sessions with hosted-agent workspaces, experiment registry, review gates, browser/literature tools, GPU/model services, and stateful job graphs. This is the point where capOS becomes a credible LLM agent research lab rather than a collection of math commands.
Open Questions
- Should the first sidecar protocol be Cap’n Proto RPC directly, MCP through a gateway, or both?
- Which package-closure source should capOS ingest first: Nix for breadth, Guix for scientific reproducibility, or Spack for HPC variants?
- Which hardware-backed Linux guest backend should be first after QEMU/KVM: Firecracker for narrow batch workers, Kata-style VM containers for OCI integration, or both?
- Which workload classes are eligible for capOS native auto full-nohz
placement, and how should that map to future
CpuIsolationLease,NoHzEligibility,NoHzActivation, andSchedulingContextauthority? - How much of
.ipynbshould be preserved versus represented as a capOS-native notebook object with import/export? - Which proof artifacts can be reduced to small trusted checker inputs, and which require full project build logs for confidence?
- How should floating-point nondeterminism and randomized solver behavior be summarized so agents do not overclaim exactness?
- Where should license policy live: package catalog, broker policy, or both?
Near-Term Recommendation
Do not start by porting SageMath or Jupyter. Start with a small
scientific-base sidecar proof:
NumberTheory.evalbacked by PARI/GP;SmtSolver.checkbacked by Z3 and cvc5;OptimizationSolver.solvebacked by HiGHS for LP/QP/MIP smoke cases;ProofChecker.buildbacked by Lean/mathlib for a pinned project;- immutable artifact/provenance records for every call.
That profile gives users, agent runners, and batch services exact arithmetic, constraint checking, optimization, and formal proof validation while keeping authority narrow enough for review. SageMath, OSCAR, Jupyter, R, Octave, and full package closure support should follow after the base interfaces and audit model are credible.
Proposal: User Identity, Sessions, and Policy
How capOS should represent human users, service identities, guests, anonymous callers, and policy systems without reintroducing Unix-style ambient authority.
Status: partially implemented. The current tree has entropy-backed
UserSession metadata for anonymous, operator, and guest profiles; a bootstrap
CredentialStore; shell-driven login, setup, and guest profile changes;
AuthorityBroker.shellBundle returning broker-issued launcher, copied session,
SystemInfo, and operator-scoped service endpoint caps; and manifest seed
records for local operator/guest proofs. Guest shell bundles are manifest-gated
and receive no default service endpoints. Endpoint calls now keep subject
details private by default and disclose only requested-and-allowed fields from
cap-held service/broker disclosure scope. The broader proposal remains target
design for durable account storage, external identity bindings, session
logout/revocation/renewal lifecycle, quota-backed profiles, ABAC/MAC policy
engines, and POSIX compatibility metadata.
Problem
capOS has processes, address spaces, capability tables, object identities,
badges, quotas, and transfer rules. It deliberately does not have global
paths, ambient file descriptors, a privileged root bit, or Unix uid/gid
authorization in the kernel.
Interactive operation still needs a way to answer practical questions:
- Who is using this shell session?
- Which caps should a normal daily session receive?
- How does a service distinguish Alice, Bob, a service account, a guest, and an anonymous network caller?
- How do RBAC, ABAC, and mandatory policy fit a capability system?
- How does POSIX compatibility expose users without letting
uidbecome authority?
The answer should keep the enforcement model simple: capabilities are the authority. Identity and policy decide which capabilities get minted, granted, attenuated, leased, revoked, and audited.
Design Principles
useris not a kernel primitive.uid,gid, role, and label values do not authorize kernel operations.- A process is authorized only by capabilities in its table.
- Authentication proves or selects a principal; it does not itself grant authority.
- An account is a durable local record for a principal; it is not a running subject.
- A session is a live policy context with selected policy and resource profiles that receives a cap bundle.
- A workload is a process or supervision subtree launched with explicit caps.
- POSIX user concepts are compatibility metadata over scoped caps.
- Guest and anonymous access are explicit policy profiles, not missing policy.
- External roles, groups, claims, and local roles are broker inputs, not authority after the corresponding caps are absent.
Concepts
Principal
A principal is a durable or deliberately ephemeral identity known to auth and policy services. It is useful for policy decisions, ownership metadata, audit records, and user-facing display. It is not a kernel subject.
Examples:
- human account
- operator account
- service account
- cloud instance or deployment identity
- guest profile
- anonymous caller
- pseudonymous key-bound identity
The schema excerpt below is proposal-level shape. Where the interfaces already
exist in schema/capos.capnp, the ordinals shown here must match the checked-in
schema; future methods must be assigned from the next free ordinal when the
schema is actually extended.
enum PrincipalKind {
human @0;
operator @1;
service @2;
guest @3;
anonymous @4;
pseudonymous @5;
}
struct PrincipalInfo {
id @0 :Data; # Stable opaque ID, or random ephemeral ID.
kind @1 :PrincipalKind;
displayName @2 :Text;
}
PrincipalInfo is intentionally descriptive. Possessing a serialized
PrincipalInfo value must not grant authority.
Federated authentication uses a canonical external subject key:
hash(providerKind, issuer, tenant, subject). For OIDC, issuer is iss,
subject is sub, and tenant is the normalized tenant or configured empty
tenant. sub alone is not unique across IdPs and must not be used directly.
Admission policy either maps that external key to an existing local principal
through an ExternalIdentityBinding or admits it as a pseudonymous principal
under an explicit policy/resource profile pair. PrincipalKind covers the resolved
local principal through human / operator / service / pseudonymous
depending on deployment intent; a federated service account is service, a
federated human is human, and a federated ephemeral identity with no stable
person behind it is pseudonymous. The OIDC integration details live in
OIDC and OAuth2.
User
user is a user-facing category for a principal/session that represents a
human or human-adjacent actor. It is not a kernel object, not a UID, and not
an authority source. Use principal, account, session, or workload
when one of those narrower concepts is meant.
Account
An account is a durable local record for a principal. It binds credential references, status, roles, attributes, storage roots, quotas, and default policy/resource profile names. Some principals deliberately have no account: anonymous callers, some guests, and some one-shot external sessions.
Accounts do not run and do not hold capabilities. Session creation reads an account record, manifest seed record, or external admission binding, then asks a trusted broker to mint the actual CapSet for a live session or workload.
Profile
A profile is a named policy template. It contains no authority by itself.
- A policy profile selects roles, ABAC defaults, allowed bundle fragments, approval paths, label defaults, and external admission constraints.
- A resource profile selects storage, memory, CPU share, process/thread/cap limits, IPC limits, log volume, network posture, and launcher posture.
Use plain profile only when prose intentionally covers both policy and
resource profiles.
Session
A session is a live context derived from a principal plus authentication and policy state. Sessions carry freshness, expiry, auth strength, audit identity, and selected policy and resource profiles. The selected profiles influence which caps a broker may mint and which quotas wrappers apply; the profiles are not usable authority.
AuthStrength aligns with ITU-T X.1254 Entity authentication assurance
framework (= ISO/IEC 29115) level-of-assurance tiers. X.1254 defines
LoA 1 (little or no confidence) through LoA 4 (very high confidence) as a
composite of identity-proofing strength, credential strength, and
authentication-protocol strength. capOS uses the same tiers so that
policy decisions can be expressed as “require LoA ≥ 3 for
ServiceSupervisor(net-stack)” without inventing parallel terminology.
# ITU-T X.1254 / ISO/IEC 29115 level-of-assurance tiers.
# `loa0` covers "no assertion" (`anonymous` sessions) and sits below
# the X.1254 lattice; the standard numbers LoA 1-4 only.
enum AuthStrength {
loa0 @0; # no authentication; anonymous
loa1 @1; # little/no confidence; self-asserted identity
loa2 @2; # some confidence; single-factor, e.g. password
loa3 @3; # high confidence; multi-factor, hardware-backed key
loa4 @4; # very high confidence; multi-factor with tamper-resistant
# hardware and in-person or equivalent identity proofing
}
struct SessionInfo {
sessionId @0 :Data;
principal @1 :PrincipalInfo;
authStrength @2 :AuthStrength;
createdAtMs @3 :UInt64;
expiresAtMs @4 :UInt64;
policyProfile @5 :ProfileSummary;
resourceProfile @6 :ProfileSummary;
# Multi-party / delegated / federated session context. Populated when
# the session was minted through an AuthorityBroker approval flow or a
# federated IdP rather than direct interactive login.
delegationChain @7 :List(Data); # opaque session/IdP IDs
}
struct ProfileSummary {
id @0 :Data;
displayName @1 :Text;
versionId @2 :Data;
epoch @3 :UInt64;
}
struct CapabilityResultHandle {
brokerId @0 :Data;
grantId @1 :Data;
interfaceId @2 :UInt64;
issuedAtMs @3 :UInt64;
expiresAtMs @4 :UInt64;
}
interface UserSession {
info @0 () -> (info :SessionInfo);
auditContext @1 () -> (sessionId :Data, principalId :Data);
logout @2 () -> ();
# Future result/grant metadata methods must use fresh ordinals; they are
# intentionally not assigned in this proposal sketch.
}
interface SessionManager {
login @0 (
method :Text,
selector :LoginSelector,
proof :Data,
source :LoginSourceMetadata
) -> (sessionIndex :UInt16);
guest @1 () -> (sessionIndex :UInt16);
anonymous @2 () -> (sessionIndex :UInt16);
sshPublicKey @3 (
username :Text,
algorithm :Text,
publicKey :Data,
authBytes :Data,
signature :Data,
sourceAddr :Data
) -> (sessionIndex :UInt16);
# Future renewal must use the next free ordinal in the checked-in schema,
# currently @4, not @3.
}
When brokers return granted caps, GrantedCap should be the same
transport-level result-cap concept used by ProcessSpawner, not a parallel
authority encoding.
UserSession is the live session/profile summary surface, not the account
database and not the process invocation subject itself. In the session-bound
invocation model, the immutable kernel-installed SessionContext on the
process is the invocation context; kernel/src/session_context.rs owns that
state and the spawn-time inheritance/broker-selection rules described in
Service Architecture. A
UserSession cap may expose stable session metadata, profile summaries, audit
context, expiry, and opaque handles for cap-broker results that have already
been minted. It can also be used as trusted broker/session-manager input to
spawn a child with a matching SessionContext, but copying a UserSession
into an existing process cannot install a second session or relabel future
calls. These handles are
non-bearer metadata for audit and UI display: they cannot be redeemed into
caps unless the caller also holds the separate broker, approval, or launcher
authority required for the grant. UserSession must not expose mutable account
records, credential records, role bindings, storage-root records, policy
document bodies, or redeemable grant tokens. Fresh cap bundles come from
AuthorityBroker or a launcher/supervisor that consumes the session context;
the session cap itself is not a general account-store reader and is not the
ordinary authority-vending path.
Session Lifecycle And Renewal
The expiresAtMs field is not sufficient by itself. The target model treats a
session as a revocable lease with explicit state:
live | logged_out | revoked | expired | recovery_only
The immutable process SessionContext identifies the subject selected at
spawn (see Service Architecture
for the kernel-owned spawn-time installation and the
make run-session-context proof). It should point at, or be paired with,
trusted session-manager liveness state that can change without relabeling the
process:
SessionLivenessCell {
sessionId
sessionEpoch
state
notBeforeMs
notAfterMs
policyEpoch
resourceProfileEpoch
auditRecordId
}
The liveness cell answers whether ordinary invocation may continue. Grant leases answer whether a particular broker-issued bundle or elevated cap remains valid. Object/facet epochs answer whether the target live object generation has been revoked or replaced. These checks compose; none of them is a substitute for capability possession.
For local password-authenticated shells, fixed short wall-clock expiry should not be the only interactive policy. A sane default is that the session remains live until explicit logout, terminal/connection close, owner shell or supervisor subtree exit, administrator revocation, account disablement, policy version invalidation, or a configured idle/hard maximum. Guest, anonymous, remote, federated, and elevated sessions may use much shorter leases.
Renewal must be a narrow session-manager or broker path. The exact Cap’n Proto
signature is future schema work; with the current checked-in SessionManager
ordinal map, the first renewal method would be assigned @4 unless another
schema change lands first:
interface SessionManager {
renew @nextFree (
session :UserSession,
proof :Data,
requestedDurationMs :UInt64
) -> (session :UserSession);
}
renew may extend the same liveness cell or mint a successor session in the
same audit family, depending on policy. It must check account status, auth
freshness, session state, policy/resource profile epochs, requested duration,
absolute maximum lifetime, and explicit revocation state. It must not make all
old grants fresh. When policy needs a new decision, the broker returns fresh
grant leases and wrapper caps; stale ordinary grants remain stale or are
explicitly revoked.
Only named recovery methods should work after expiry: logout, renew, recovery, and narrowly scoped self-diagnostic status. Explicit revocation should block ordinary renewal unless a separately audited recovery policy says otherwise. Owner-shell exit and gateway disconnect should call logout for sessions they own, then process-exit cleanup releases local hold edges.
Workload
A workload is a process or supervision subtree started from a session, service, or supervisor. Workloads may carry session metadata for audit and policy, but they do not run “as” a user in the Unix sense. They run with a CapSet.
Common workload shapes:
- interactive native shell
- agent shell
- POSIX shell compatibility session
- user-facing application
- per-user service instance
- shared service handling many user sessions
- service account process
Capability
A capability remains the actual authority. A process can only use what is in its local capability table. Policy services can choose to mint, attenuate, lease, transfer, or revoke capabilities, but they do not create a second authorization channel.
Account and Admission Sources
capOS should have three account and admission sources. All three feed policy; none of them bypass the capability graph.
- Manifest seed accounts. Immutable or append-only bootstrap records in the boot package. These create first local operators, recovery identities, service identities, emergency guest policy, and initial policy bundles. Seed data must be sufficient to boot, recover, unlock storage, and create or repair the local account store. It must not become the ordinary mutable account database.
- Local account store. Mutable Store/Namespace-backed records for accounts, credentials, roles, attributes, quotas, policy profiles, resource profiles, and storage roots. After initialization, disk state is authoritative for ordinary local accounts, with explicit versioning, rollback detection, and recovery import/export.
- External identity admission and bindings. OIDC, passkey, cloud, deployment, or certificate-backed principals mapped to named policy/resource profiles or existing local accounts. External claims are normalized ABAC inputs and may select a binding; they do not grant local authority by themselves.
Account Store Boundary
Mutable account state belongs in a separate account-store schema and service
slice, not in the session schema. The identity/session schema should contain
PrincipalInfo, SessionInfo, profile summaries, audit context, and opaque
broker result handles. The account-store slice owns durable account records,
credential references, local role bindings, external identity bindings,
profile bodies and versions, storage-root references, recovery/import records,
and mutation/audit metadata.
The account-store service should expose typed reads for trusted policy
services and compare-and-set mutation methods for administrative tooling.
SessionManager reads account-store records only while creating or refreshing
a session, then returns a UserSession summary. AuthorityBroker uses that
summary plus account-store/profile lookups to mint caps. Ordinary workloads
must not learn more than the scoped session/profile metadata and caps they
were explicitly granted.
Initial records should stay cap-shaped:
struct AccountRecord {
recordId @0 :Data;
principalId @1 :Data;
kind @2 :PrincipalKind;
displayName @3 :Text;
status @4 :AccountStatus;
credentialRefs @5 :List(Data);
roles @6 :List(Text);
attributes @7 :List(Attribute);
resourceProfile @8 :ProfileRef;
policyProfile @9 :ProfileRef;
homeRoot @10 :StorageRootRef;
createdAtMs @11 :UInt64;
updatedAtMs @12 :UInt64;
schemaVersion @13 :UInt32;
storeEpoch @14 :UInt64;
recordVersion @15 :UInt64;
policyEpoch @16 :UInt64;
previousHash @17 :Data;
contentHash @18 :Data;
}
struct ProfileRef {
profileId @0 :Data;
versionId @1 :Data;
epoch @2 :UInt64;
}
struct StorageRootRef {
storageServiceId @0 :Data;
rootObjectId @1 :Data;
rootKind @2 :StorageRootKind;
schemaVersion @3 :UInt32;
rootVersion @4 :Data;
}
enum StorageRootKind {
namespace @0;
}
enum AccountStatus {
active @0;
disabled @1;
locked @2;
recoveryOnly @3;
}
struct ResourceProfile {
profileId @0 :Data;
versionId @1 :Data;
epoch @2 :UInt64;
homeQuotaBytes @3 :UInt64;
tempQuotaBytes @4 :UInt64;
processLimit @5 :UInt32;
threadLimit @6 :UInt32;
capLimit @7 :UInt32;
memoryCommitLimitBytes @8 :UInt64;
frameGrantLimitPages @9 :UInt64;
endpointQueueLimit @10 :UInt32;
inFlightCallLimit @11 :UInt32;
retired12 @12 :UInt32; # was pending IPC submission quota; do not reuse
ringScratchLimitBytes @13 :UInt64;
logQuotaBytesPerWindow @14 :UInt64;
networkProfile @15 :Text;
cpuBudgetUsPerWindow @16 :UInt64;
cpuWindowUs @17 :UInt64;
timerWaiterLimit @18 :UInt32;
launcherProfile @19 :Text;
}
struct ExternalIdentityBinding {
bindingId @0 :Data;
provider @1 :Text;
subjectHash @2 :Data; # hash(provider kind, issuer, tenant, subject)
principalId @3 :Data;
tenant @4 :Text;
acceptedClaims @5 :List(Text);
expiresAtMs @6 :UInt64;
policyProfile @7 :ProfileRef;
resourceProfile @8 :ProfileRef;
schemaVersion @9 :UInt32;
storeEpoch @10 :UInt64;
recordVersion @11 :UInt64;
policyEpoch @12 :UInt64;
previousHash @13 :Data;
contentHash @14 :Data;
}
homeRoot is a persistent reference that the account/storage broker resolves
into a live Namespace capability at session-bundle time. It is not a path,
not a raw Directory, and not itself a capability. Compatibility Directory
views are projections returned only when a workload needs file-like APIs.
Manifest seed records and local account records may name roles and profiles,
but the resulting authority is still the CapSet returned by
AuthorityBroker. A disabled or locked account can authenticate only to
explicit recovery flows allowed by its account state and current policy.
Stable ID Formats
Names are display and lookup hints only. They must not be treated as authority or as stable cross-store identity. All durable IDs used for account-store joins should be opaque binary values with a declared version and fixed length:
- Local principals:
principalIdis a 32-byte opaque random value minted by the local account store or imported from a trusted recovery record. User names, display names, POSIX names, and email addresses are attributes, not identifiers. - Account records:
recordIdis a 32-byte opaque record identity. It may equalprincipalIdonly if the store permanently enforces one account record per local principal; otherwise it must be separate. - External bindings:
subjectHashis a 32-byte hash over canonical provider kind, issuer, tenant, and external subject.bindingIdis a 32-byte opaque or content-derived ID over the normalized binding tuple plus the local principal ID. Provider display names and group strings are not authority. - Policy and resource profiles:
profileIdis a 32-byte opaque profile identity.versionIdis a 32-byte content hash of the canonical profile body, schema version, parent version if any, and effective constraints. Profile display names such asoperatororguest-shellare aliases. - Policy versions: policy bundles use a 32-byte
versionIdplus a monotonically increasingpolicyEpoch. Brokers refuse grants when the session/profile summary names a stale epoch. - Storage roots:
storageServiceId,rootObjectId, androotVersionare storage-service-owned opaque binary identifiers. A storage root is never a path or user name; the storage broker resolves it into a liveNamespaceonly after current policy permits the grant.
Version, Rollback, and CAS Rules
Disk-backed account-store records must be rejected unless their integrity and
freshness checks pass. The minimum record header is schemaVersion,
storeEpoch, recordVersion, policyEpoch, previousHash, and
contentHash. schemaVersion selects the decoder and migration policy.
storeEpoch is a monotonic store-wide epoch advanced for every accepted
mutation batch. recordVersion is monotonic per record. policyEpoch binds
the record to the policy/profile generation used to evaluate it.
previousHash chains the prior accepted canonical record bytes when a previous
record exists, and contentHash covers the canonical bytes excluding the
hash field itself.
Mutations use compare-and-set semantics:
update(recordId, expectedStoreEpoch, expectedRecordVersion, expectedHash, patch)
-> accepted(newStoreEpoch, newRecordVersion, newHash)
-> stale(currentStoreEpoch, currentRecordVersion, currentHash)
-> denied(reason)
Administrative tools must submit the last observed epoch, version, and hash. The store accepts an update only when those values match the current durable record and the new record validates against the active schema and policy epoch. Replayed records, older store epochs, lower or equal record versions, hash-chain breaks, unknown schema versions, profile versions not recognized by the active policy bundle, and missing rollback metadata are fail-closed denials. A failed check may leave the account disabled for ordinary login while allowing only explicit recovery identities to inspect or repair it.
The account store should persist a signed or sealed store checkpoint that
records the latest storeEpoch, account-store installation ID, accepted
policy epoch, and root hash. If the checkpoint says a later epoch existed than
the records currently on disk, the store is in recovery mode and must not let
disk account records override manifest seed data or widen authority.
Recovery Import and Seed Repair
Manifest seed data is the recovery source when the local account store is missing, unreadable, or rollback-damaged. Recovery records should include first-operator or break-glass principal IDs, recovery credential references, profile refs, storage-root repair refs, import/export record IDs, allowed repair operations, expiry or quorum requirements, and audit requirements. Recovery identities are not normal operators: their default session bundle is limited to inspecting account-store state, exporting/importing records, disabling stale bindings, and applying exact-target repairs.
Import from seed or offline export is additive and conservative:
- preserve local
principalId,recordId, profile IDs, storage-root refs, and externalbindingIdvalues when their hashes and epochs validate; - import missing seed operators, service identities, recovery identities, and minimum guest/anonymous profiles needed to boot and repair the system;
- disable, not delete, external bindings whose provider, tenant, subject hash, policy epoch, or profile version cannot be validated;
- never auto-map a new external subject to a broader local role or profile than the signed seed/import record names;
- never widen caps, quotas, storage roots, roles, or approval paths as a side effect of recovery import;
- emit audit records for import start, source identity, records accepted, records preserved, records disabled, denials, and the final store epoch.
If audit storage is unavailable, recovery may continue only into a bounded emergency mode whose transcript is written to the best available append-only sink and whose repaired accounts remain disabled for ordinary login until an auditable store checkpoint is committed.
Session Startup Flow
flowchart TD
Input[Login, guest, or anonymous request]
Auth[Authentication or guest policy]
Source[Manifest seed, account store, or external binding]
Session[UserSession cap]
Broker[AuthorityBroker / PolicyEngine]
Bundle[Scoped cap bundle]
Shell[Native, agent, or POSIX shell]
Audit[AuditLog]
Input --> Auth
Auth --> Source
Source --> Session
Session --> Broker
Broker --> Bundle
Bundle --> Shell
Broker --> Audit
Shell --> Audit
The shell proposal’s minimal daily cap set is a session bundle:
terminal TerminalSession
self self/session introspection
status read-only SystemStatus
logs read-only LogReader scoped to this principal/session
home Directory or Namespace scoped to account storage
launcher restricted launcher for approved user applications
approval ApprovalClient
The shell still cannot mint additional authority. It can ask
ApprovalClient for a plan-specific grant, and a trusted broker can return
a narrow leased capability if policy and authentication allow it.
The terminal cap is the session-scoped foreground TerminalSession, not the
boot debug Console; login hands that terminal into the shell bundle only
after authentication or explicit guest/setup policy succeeds. The concrete
default-boot login/setup flow that consumes this bundle is documented in
Boot to Shell, and the shell-side
contract for receiving and inspecting it lives in
Shell.
Detailed decomposition for manifest-seeded accounts, disk-backed account storage, default resource bundles, local roles, RBAC, ABAC, MAC/MIC labels, POSIX profile metadata, and external identity bindings lives in Local Users, Storage, and Policy.
Multi-User Workloads
capOS should support two normal multi-user patterns.
Per-Session Subtree
The session owns a shell or supervisor subtree. Every child process receives an explicit CapSet assembled from the session bundle plus workflow-specific grants.
Example:
- Alice’s shell receives
home = Namespace("/users/alice"). - Bob’s shell receives
home = Namespace("/users/bob"). - The same editor binary launched from each shell receives different
homeandterminalcaps. - The editor cannot cross from Alice’s namespace into Bob’s unless a broker deliberately grants a sharing cap.
This is the right default for interactive applications and POSIX shells.
Shared Service With Per-Client Session Authority
A server process may handle many users in one address space. It should not infer authority from a caller’s self-reported user name, principal ID, role name, or endpoint label. Instead, a trusted issuer binds the subject before the service accepts it:
- authentication or admission creates a live
SessionContext; - a spawned process receives exactly one immutable session context, installed
at spawn time by
kernel/src/session_context.rs(see Service Architecture); AuthorityBrokergrants service roots or narrower facets for that session;- endpoint calls expose privacy-preserving caller-session metadata by default;
- subject details are disclosed only when the method/call explicitly requests disclosure and a broker/service-granted disclosure scope allows the named fields;
- quota donations or accounting caps may accompany service grants when server-side state needs explicit resource backing.
The service uses the caller session reference, disclosed subject facts, and service-local records to select scoped storage, enforce per-client limits, emit audit records, and return narrowed caps. Endpoint badges are not a normal identity mechanism; any remaining badge-shaped kernel field should be treated as internal endpoint transport state during the migration. This is the right shape for HTTP services, databases, log services, terminals, and shared daemons.
Service Accounts
Service identities are principals too. They are usually non-interactive and receive caps from init, a supervisor, or a deployment manifest rather than from a human login flow.
Service-account policy should be explicit:
- which binary or measured package may use the identity,
- which supervisor may spawn it,
- which caps are in its base bundle,
- which caps it may request from a broker,
- which audit stream records its activity.
Service account records may be manifest seeded or stored in the local account store, but their sessions should receive no terminal and no interactive bundle. They launch as workloads with measured binary, supervisor, service name, network/IPC, log, state namespace, and key-use constraints.
Anonymous, Guest, and Pseudonymous Access
These are distinct profiles.
Empty Cap Set
An untrusted ELF with an empty CapSet is not a user session. It is the
roadmap’s “Unprivileged Stranger”: code with no useful authority. It can
terminate itself and interact with the capability transport, but it cannot
reach a resource because it has no caps. The visible proof was achieved by
commit d4016ab at 2026-04-22 16:35 UTC.
Anonymous
Anonymous means unauthenticated and usually remote or programmatic. It should receive a random ephemeral principal ID and a very small cap bundle.
Typical properties:
- no durable home namespace by default,
- strict CPU, memory, outstanding-call, and log quotas,
- short session expiry,
- no elevation path except “authenticate” or “create account”,
- audit records keyed by ephemeral session ID and network/service context.
Guest
Guest means an interactive local profile with weak or no authentication.
Typical properties:
- terminal/UI access,
- temporary namespace,
- optional ephemeral home reset on logout,
- restricted launcher,
- no administrative approval path unless policy grants one explicitly,
- clearer user-facing affordance than anonymous.
Pseudonymous
Pseudonymous means durable identity without necessarily naming a human. A public key, passkey, service token, or cloud identity can select the same principal across sessions. This can receive persistent storage and quotas while still remaining separate from a verified human account.
External pseudonymous sessions require explicit admission configuration. A binding either maps the external subject to an existing local account or allows auto-creation of a tenant-scoped account with named policy and resource profiles. Durable storage is granted only through that local principal mapping and a broker-minted storage cap.
POSIX Compatibility
POSIX user concepts are compatibility metadata, not authority.
uid,gid, user names, groups,$HOME,/etc/passwd,chmod, andchownlive inlibcapos-posix, a filesystem service, or a profile service.open("/home/alice/file")succeeds only if the process has aDirectoryorNamespacecap that resolves that synthetic path.setuidcannot grant new caps. At most it asks a compatibility broker to replace the process’s POSIX profile or launch a new process with a different cap bundle.- POSIX ownership bits may influence one filesystem service’s policy, but they cannot authorize access to caps outside that service.
This lets existing programs inspect plausible user metadata without making Unix permission bits the capOS security model.
Policy Models
RBAC, ABAC, and mandatory access control fit capOS as grant-time and
mint-time policy. They should mostly live in ordinary userspace services:
AuthorityBroker, PolicyEngine, SessionManager, RoleDirectory,
LabelAuthority, AuditLog, and service-specific attenuators.
The kernel should keep enforcing capability ownership, generation, transfer rules, revocation epochs, resource ledgers, and process isolation. It should not evaluate roles, attributes, or label lattices on every capability call.
RBAC
Role-based access control maps principals or sessions to named role sets. Roles select cap bundles and approval eligibility.
Examples:
developercan receive a launcher for development tools and read-only service logs.net-operatorcan request a leasedServiceSupervisor(net-stack).storage-admincan request repair caps for selected storage volumes.
Implementation shape:
interface RoleDirectory {
rolesFor @0 (principal :Data) -> (roles :List(Text));
}
interface AuthorityBroker {
request @0 (
session :UserSession,
plan :ActionPlan,
requestedCaps :List(CapRequest),
durationMs :UInt64
) -> (grant :ApprovalGrant);
# Mint an ApprovalInbox for the bound session. The broker policy
# decides whether the requesting session is allowed to triage
# approvals and which entries are visible (own requests only,
# role-scoped queue, multi-party reviewer queue).
inbox @1 (
session :UserSession
) -> (inbox :ApprovalInbox);
}
The detailed ActionPlan, ActionStep, CapRequest, GrantedCap,
ApprovalInbox, ApprovalEntry, and ApprovalListener schemas live
in Shell under
Approval and Authentication. The broker is the single producer for
both ApprovalGrant (the requester-side handle) and ApprovalInbox
(the decider-side handle); they meet only at the broker, never on a
shared transport channel.
Roles do not bypass capabilities. They only let a broker decide whether it may mint or return particular scoped caps.
The role/attribute/decision split matches the ITU-T X.812 Access control framework (= ISO/IEC 10181-3) decomposition into ADF (access-control decision function) and AEF (access-control enforcement function). In capOS terms:
- The AEF is the
CapObject::calldispatch plus wrapper caps: the enforcement point that cannot be bypassed because it is the only path to the underlying object. - The ADF is the
PolicyEngine/AuthorityBroker: it evaluates a decision request and returns a capability (or refuses) rather than returning a boolean that downstream code might ignore.
The ADF/AEF split is why capOS can make PolicyDecision a
cap-minting input rather than a per-call allow/deny flag — the
enforcement point is already structural (you need a cap to reach the
object) and the decision point returns the cap.
Remote Client Bundles
Remote programmatic and GUI clients consume the same identity and policy model as shells, but they need a different bundle shape. A remote host app may authenticate with password, public key, OIDC, passkey/WebAuthn, mTLS, guest/anonymous admission, or a service/workload credential. After admission, the broker returns a remote-client bundle whose entries are exported as Cap’n Proto RPC object references by a per-session gateway worker.
Those references are live capability proxies, not bearer tokens and not local
cap-table metadata. A remote bundle may include session, systemInfo, and
specific service caps such as chat, paperclips, or command surfaces. It
should not inherit terminal, launcher, broad storage, raw network, key-vault,
credential-store, or process-spawn authority merely because an operator shell
profile would receive some of those caps. The detailed transport and lifetime
rules live in
Remote Session CapSet Clients.
ABAC
Attribute-based access control evaluates a richer decision context:
- subject attributes: principal kind, roles, auth strength, session age, device posture, locality,
- action attributes: requested method, target service, destructive flag, requested duration,
- object attributes: service name, namespace prefix, data class, owner principal, sensitivity,
- environment attributes: time, boot mode, recovery mode, network location, cloud instance metadata, quorum state.
ABAC is useful for contextual narrowing:
- allow log read only for the caller’s session unless break-glass policy is active,
- issue
ServiceSupervisor(net-stack)only with fresh hardware-key auth, - grant
Namespace("/shared/project")read-write only during a maintenance window, - deny network caps to guest sessions.
ABAC decisions should return capabilities, wrappers, or denials. They should not create hidden ambient checks downstream.
OAuth2 scopes and OIDC claims (acr, amr, groups, tenant-specific
fields) are ABAC inputs. The broker consumes them alongside session
freshness, object attributes, and environment state to pick a cap
bundle or decline. They never authorize capability calls directly,
and they do not create a downstream check outside the broker’s
decision path. See
OIDC and OAuth2.
ABAC Policy Engine Choices
Do not invent a policy language first. The capOS-native interface should be small and capability-shaped, while the broker implementation can start with a mainstream engine behind that interface.
Recommended order:
-
Cedar for runtime authorization. Cedar’s request shape is already close to capOS:
principal,action,resource, andcontext. It supports RBAC and ABAC in one policy set, has schema validation, and has a Rust implementation. That makes it the best fit forAuthorityBrokerandMacBrokerservice prototypes. -
OPA/Rego for host-side and deployment policy. OPA is widely used for cloud, Kubernetes, infrastructure-as-code, and admission-control style checks. It is useful for validating manifests, cloud metadata deltas, package/deployment policies, and CI rules. The Wasm compilation path is worth tracking for later capOS-side execution, but OPA should not be the first low-level runtime dependency.
-
Casbin for quick prototypes only. Casbin is useful for simple RBAC/ABAC experiments and has Rust bindings, but its model/matcher style is less attractive as a long-term capOS policy substrate than Cedar’s schema-validated authorization model.
-
XACML for interoperability and compliance, not native policy. XACML remains the classic enterprise ABAC standard. It is useful as a conceptual reference or import/export target, but it is too heavy and XML-centric to be the native capOS policy language.
The capOS service boundary should hide the selected engine:
interface PolicyEngine {
decide @0 (request :PolicyRequest) -> (decision :PolicyDecision);
}
struct PolicyRequest {
principal @0 :PrincipalInfo;
action @1 :Text;
resource @2 :ResourceRef;
context @3 :List(Attribute);
}
struct PolicyDecision {
allowed @0 :Bool;
reason @1 :Text;
leaseMs @2 :UInt64;
constraints @3 :List(Attribute);
}
PolicyDecision is still not authority. It is input to a broker that returns
actual caps, wrapper caps, leased caps, or denial.
References:
- Cedar policy language docs: https://docs.cedarpolicy.com/
- Amazon Verified Permissions concepts: https://docs.aws.amazon.com/verifiedpermissions/latest/userguide/terminology.html
- Open Policy Agent docs: https://www.openpolicyagent.org/docs
- Casbin supported models: https://www.casbin.org/docs/supported-models
- OASIS XACML technical committee: https://www.oasis-open.org/committees/xacml/
- ITU-T Rec. X.812 (11/95) — Information technology - Open Systems Interconnection - Security frameworks for open systems: Access control framework. ADF/AEF terminology.
- ITU-T Rec. X.741 (10/95) — Systems Management: Objects and attributes for access control. Concrete managed-object attributes for ACLs, ACIs, default access, and access-decision inputs.
Mandatory Access Control
Mandatory access control is non-bypassable policy set by the system owner or deployment, not discretionary sharing by ordinary users. In capOS, MAC should be implemented as mandatory constraints on cap minting, attenuation, transfer, and service wrappers.
Examples:
- a
Secretcap labeledhighcannot be transferred to a workload labeledlow, - a
LogReaderfor security logs cannot be granted to a guest session even if an application asks, - a recovery shell can inspect storage read-only but cannot write without a separate exact-target repair cap,
- cloud user-data can add application services but cannot grant
FrameAllocator,DeviceManager, or raw networking authority.
Implementation components:
enum Sensitivity {
public @0;
internal @1;
confidential @2;
secret @3;
}
struct SecurityLabel {
domain @0 :Text;
sensitivity @1 :Sensitivity;
compartments @2 :List(Text);
}
interface LabelAuthority {
labelOfPrincipal @0 (principal :Data) -> (label :SecurityLabel);
labelOfObject @1 (object :Data) -> (label :SecurityLabel);
canTransfer @2 (
from :SecurityLabel,
to :SecurityLabel,
capInterface :UInt64
) -> (allowed :Bool, reason :Text);
}
For ordinary services, MAC can be enforced by brokers and wrapper caps. For high-assurance boundaries, the remaining question is whether transfer labels need kernel-visible hold-edge metadata. That should be added only for a concrete mandatory policy that cannot be enforced by controlling all grant paths through trusted services.
The attribute model borrows from ITU-T X.741, which enumerates the
managed-object attributes a directory-based access-control system
tracks: ACL entries, access-control information (ACI), default access,
initiator ACI, target ACI, and access-decision outcome. X.741 targets
the X.500 directory, so the schema does not port directly, but the
attribute taxonomy is a good completeness check for what
LabelAuthority and PolicyEngine requests should expose to a
decision engine.
GOST-Style MAC and MIC
Russian GOST framing is stricter than the generic “MAC means labels” summary. The relevant standards split at least two policies that capOS should keep separate:
-
Mandatory access control for confidentiality. ГОСТ Р 59383-2021 describes mandatory access control as classification labels on resources and clearances for subjects. ГОСТ Р 59453.1-2021 goes further: a formal model that includes users, subjects, objects, containers, access levels, confidentiality levels, subject-control relations, and information flows. The safety goal is preventing unauthorized flow from an object at a higher or incomparable confidentiality level to a lower one.
-
Mandatory integrity control for integrity. ГОСТ Р 59453.1-2021 treats this separately from confidentiality MAC. The integrity model constrains subject integrity levels, object/container integrity levels, subject-control relationships, and information flows so lower-integrity subjects cannot control or corrupt higher-integrity subjects.
For capOS, this should map to labels on sessions, objects, wrapper caps, and eventually hold edges:
struct ConfidentialityLabel {
level @0 :Text; # e.g. public, internal, secret.
compartments @1 :List(Text);
}
struct IntegrityLabel {
level @0 :Text; # ordered by deployment policy.
domains @1 :List(Text);
}
struct MandatoryLabel {
confidentiality @0 :ConfidentialityLabel;
integrity @1 :IntegrityLabel;
}
Capability methods need a declared flow class. capOS cannot rely on generic
read and write syscalls:
- read-like:
File.read,Secret.read,LogReader.read; - write-like:
File.write,Namespace.bind,ManifestUpdater.apply; - control-like:
ProcessSpawner.spawn,ServiceSupervisor.restart; - transfer-like:
CAP_OP_CALL,CAP_OP_RETURN, and result-cap insertion when they carry caps or data across labeled domains.
Initial rules can be expressed as broker/wrapper checks:
read data-bearing cap:
subject.clearance dominates object.classification
write data-bearing cap:
target.classification dominates source.classification
# no write down
control process or supervisor:
controlling subject is same label, or is an explicitly trusted subject
integrity write/control:
writer.integrity >= target.integrity
This is not enough for a GOST-style formal claim, because uncontrolled cap transfer can bypass the broker. A higher-assurance design needs:
- kernel object identity for every labeled object,
- label metadata on kernel objects or per-process hold edges,
- transfer-time checks for copy, move, result caps, and endpoint delivery,
- explicit trusted-subject/declassifier caps,
- an audit trail for every label-changing or declassifying operation,
- a formal state model covering users, subjects, objects, containers, access rights, accesses, and memory/time information flows.
The proposal therefore has two levels:
- Pragmatic capOS MAC/MIC: userspace brokers and wrapper caps enforce labels on grants and method calls.
- GOST-style MAC/MIC: a formal information-flow model plus kernel-visible labels/hold-edge constraints for transfers that cannot be forced through trusted wrappers. See Formal MAC/MIC for the dedicated abstract-automaton and proof track.
References:
- ГОСТ Р 59383-2021, access-control foundations: https://lepton.ru/GOST/Data/752/75200.pdf
- ГОСТ Р 59453.1-2021, formal access-control model: https://meganorm.ru/Data/750/75046.pdf
Composition Order
When policies compose, use this order:
- Mandatory policy defines the maximum possible authority.
- RBAC selects coarse eligibility and default bundles.
- ABAC narrows the decision for context, freshness, object attributes, and requested duration.
- The broker returns specific capabilities or denies the request.
- Audit records the plan, decision, grant, use, release, and revocation.
The composition result is still a CapSet, leased cap, wrapper cap, or denial.
Service Architecture
The policy stack should be decomposed into ordinary capOS services. Init or a trusted supervisor grants broad authority only to the small services that need to mint narrower caps.
SessionManager
Creates and manages session metadata/control caps:
guest()for local guest sessions,anonymous(purpose)for ephemeral unauthenticated callers,login(method, proof)for authenticated users,renew(session, proof, requestedDurationMs)for narrow continuation or recovery when policy allows it,logout(session)through theUserSessioncontrol cap.
The first implementation can be manifest-seed backed. It does not need a
persistent account database, but its seed records must use the same principal,
account, policy-profile, and resource-profile vocabulary as the later local
account store. UserSession should describe the principal, session ID, policy
profile, resource profile, auth strength, expiry, and audit context. It should
not be a general-purpose authority vending machine unless it was itself minted
as a narrow wrapper around a fixed cap bundle. Session IDs should come from the
same dedicated entropy source that the bootstrap login/setup flow in
Boot to Shell uses for credential
salts and setup tokens; if fresh randomness is unavailable, authenticated
session creation should fail closed instead of recycling predictable IDs.
SessionManager should own the mutable liveness cell for sessions it mints. The
kernel-installed process SessionContext (owned by
kernel/src/session_context.rs; see
Service Architecture) remains
immutable; renewal changes the cell or produces a successor session, not a new
subject label inside the same process. This is the mechanism that makes
long-running shells usable without treating fixed short wall-clock expiry as
the only safety boundary.
Safer first split:
SessionManager -> UserSession metadata cap
AuthorityBroker(session, policyProfile, resourceProfile) -> base cap bundle
Supervisor/Launcher -> spawn shell with that bundle
AuthorityBroker
The broker owns or receives powerful caps from init/supervisors and returns narrow caps after RBAC, ABAC, and mandatory checks.
Examples:
- broad
ProcessSpawner->RestrictedLauncher(allowed = ["shell", "editor"]), - broad
NamespaceRoot->Namespace("/users/alice"), - broad
ServiceSupervisor->LeasedSupervisor("net-stack", expires = 60s), - broad
BootPackage->BinaryProvider(allowed = ["shell", "editor"]).
The broker is the normal policy decision and cap minting point.
AuditLog
Append-only audit interface. Initially this can write to serial or a bounded log buffer; later it should be Store-backed.
Record at least:
- session creation,
- cap request,
- policy input summary,
- policy decision,
- cap grant,
- cap release or revocation,
- denial,
- declassification or relabel operation.
Audit entries must not contain raw auth proofs, private keys, bearer tokens, or broad environment dumps. For auth/session flows, the initial backend should record opaque credential/token record IDs, volatility flags, and policy/result codes rather than secret-bearing payloads. Failed pre-auth attempts should log only a terminal-local event ID and generic failure class; do not emit principal-identifying fields to the serial-backed path before authentication actually succeeds.
RoleDirectory
Role lookup should start static and boot-config backed:
guest -> guest-shell
alice -> developer
ops -> net-operator
net-stack -> service:network
This is enough for early RBAC bundles. Dynamic role assignment moves into the local account store once persistent storage and administrative tooling exist. Provider groups are not imported as roles automatically; a binding rule may map a provider group to a local role only for a named provider/tenant, expiry, and policy version.
LabelAuthority
Owns the label lattice and dominance checks. In the pragmatic phase, it is a userspace dependency of brokers and wrappers. In a GOST-style phase, the same lattice needs a kernel-visible representation for transfer checks.
Wrapper Caps
Wrappers are the main mechanism. Prefer them over per-call ACL checks in a central service:
RestrictedLauncherwrapsProcessSpawner.ScopedNamespacewraps a broader namespace/store.ScopedLogReaderfilters by session ID or service subtree.LeasedSupervisorwraps a broader supervisor with expiry and target binding.ApplicationManifestUpdaterrejects kernel/device/service-manager grants.LabelledEndpointenforces declared data-flow and control-flow constraints.
This keeps authority visible in the capability graph.
Bootstrap Sequence
Early boot can be static:
init
-> starts AuditLog
-> starts SessionManager
-> starts AuthorityBroker with broad caps
-> asks broker for a system, guest, or operator shell bundle
-> spawns shell through a restricted launcher
Before durable storage exists, policy config comes from BootPackage /
manifest config. Early authentication may still use bootstrap verifier or
public-credential records plus guest/anonymous/local-presence profiles, but it
must keep fresh-entropy requirements fail-closed and treat any RAM-only
credential or disable-state changes as volatile.
Revocation, Audit, and Quotas
User/session policy depends on the Stage 6 authority graph work:
- one-session-per-process plus privacy-preserving endpoint caller-session metadata lets shared services distinguish session/client relations; receiver selectors are only routing metadata,
- mutable session liveness cells distinguish live, logged-out, revoked, expired, and recovery-only sessions without relabeling running processes,
- resource ledgers and session quotas prevent denial-of-service through session creation,
CAP_OP_RELEASEand process-exit cleanup reclaim local hold edges,- epoch revocation lets a broker invalidate leased or compromised caps,
- renewal mints or refreshes session/grant leases under policy; it must not revive stale ordinary grants by accident,
- audit logs record the cap grant and release lifecycle.
The cross-cutting quota model lives in Resource Accounting and Quotas. Account and session resource profiles are templates; brokers, supervisors, and resource owners translate them into concrete ledgers and wrapper caps.
Audit should record identity and policy metadata, but it should not contain secrets, raw authentication proofs, or broad environment dumps.
Implementation Plan
-
Document the model. Keep user identity out of the kernel architecture, publish the principal/user/account/profile/session/role/workload vocabulary, and link this proposal from the shell, service, storage, and roadmap docs.
-
Manifest-seeded account and profile schema. Define boot-package seed records for first operators, recovery identities, service identities, guest policy, policy profiles, resource profiles, and initial role bindings. Validate that seed data names policy inputs only and does not grant ordinary accounts privileged kernel caps directly.
-
Session-aware native shell profile. Treat the shell proposal’s minimal daily cap set as a session bundle. Add
self/sessionintrospection and scopedlogs/homecaps once the underlying services exist. -
Authority broker and audit log. Add
ActionPlan,ActionStep,CapRequest,ApprovalClient,ApprovalInbox,ApprovalEntry, leased grant records, and an append-only audit path. The shell-proposal Approval and Authentication section defines the schemas; the broker is the single producer for both the requester-sideApprovalGrantand the decider-sideApprovalInbox. Start with RBAC-style policy/resource profile bundles and explicit local authentication. -
Local account store and external bindings. Add a Store/Namespace-backed
AccountStorefor account records, credential references, role bindings, external identity bindings, policy versions, resource profiles, and storage-root references. Include version and rollback checks before treating disk-backed account mutation as durable. -
ABAC policy engine. Extend the broker decision with session freshness, auth strength, object attributes, requested duration, and environment state. Prefer Cedar for the runtime broker interface; use OPA/Rego for host-side manifest and deployment checks. Keep decisions visible in audit records.
-
Mandatory policy labels. Add pragmatic labels to policy-managed services and wrappers. Keep confidentiality and integrity separate. Defer kernel-visible labels until a specific MAC/MIC policy cannot be enforced by trusted grant paths.
-
Guest and anonymous demos. Show a guest shell with
terminal,tmp, and restrictedlauncher, and show an anonymous workload with strict quotas and no persistent storage. -
POSIX profile adapter. Provide synthetic
uid/gid,$HOME,/etc/passwd, and cwd behavior from session policy/resource profiles and granted namespace caps. -
GOST-style formalization checkpoint. If capOS later claims GOST-style MAC/MIC, write the abstract state model before implementation: users, subjects, objects, containers, access rights, accesses, labels, control relations, and information flows. Then decide which labels must become kernel-visible.
Non-Goals
- No kernel
uid/gid. - No ambient
root. - No global login namespace in the kernel.
- No authorization from serialized identity structs.
- No model-visible authentication secrets.
- No POSIX permission bits as system-wide authority.
- No per-call role/attribute/label interpreter in the kernel fast path.
- No claim of GOST-style MAC/MIC until the formal model and transfer enforcement story exist.
Open Questions
- Which session interfaces are needed before persistent storage exists?
- Which audit store is acceptable before durable storage and replay exist?
- Which MAC policies, if any, justify kernel-visible hold-edge labels?
- How should remote capnp-rpc or future OCapN/CapTP-style identities map into local principals? Transport identity, locator hints, and routing metadata are not local user/session identity by themselves; remote peers should enter through broker/session policy rather than raw protocol fields. See Cloudflare, Cap’n Proto, Workers RPC, and Cap’n Web and Spritely, OCapN, and CapTP.
- Should the first broker prototype embed Cedar directly, or use a simpler hand-written evaluator until the policy surface stabilizes?
Proposal: Default User Avatar From Identity Hash
How capOS should pick a default avatar for an account or session in a way that is deterministic, stable across reboots and devices, free of network side-effects, and easy for the user to override with an explicit choice.
Status: partially implemented. The current tree implements the first
shell-side phase without adding schema, kernel state, or broker state. The
native shell derives a default avatar from existing UserSession.info data:
account, operator, and service sessions use their principal identifier, while
anonymous and guest sessions use their minted session identifier. It uses a
shell-local BLAKE2b-512 domain prefix over the same framed identity input
shape, then prints the selected set-flat tile in session output as
avatar=set-flat/<asset-stem> avatar_source=default avatar_override=none.
The capability-carried UserSession.avatar method, durable account override
storage, active-set discovery through SystemInfo, and remote-session view
model propagation remain future phases.
Problem
Today every UserSession is metadata-only: name (sometimes), profile class
(anonymous / guest / operator / future durable accounts as defined by
user identity and policy), and a
session-token entropy field. Any UI that needs to show “who is this” — the
boot-to-shell login prompt, the shell prompt,
the remote-session client, future GUI — has nothing to draw beyond a
profile-class fallback. The consequences:
- New accounts and anonymous sessions look identical even when they have different identities, which is misleading in any multi-account context.
- If an admin assigns avatars by hand, the assignment lives outside the identity surface and is not stable across re-imports of the account.
- Without an authority-controlled default, every UI invents its own, including potentially a Gravatar-style network call that exposes the account email to a third party.
The branding asset set already ships 144 curated rounded-card tiles
(branding/user-icons/set-flat/, 72 icons; branding/user-icons/set-modern/,
72 icons). They are typed as user avatars but have no consumer yet.
Goals
- Every account and session resolves to a concrete avatar without relying on network lookups or external services.
- The default is stable: the same identity always resolves to the same tile, on every host that imports the account, until the user explicitly sets an override.
- The default is derived from a stable identifier, not from mutable profile fields like display name. Renaming an account does not change its avatar.
- The override is persistent and travels with the account, not with a per-host UI preference store.
- Anonymous and short-lived sessions still get some deterministic avatar so they look distinct from each other within a session lifetime, without leaking durable identity.
- The avatar surface is a capability, not an ambient lookup. UI code asks
for an
Avatarfrom theUserSessionit already holds.
Non-Goals
- Generative identicons (jdenticon-style pixel art). The curated tile sets are already on disk and visually consistent with the rest of the branding.
- Per-user avatar uploads. The override is a selection from the shipped set for now; arbitrary blob uploads are a separate, larger design question (storage, scanning, capability scope).
- Avatar themes that follow OS dark/light mode. Theme handling is the responsibility of the rendering surface; the identity layer commits to a single tile per account.
- Group/role icons, badges, presence indicators. Those layer on top of the avatar, they do not replace it.
Design
Identity Inputs
The hash input is the stable account identifier, never the display name and never anything that can be rotated for security reasons. Every subject class is length-framed and domain-tagged, so an attacker who can choose bytes for one class cannot synthesize a collision against another:
input := classTag || u16(len(field_1)) || field_1 || ... || u16(len(field_k)) || field_k
| Subject | Class tag | Fields (in order) |
|---|---|---|
| Durable account | "acct" | principalId |
| Manifest-seeded operator account | "oper" | principalId (resolved to the seeded operator) |
| Service identity | "svc " | principalId (manifest or registry) |
| Federated account | "fed " | providerKind, issuer, tenant, subject |
| Anonymous session | "anon" | session-token entropy |
| Guest session | "gst " | session-token entropy |
Class tags are 4 fixed bytes (space-padded where shorter) so the input
prefix is unambiguous without needing a separator. The federated layout
matches the canonical external subject key from
user identity and policy: the same
(providerKind, issuer, tenant, subject) tuple that produces
AccountExternalBinding.subjectHash — subject alone is not unique across
identity providers and must not be used directly. Length-framing ensures
that, e.g., (issuer="A", tenant="BC") and (issuer="AB", tenant="C")
hash differently even though their concatenations would otherwise be equal.
Hash and Mapping
message = "capos-avatar-v1" || 0x00 || input
digest = BLAKE2b-512(message)
tile_index = u32_be(digest[0:4]) % len(active_set)
BLAKE2bis the digest primitive named by the cryptography and key management proposal; no new primitive."capos-avatar-v1"is the public domain-separation tag, not a secret key. The current shell-side phase prepends it to the BLAKE2b-512 message with a zero separator rather than using the BLAKE2 parameter block, so the mapping is explicit in the no-std shell code. The avatar selection is fully derivable from public account metadata; no MAC and no HKDF subkey derivation is involved. Bumpingv1tov2would let us re-issue defaults across the fleet (e.g., if a future tile set deprecates an icon) without affecting any other hash that consumes the same identifier.u32_beover the first four digest bytes is sufficient: the tile-set sizes (72) are far smaller than2^32, and modulo bias on a 32-bit space against 72 buckets is below2^-25— visually irrelevant.- Collisions are fine and expected: with 72 tiles, two arbitrary accounts collide with probability ~1.4%; in a tenant of 36 users, the birthday probability of any collision is roughly 50%. The avatar conveys identity hint, not identity proof. Higher-assurance UIs combine the avatar with the display name and account id.
Active Set
Each system commits to one active set (set-flat or set-modern). The
active set is a system-level configuration value, not a per-user choice, so
that:
- All accounts on a host look stylistically uniform.
- Switching the system theme remaps every account’s default deterministically but consistently — every account moves to its set-modern tile of the same hash position, not to a random new one.
The active set is exposed via SystemInfo.avatarSet (extension to the
existing SystemInfo capability). Future themes add new sets without
reshuffling existing assignments.
The implemented shell-side phase is narrower: it compiles the current
set-flat catalog directly into the native shell and does not add
SystemInfo.avatarSet. That keeps the first proof off the schema serial
surface while still making the default mapping visible to users. The later
capability-carried phase should replace the compiled shell catalog with the
system-advertised active set described above.
Override
A durable account can pin an explicit tile that wins over the hash-derived
default. The override is a new optional field on AccountRecord:
struct AccountRecord {
# ...existing fields @0..@18 from the identity proposal...
avatarOverride @19 :AvatarRef; # zero-length set/name means "no override"
}
struct AvatarRef {
set @0 :Text; # e.g. "set-flat", "set-modern"
name @1 :Text; # e.g. "panda" (the bare semantic name, without NNN- prefix)
}
- The override is mutated through the existing
AccountStoreManager.update(recordId, expectedStoreEpoch, expectedRecordVersion, expectedHash, patch)compare-and-set protocol defined by the identity proposal. Setting or clearing an avatar bumpsrecordVersion, recomputescontentHash, and links topreviousHashexactly like any other field change; nothing about avatar overrides bypasses the record-version, store-epoch, or hash-chain checks. - Validation:
setmust name a set the active build ships, andnamemust resolve to a tile within that set. Records that fail this check at load time fall back to the hash-derived default and emit an audit record; the record is not silently rewritten. - The override is checked first; the hash is the fallback.
updatepatches use the standard “absent field means unchanged” convention. Clearing an override is an explicit operation: the patch must containavatarOverridewith bothsetandnameempty. An unrelated update that omitsavatarOverridefrom its patch must not drop a previously pinned override.- The override is a name, not a tile blob. Storing only the name keeps the account record compact, makes shipped-asset replacement automatic (a re-rendered tile with the same name applies everywhere), and avoids embedding image data in identity records.
- Account export/import carries the field unchanged: since the import
path validates
set/nameagainst the importing host’s shipped tile catalog, an override that names a tile the importing host does not ship is downgraded to the hash-derived default at import time and audited, never silently dropped. - Anonymous and guest sessions cannot pin an override: they are short-lived and have nowhere durable to store it. Their hash result is the only avatar they get.
Capability Surface
UserSession gains:
interface UserSession {
info @0 () -> (info :SessionInfo);
auditContext @1 () -> (sessionId :Data, principalId :Data);
logout @2 () -> ();
avatar @3 () -> (avatar :Avatar);
}
interface Avatar {
# Stable, content-addressed handle for the chosen tile. `digest` is the
# SHA-256 of the encoded WebP bytes, NOT the identity hash. Two accounts
# that resolve to the same tile (whether through hash collision or
# explicit override) return the same `digest`, so UIs can cache by it.
ref @0 () -> (set :Text, name :Text, digest :Data);
# Bytes of the encoded WebP, when the caller is allowed to render it
# locally. Same caps that grant `UserSession` are sufficient; no separate
# avatar-read authority.
read @1 () -> (image :Data, mime :Text);
}
The avatar ordinal @3 follows the existing info @0, auditContext @1,
logout @2 ordinals on UserSession and slots into the next free position.
A future schema change that lands ahead of this one must shift the avatar
ordinal accordingly; the cap-name is the contract, not the ordinal number.
refreturns a content-addressed identifier suitable for caching across reboots without re-reading the bytes. The asset SHA-256 is computed once per shipped tile at build time and is identical across accounts that resolve to the same tile, so UI clients can key their local cache bydigestand dedupe across many sessions. The identity-derived digest from theHash and Mappingsection is internal to the avatar resolver and is not exposed byref.readreturns the WebP bytes from the active set’s tile. The ABI does not expose alternate formats — surfaces that need PNG can decode locally.
When the Avatar is Bound
Resolution happens lazily, the first time avatar() is called on a session:
- If the underlying account has an override, pick that tile.
- Otherwise, hash the account/session identity input, take
index % len(set)in the active set, look upbranding/user-icons/<set>/<NNN>-<name>.webp. - Cache the result on the in-memory session object until the session is torn down.
There is no precomputation step at boot or login; the cost is one domain-separated BLAKE2b digest plus one filesystem read per session, both negligible.
Surfaces That Consume Avatars
- Login UI (text shell
loginper boot to shell, future web login, future GUI): show the avatar next to the typedusername>while waiting for the hiddenpassword>prompt, so the user has a non-cryptographic visual confirmation that they are selecting the account they expect. The avatar itself is not a secret and exposing it pre-auth is intentional — the same is true of display names, and the boot-to-shell flow already accepts pre-auth account selectors. - Shell prompt and
whoami: the current text shell prints a deterministicset-flat/<asset-stem>default in the existingsessionoutput. Future graphical terminals can render the referenced tile inline, and the shell can switch from the compiled catalog to theUserSession.avatarcapability once that schema phase lands. - Remote-session client and Tauri wrapper: the bridge already receives a
view model from the trusted backend; add
avatarto the session view model so the browser/desktop UI never queries identity directly, and the same view-model field carries the operator’s chosen tile afterloginupgrades the anonymous session to an authenticatedUserSessionthroughSessionManager.loginas described in boot to shell. - System monitoring / audit views: the avatar identifies the actor in
human-readable timelines without leaking the underlying id; the audit
trail for override edits flows through the same
AccountStoreManagerrecord-chain the identity proposal already audits.
Anonymous and Guest Sessions
Anonymous and guest sessions, in the sense the user identity and policy and boot to shell proposals already define them, get a hash-derived avatar with these constraints:
- The input uses the four-byte class tags
"anon"and"gst "from the framed-input table above, so an anonymous session and a durable account with the same entropy field do not collide. - The avatar lives only as long as the session-token entropy. Re-anonymizing
through
SessionManager.anonymous()produces a new tile. - The login UI distinguishes
anonymousandguestsessions from durable accounts by a chrome accent (border colour, badge, label), not by reusing one fixed tile. Reusing a fixed tile would make every anonymous user look identical, which loses the “tell sessions apart at a glance” property.
Privacy and Security
- The hash uses a public domain-separation tag, not a secret key. The tile
derivation is fully reproducible from public account metadata; the privacy
guarantee is “the avatar leaks no extra information beyond what the
identity surface already exposes,” not “the underlying id is hidden by
cryptography.” The identity digest never leaves the identity layer — only
its modulo-N tile-index result reaches UI surfaces, embedded in the
resolved
set/name. - Cross-host correlation is intentionally observable. Because the hash
has no per-host salt, the same durable or pseudonymous principal imported
into two hosts produces the same
set/nameanddigeston both. Anyone who watches the avatar surface on multiple hosts can correlate “same account is here too,” and combined with display name or session metadata the avatar acts as a low-entropy identifier. This is the same correlation surface the principal id and external-bindingsubjectHashalready expose, so we treat it as acceptable for ordinary multi-host accounts and call it out explicitly so privacy-sensitive deployments can pin a generic override or set a per-host override policy. - Operators can audit-log avatar overrides as account-record edits, like
any other identity field; the override mutation goes through
AccountStoreManager.updateand produces the same store-epoch / record-version / content-hash audit chain as other record changes. - The avatar is not authentication. Two accounts with the same tile are not equivalent; the system always uses the principal id internally. The avatar is an identification aid for humans, like a display name.
- No network lookups. No Gravatar, no third-party calls.
Open Questions
- Should the hash include a per-system salt so an account imported into two hosts does not always show the same default tile, similar to how Unix uid/gid space is host-local? This proposal currently says no — cross-host stability is more useful than host-local distinctness, since durable accounts already have a globally unique id.
- Should
Avatar.readexpose only the active-set bytes, or bothset-flat/set-modernso a UI can render adaptive variants? Current preference: only the active set. Adaptive themes are the surface’s job, not the identity layer’s. - How should the manifest seed an override for the operator account? A
seed.operator.avatar = "set-flat:robot"field insystem.cueis the natural extension, but only if operators express a need — the hash-derived default is already deterministic.
Relevant Research
- Curated tile set with set-aware rounded-card masking:
branding/extract_user_icons.py,branding/user-icons/set-{flat,modern}/. - Identity surfaces that will host the new method: User Identity, Sessions, and Policy; Boot to Shell; Delegated Subject Context.
- Hash primitive and domain-separation conventions: Cryptography and Key Management.
Proposal: Delegated Subject Context
This proposal records the future model for acting on behalf of another subject. It was intentionally out of scope for the completed Session-Bound Invocation Context milestone and is treated as future work by the User Identity and Policy proposal and the Service Architecture proposal. The current state of the implemented session-bound model and its known residuals is tracked in Design Risks Register entries R2 (session-bound invocation context) and R14 (user identity / policy maturity).
The implemented milestone established the simpler rule first:
capability = authority to invoke
calling process session = who invokes
Cross-session capability transfer may delegate authority to invoke when the
capability’s transfer policy permits it. That is not subject delegation. The
User Identity and Policy proposal
already carries a delegationChain field on SessionInfo that records when a
session was minted through an AuthorityBroker approval flow or a federated
IdP; that field is session provenance, not the per-call represented-subject
context this proposal introduces.
Problem
Some workflows legitimately need a process to act on behalf of a different subject:
- a user asks an agent process to send a chat message for them;
- an operator grants a support session bounded access to perform one action;
- a service account performs a maintenance action for a tenant;
- an approval flow lets a worker complete a task in another principal’s name.
The system must support this without making the receiving process “become” the source subject. The caller’s own process session remains the invoker for audit, resource accounting, and privacy. The represented subject is separate, explicit, scoped, and revocable.
Design
Use a delegated-subject capability:
SubjectDelegation {
source_subject,
delegate_subject_or_session,
target_service,
allowed_methods_or_purpose,
disclosure_scope,
expires_at,
}
The exact ABI may be a SubjectDelegation interface, a broker result cap, or a
service-specific delegation cap. The invariant is stable:
invoked service cap = authority to call
calling process session = invoker
SubjectDelegation = represented subject context
Holding a SubjectDelegation is not enough to call a service. The caller must
also hold the service capability being invoked. This composes cleanly with the
service architecture’s existing rule that authority to act flows through the
service capability itself, not through ambient subject identity; see
Service Architecture proposal.
Example
Bob process session = Bob
Bob holds ChatRoot
Bob holds SubjectDelegation(Alice -> Bob, target_service = ChatRoot, scope = post)
ChatRoot.post(channel = "ops", text = "...", represented = AliceDelegation)
The service records:
invoker = Bob session reference
represented_subject = Alice, through bounded delegation
authority_to_call = ChatRoot
Bob has not become Alice. Audit and abuse handling can still identify Bob as
the actor while showing that Alice delegated a bounded representation. This
preserves the audit identity model the
User Identity and Policy proposal
already specifies for UserSession.auditContext: the invoker session reference
is the audit subject, and the represented-subject context is a separate facet
on the call.
Privacy
A delegated-subject capability must not disclose all source-subject facts. It should carry or vend only the facts the issuer allowed for the target service, such as:
- per-service display name;
- guest/operator class;
- a per-service audit pseudonym;
- a narrow claim such as “may approve invoice 123”.
It should not expose account-store records, external IdP claim bags, credential
identifiers, global principal ids, or unrelated profile attributes by default.
The default-private endpoint subject-disclosure rule introduced by the
session-bound milestone applies here too: explicit disclosure is opt-in per
method and bounded by the delegation’s disclosure_scope. See
User Identity and Policy proposal
for the broader privacy posture and
Session-Bound Invocation Context
for the implemented baseline.
Relationship To Capability Transfer
Capability transfer and subject delegation are different operations:
cap transfer only:
receiver gets authority to invoke;
receiver invokes as its own process session.
delegated subject context only:
receiver may present a represented subject;
no service method is callable unless receiver also holds a service cap.
cap transfer + delegated subject context:
receiver invokes the cap as its own process session;
service also sees the represented subject through explicit delegation.
The first implementation path should not depend on this proposal. Implement session-bound invocation context, transfer scopes, and shared-service migration first; add delegated subject context only after those rules are observable and reviewed. The session-bound prerequisites are landed (see Session-Bound Invocation Context and R2 in Design Risks Register); durable identity, ABAC/MAC, and broker maturity tracked under R14 of the same register are still proposal-shaped, so a delegated-subject implementation should not be selected until those mature far enough to give it a stable issuer.
Open Questions
- Whether the kernel should validate generic delegation metadata such as
target_serviceand expiry, or whether services should validate the delegation cap through a method call. - Whether delegated-subject caps are broker-owned, service-owned, or both.
- How revocation of delegated subject context composes with ordinary cap revoke/lease behavior.
- Whether the disclosure scope should be encoded as schema-specific facets or as a common metadata envelope.
- How
SessionInfo.delegationChain(session provenance) and a futureSubjectDelegation(per-call represented subject) compose without re-introducing ambient subject authority; the User Identity and Policy proposal owns the session-provenance side of that boundary.
Proposal: System Configuration and Operator Extensibility
Current operator-facing design authority now lives in Configuration. Manifest/startup authority lives in Manifest and Service Startup. This proposal is retained as the archival rationale and implementation history.
A small, layered CUE configuration model for the boot manifest that lets
operators extend the default boot (system.cue) without forking it,
unifies the host operator into a single principal regardless of which
authentication method they use, and moves the per-user toolchain cache
out of the repository root.
Problem
The default boot manifest (system.cue) and its focused-proof siblings
(system-spawn.cue, system-shell.cue, the various
system-ssh-*.cue, etc.) are each self-contained CUE files with a large
shared scaffold copy-pasted across them. Three concrete pain points
follow from that.
- No clean operator extension surface. An operator who wants to add
their own SSH public key, a second principal, or a different MOTD
has to edit
system.cuedirectly and carry that as a local diff againstmain. There is no documented “drop a small file, get an overlay” mechanism, so changes accumulate as untracked checkout-local state or get lost duringgit pull. - No host-user awareness. The default operator account in
system.cueis hardcoded asname="operator"/displayName="operator". The host user typingmake runsees a generic login identity, and adding their real SSH key requires manual conversion of the.pubfile into the manifest’s hex format. The build environment already knows the host user ($USER), the SSH key (~/.ssh/id_ed25519.pub), and the typical operator preferences; none of that information reaches the manifest. - Superseded cache default: the original implementation used
$(GIT_COMMON_DIR)/../.capos-tools, which created one pinned-tool cache per clone. The implemented default is now$(HOME)/.capos-toolsthroughCAPOS_TOOLS_ROOT, with per-version subdirectories such aslimine/<commit>/andcue/<version>/.
Adjacent design pressure: the SSH Shell Gateway milestone needs a
plausible answer to “where does the host operator’s SSH key go?” before
its run-target/init-mandate Gate D can close, and the local-users
backlog wants the host operator’s session to be a single account with
multiple authentication bindings (password, SSH key, future passkey)
rather than parallel operator/ssh-operator/passkey-operator seeds.
Design
The proposal is four small, independent moves that compose into one operator-facing extension surface.
1. Per-user toolchain cache
CAPOS_TOOLS_ROOT defaults to $(HOME)/.capos-tools instead of
$(GIT_COMMON_DIR)/../.capos-tools. The override path stays available
(set the variable explicitly to relocate). Existing per-version
subdirectories (limine/<commit>/, cue/<version>/, etc.) keep
multiple capOS clones from colliding on a single host. The first
make after the change repopulates the new path; the old in-repo
.capos-tools/ is left in place and can be removed manually.
Slice 2 must update every consumer that derives the pinned CUE path from the old default. At minimum:
tools/mkmanifest::expected_cue_pathvalidatesCAPOS_CUEagainst$(CAPOS_TOOLS_ROOT)/cue/<version>/bin/cue.tools/check-generated-adventure-content.shrecomputes the same path in shell and is invoked bymake generated-code-check. If the Makefile exportsCAPOS_CUEto the new path but the script recomputes the old one, the generated-code gate will reject the pinned CUE binary.
Any future tool that pins repo-selected helpers must follow
CAPOS_TOOLS_ROOT in lock step with the Makefile.
This change is independent of the rest of the proposal — it could
ship on its own — but it is bundled because the same operator-extension
narrative covers it: per-user state belongs in $HOME, not in the
repository.
2. cue/defaults/ package, packaged-default directory, and overlay shape
A new cue/defaults/defaults.cue declares package defaults and
exports #DefaultSystem capturing the shared scaffold. The
manifest decoder reads root-level schemaVersion, binaries,
initConfig, and kernelParams (with seed accounts, resource
profiles, authorized SSH keys, MOTD, UART config, and log level all
nested under kernelParams), so #DefaultSystem mirrors that exact
shape — final fields are at the document root, with kernelParams
holding the kernel-side config tree:
binariesdeclarations common to interactive bootsinitConfig.initandinitConfig.servicesskeletons for the password-login + anonymous-shell flowkernelParams.consoleUart/kernelParams.terminalUart/kernelParams.logLevelkernelParams.seedAccountswith a single canonical host-operator entry (32-byte fixedprincipalId)kernelParams.resourceProfileswith a single canonical operator resource profilekernelParams.motdkernelParams.authorizedSshKeys(empty by default)- A documented set of appendable extension inputs (see below)
that overlays use to extend lists. CUE list unification is
element-wise conflict, not concatenation; CUE v0.16 also rejects
the legacy
[a] + [b]list-arithmetic form, requiringlist.Concatfrom the standard library.
The repo’s cue.mod/module.cue declares module: "capos.local" with
language v0.16.0. The defaults package lives at cue/defaults/
and uses package defaults (not package capos) so the root
overlay can import it without a self-import.
The packaged default manifest stays at the repo root as system.cue,
declaring package capos. The overlay companion is system.local.cue
(repo root, package capos, gitignored). Focused-proof manifests
migrate independently to their own packages so they can import the
defaults package without joining package capos. Every repo-root
system-*.cue manifest now declares its own CUE package and imports
the defaults package, except system-paperclips.cue and
system-adventure.cue (demo-owned, package-less but still importing
defaults) and system-measure.cue (owned by the measure-mode-repair
plan and intentionally not migrated yet). See the Slice-3 inventory
table below for the full mapping.
Keeping the default manifest at the repo root preserves the current
embed_binaries contract — tools/mkmanifest resolves
binaries[].path relative to the manifest’s parent directory and
rejects .., so the manifest must live in a directory from which
existing repo-root-relative paths like init/target/... are
reachable. Moving the default into a subdirectory would force a
parallel binary-path-base change in mkmanifest; that is not worth
the additional surface for the value of co-locating the overlay.
// system.cue (repo root, packaged default)
package capos
import defaults "capos.local/cue/defaults"
_user: string | *"operator" @tag(user)
#Manifest: defaults.#DefaultSystem & {
user: _user
}
// Final manifest fields the decoder consumes are at document root.
// The decoder ignores any unused names like #Manifest.
schemaVersion: #Manifest.schemaVersion
binaries: #Manifest.binaries
initConfig: #Manifest.initConfig
kernelParams: #Manifest.kernelParams
The default MOTD value lives only in the defaults package
(motd: string | *_defaultMotd, where _defaultMotd is the
multi-line capOS welcome with chat/adventure shell hints — see
cue/defaults/defaults.cue). system.cue does not assign MOTD
itself, so a cue export .:capos without an overlay still resolves
to a complete value — two sibling string | *"..." defaults from
different files would unify to “incomplete” in CUE v0.16. An overlay
refines the field by declaring a concrete value (no *), which is
more specific than the default and wins under unification:
// system.local.cue (overlay)
package capos
#Manifest: kernelParams: motd: "Hi alice — capOS dev box."
tools/mkmanifest today invokes cue export <file> against a single
file path; CUE then loads only that file (plus its imports) and does
not unify other root files even when they share a package name. Slice
2 adds a --package <name> flag that switches mkmanifest to
cue export <dir>:<name> (where <dir> is the file’s parent and
<name> is capos). The Makefile passes --package capos only for
the default-boot recipe; focused make run-* targets keep
single-file mode and are not affected by the new packaged default.
Two Makefile changes are required for slice 2 to be safe:
- The
manifest.binrule’s prerequisites must include the defaults package (cue/defaults/*.cue) andsystem.local.cue(when it exists). Otherwise, edits to those files leave a stalemanifest.binandmake runboots the previous configuration. - Tag-dependent builds (
CAPOS_CUE_USER=$(USER)and optionalCAPOS_CUE_DISPLAY_NAME=...) must invalidate cachedmanifest.binwhen the tag value changes. The intended pattern is a sentinel file undertarget/whose contents record the tag values; the manifest rule depends on the sentinel, and the sentinel is regenerated whenever the CUE tag environment changes. Without this,make runafter amake run-smoke(different tag) silently boots the cachedoperator-tagged manifest.
3. @tag(user) injection contract
The host user name is injected into the manifest at cue export
time via a CUE tag. Because the manifest’s authoritative tag site
must be in a file that cue export actually reads, the tag is
declared in the root overlay file (system.cue), not the
imported defaults package. CUE evaluates tag attributes at the file
where they are declared.
The tag site is in the packaged default manifest file (system.cue
at the repo root, shown above) — that file declares
_user @tag(user) and threads it into defaults.#DefaultSystem via
the user field. The defaults package itself does not need a
@tag because tags are evaluated where they appear in the input.
// cue/defaults/defaults.cue (excerpt)
package defaults
import "list"
// Fixed 32-byte principal ID — manifest validation rejects shorter
// or longer values. Only display strings vary by host user; the
// audit-correlatable principal stays stable.
_canonicalOperatorPrincipalId: "local-operator-principal-default"
#DefaultSystem: {
user: string | *"operator"
schemaVersion: 1
binaries: [...] // shared list
initConfig: {...} // anonymous-shell flow
extraSeedAccounts: [...#SeedAccount] | *[]
extraResourceProfiles: [...#ResourceProfile] | *[]
extraAuthorizedSshKeys: [...#AuthorizedSshKey] | *[]
kernelParams: {
motd: string | *"capOS default boot. Type 'login' or 'setup'."
consoleUart: {...}
terminalUart: {...}
logLevel: string | *"debug"
seedAccounts: list.Concat([[{
name: user
displayName: userDisplayName
principalId: _canonicalOperatorPrincipalId
kind: "operator"
// ...
}], extraSeedAccounts])
resourceProfiles: list.Concat([[{
name: "default-operator-profile"
// ...
}], extraResourceProfiles])
authorizedSshKeys: extraAuthorizedSshKeys
}
}
tools/mkmanifest today invokes cue export <path> from Rust and
does not pass --inject / -t flags. Slice 2 adds a tag
pass-through: either a new mkmanifest --tag user=alice CLI option
that mkmanifest forwards to the underlying cue export, or — simpler
— mkmanifest reads environment tags and forwards each value as
--inject key=value. The Makefile sets CAPOS_CUE_USER=$(USER) for
make run only; mkmanifest derives displayName from that same
account’s passwd comment unless CAPOS_CUE_DISPLAY_NAME is explicitly
set. make run-smoke and CI-shaped targets leave them unset, so
untagged system.cue continues to see account=operator /
display=operator; focused smoke manifests may pin demo-specific
account fixtures independently.
@tag is the standard CUE pattern for build-time string injection
and is preferred over preprocessing the file with sed or generating
a wrapper file. It generalizes: future tags can carry hostname,
locale, timezone, or other build-environment-derived values without
adding more mechanisms.
4. system.local.cue overlay hook
The overlay file is system.local.cue at the repo root, declaring
package capos. It is gitignored explicitly. CUE in package mode
ignores files whose names start with ., so a leading-period
variant would not be loaded; the chosen filename has no leading dot.
In package mode (slice-2 mkmanifest invocation
cue export .:capos), CUE unifies every non-hidden *.cue file in
the directory that declares package capos — today that is just
system.cue; once the operator adds system.local.cue, both files
are unified automatically with no imperative include. Focused-proof
manifests are not picked up because migrated variants use their own
package names and unmigrated variants remain package-less.
A checked-in system.local.cue.example (repo root, package capos)
documents the supported extension shapes with worked examples. The
operator copies it to system.local.cue to activate.
Appendable extension inputs
CUE list unification is element-wise conflict, not concatenation, so
an overlay cannot extend the defaults’ seedAccounts or
authorizedSshKeys by re-assigning the same field. The defaults
package therefore exposes named extension lists that it concatenates
into the final manifest fields:
See the defaults excerpt above for the appendable inputs
(extraSeedAccounts, extraResourceProfiles,
extraAuthorizedSshKeys) and how they are concatenated via
list.Concat — the form [a] + [b] is rejected by CUE v0.16.
The overlay populates the extra* fields on #Manifest (which is
the named definition produced by the packaged-default file), never
the final lists:
// system.local.cue (repo root, gitignored, copied from .example)
package capos
#Manifest: extraAuthorizedSshKeys: [{
keyId: "host-laptop-ed25519-2026-04"
principalId: "local-operator-principal-default"
algorithm: "ssh-ed25519"
publicKey: "hex:..." // see how-to doc
fingerprintSha256: "..."
allowedShellProfiles: ["operator"]
source: "manifest"
comment: "host laptop"
}]
The principal id stays the fixed 32-byte canonical value — the
overlay does not derive a per-user principal id. Display strings
change with @tag(user); the audit-correlatable identity does not.
Worked extension scope (slices 2 and 3)
The overlay ships supporting these operator extensions:
- MOTD: re-declare
#Manifest.kernelParams.motdin the overlay with a concrete string. The default isstring | *"...", so a more concrete overlay value wins under CUE unification. - Console password verifier: override
#Manifest.kernelParams.consolePasswordVerifierPhc(Argon2id PHC string) so the development verifier shipped by the defaults package is replaced for any non-research deployment. - Extra SSH keys for the host operator: append to
extraAuthorizedSshKeyswithprincipalIdmatching the canonical operator. Multiple keys allowed. - Extra non-operator principals: append to
extraSeedAccountswithkind: "guest",kind: "service", or future kinds. Adding a secondkind: "operator"is not supported in slice 2 —kernel/src/cap/mod.rs::operator_seed_accountrejects manifests with more than one operator seed for password login. Multi-operator support is a separate change in the user-identity-and-policy track. - Extra resource profiles: append to
extraResourceProfilesfor custom quota templates referenced by extra accounts. - Extra boot binaries: append to
extraBinarieswithnameand repo-relativepath. The defaults package concatenates the list onto its_baseBinariessomkmanifestembeds the operator binary intomanifest.binalongside the default service set. - Extra init-launched services: append to
extraServiceswithname,binary(resolved againstbinaries),restart, and the cap graph the service should receive at spawn. The defaults concatenate operator extras after_baseServices, so init starts the operator service after the default chat server, remote-session gateway, and shell.
Task 4 closeout (2026-05-03 18:51 EEST): system.local.cue.example
covers every extension above. The plan calls for make run as the
verification target, but make run is interactive, so verification
ran make manifest (default MANIFEST_SOURCE=system.cue, package
mode --package capos) with the example copied to
system.local.cue. The package-mode rebuild emitted the operator
MOTD into manifest.bin (3 services, 12 binaries → 2551416 bytes,
log target/manifest-refreshed-example.log); rebuilding the same
target with the overlay absent produced 2553224 bytes, confirming
the operator MOTD overrode the defaults’ default value. make run-smoke was not a useful overlay verification because that
target builds manifest-smoke.bin from system-smoke.cue in
single-file mode (no --package flag, no sibling-file unification);
md5 of manifest-smoke.bin was identical with and without the
overlay file present.
The proposal does not generate the SSH hex/fingerprint conversion
in the Makefile — that lives in docs/configuration.md as a
short ssh-keygen -lf ~/.ssh/id_ed25519.pub + xxd/base64 -d
pipe. Keeping this manual avoids importing arbitrary host SSH keys
into the boot manifest by default.
5. Single-account-multi-auth invariant
The host operator is one account with potentially many authentication bindings:
- Password verifier — current
consoleCredentialPHC blob; bound to the host operator account by being declared at the same manifest scope (today there is no explicitprincipalIdreference in the credential record, but the kernel resolves the operator principal from the seed account at session-mint time). - SSH public keys — multiple records in
authorizedSshKeys, each carryingprincipalIdmatching the host operator’s seed account. - Future passkey/OIDC bindings — same pattern; the
user-identity-and-policy proposal already shows
ExternalIdentityBindingshaped this way.
The kernel’s operator_session_metadata already pulls the principal
from the manifest seed account when present (see
kernel/src/cap/session_manager.rs OperatorSeedAccount); the
hardcoded compatibility fallback fires only when no seed account is
declared. Once system.cue declares the host-operator seed account
explicitly, both password login and SSH public-key login mint a
session for the same principal. The AuthorityBroker.shellBundle
path is unchanged — it already routes through the AccountStore by
principal id (after the SSH AccountStore-bound auth slice landed at
commit 33100f4).
Importantly: this is not a kernel change. It is a manifest-shape
choice that makes the existing kernel resolution path the canonical
one. The bootstrap fallback (no seed account → hardcoded operator
principal) stays in place for focused proofs that intentionally test
the no-account-store path.
Migration Plan
| Slice | Scope | Risk |
|---|---|---|
| 1 (this) | Proposal + task ledger pointer + index entry. No code. | None. |
| 2 | Makefile (CAPOS_TOOLS_ROOT default, CAPOS_CUE_TAGS sentinel-file dependency for make run, manifest-rule prerequisites for the defaults package and system.local.cue); cue/defaults/defaults.cue; system.cue rewrite (stays at repo root, becomes package capos); system.local.cue.example (committed at repo root); tools/mkmanifest package-mode flag (--package capos switching to cue export <dir>:capos), tag pass-through, and updated expected_cue_path for the new tools-root default; docs/configuration.md; CLAUDE.md project-layout note. | Medium — touches Makefile, mkmanifest CLI surface, the default boot manifest, and adds a new package directory. Smoke harness assertions on principal=operator must keep passing because slice 2 leaves the default tag at operator. |
| 3 | Migrate focused-proof variants onto the defaults package. Closed at commit a50f610d (2026-05-03 21:54 UTC): Task 2 migrated the owned set (see the Slice-3 inventory table below), Task 3 tightened the manifest decoder to reject unknown root fields with regression tests at commit f3d89757 (see the Slice-3 Task-3 closeout below), Task 4 refreshed system.local.cue.example and docs/configuration.md to cover every defaults-package extension hook, and Task 5 stamped this status header, the task ledger System Configuration ad-hoc bullet, and the docs/changelog.md entry. One commit per variant or grouped by audit area. | Low per variant once slice 2 is in. Coordinated with parallel agents to avoid worktree collisions. |
| 4 | Add mkmanifest cue-to-capnp, a general host-side conversion path for CUE-authored data messages rooted at a caller-specified Cap’n Proto struct. The tool reuses the slice-2 CUE package/tag machinery, validates both CAPOS_CUE and CAPOS_CAPNP against the pinned per-user tool cache, checks cue version v0.16.0 and Cap'n Proto version 1.2.0, passes import paths through safe Command arguments, and writes the converted binary only after capnp convert json:binary succeeds. | Low for boot behavior because the existing manifest pipeline is unchanged. Medium host-tool risk because schema, CUE, and JSON are hostile inputs; the implementation delegates Cap’n Proto type rules to the pinned upstream converter and keeps filesystem/process boundaries explicit. |
Slice-3 manifest inventory
The table below records the migration state of every repo-root
system-*.cue manifest at the slice-3 Task-2 closeout. “Imports
defaults” means the file declares a CUE package and pulls in
capos.local/cue/defaults. “Migration shape” distinguishes between
manifests that unify the full defaults.#DefaultSystem scaffold (and
inherit MOTD, seed accounts, resource profiles, the base service graph,
etc.) and focused-proof manifests that intentionally reference the
defaults package only as a constant lookup for schemaVersion,
logLevel, and UART configuration. Both shapes are valid migration
targets — focused proofs need a narrow cap graph and cannot inherit the
default service tree.
| Manifest | Package | Imports defaults | Migration shape | Driven by |
|---|---|---|---|---|
system.cue | capos | yes | full scaffold | make run, make remote-session-ui |
system-spawn.cue | spawn | yes | constant lookup | make run-spawn |
system-shell.cue | shell | yes | constant lookup | make run-shell |
system-terminal.cue | terminal | yes | constant lookup | make run-terminal |
system-credential.cue | credential | yes | constant lookup | make run-credential |
system-login.cue | login | yes | full scaffold | make run-login |
system-login-setup.cue | loginsetup | yes | full scaffold | make run-login-setup |
system-local-users.cue | localusers | yes | constant lookup | make run-local-users |
system-revocable-read.cue | revocableread | yes | constant lookup | make run-revocable-read |
system-memoryobject-shared.cue | memoryobjectshared | yes | constant lookup | make run-memoryobject-shared |
system-restricted-shell-launcher.cue | restrictedshelllauncher | yes | constant lookup | make run-restricted-shell-launcher |
system-chat.cue | chat | yes | full scaffold | make run-chat |
system-smoke.cue | smoke | yes | full scaffold | make run-smoke, make run-diagnostics, make run-iommu-acpi, make run-acpi-pcie, make run-net, make run-uefi, make run-pci-nvme, make run-ringtap-failing-call |
system-session-context.cue | sessioncontext | yes | constant lookup | make run-session-context |
system-ipc-zerocopy.cue | ipczerocopy | yes | constant lookup | make run-ipc-zerocopy |
system-service-object-routing.cue | serviceobjectrouting | yes | constant lookup | make run-service-object-routing |
system-tcp-listen-authority.cue | tcplistenauthority | yes | constant lookup | make run-tcp-listen-authority |
system-capnp-chat-interop.cue | capnpchatinterop | yes | constant lookup | make run-capnp-chat-interop-vm |
system-thread-scale.cue | threadscale | yes | constant lookup | make run-thread-scale |
system-smp-process-scale.cue | smpprocessscale | yes | constant lookup | make run-smp-process-scale |
system-remote-session-capset-interop.cue | remotesessioncapsetinterop | yes | constant lookup | make run-remote-session-capset-interop-vm |
system-remote-session-adventure-interop.cue | remotesessionadventureinterop | yes | constant lookup | make run-remote-session-adventure-interop-vm |
system-ssh-host-key.cue | sshhostkey | yes | constant lookup | make run-ssh-host-key |
system-ssh-authorized-key.cue | sshauthorizedkey | yes | constant lookup | make run-ssh-authorized-key |
system-ssh-public-key-session.cue | sshpublickeysession | yes | constant lookup | make run-ssh-public-key-session |
system-ssh-public-key-auth.cue | sshpublickeyauth | yes | constant lookup | make run-ssh-public-key-auth |
system-ssh-feature-policy.cue | sshfeaturepolicy | yes | constant lookup | make run-ssh-feature-policy |
system-paperclips.cue | none | yes | demo-owned scaffold use | make run-paperclips |
system-adventure.cue | none | yes | demo-owned scaffold use | make run-adventure |
system-measure.cue | none | no | unmigrated; owned by measure-mode-repair plan | make run-measure |
system-paperclips.cue and system-adventure.cue are demo-owned and
not part of the slice-3 conflict surface. They already pull
#DefaultSystem for the operator account fixture but stay package-less
because their make run-* targets predate the package-mode flag.
Migrating them onto a package paperclips / package adventure shape
is a follow-up coordinated through the demo plans rather than slice 3.
system-measure.cue waits for docs/backlog/scheduler-evolution.md to
close, then can be migrated in its own batch.
All manifests added after the Slice-3 closeout (C payload manifests,
DDF grant manifests, hardware-audit variants, POSIX adapter smokes, WASI
smokes, wasm-host, thread-fairness variants, scheduler/scheduling-context,
limit proofs, and remote-session variants) follow the same convention: each
declares its own CUE package and imports capos.local/cue/defaults. The
table above is a Slice-3 migration snapshot; it is not exhaustive of all
current repo-root system-*.cue files.
Slice-3 Task 3 closeout
Closed 2026-05-03 20:22 UTC at commit f3d89757. The
SystemManifest CUE decoder
(capos-config/src/manifest.rs) now validates the document root
against an explicit allow-list and returns
Error::UnknownField { path, field, expected } for any other top-level
name. The accepted set lives in the decoder
(SYSTEM_MANIFEST_ROOT_FIELDS) and is schemaVersion, binaries,
initConfig, kernelParams — adding a future field is a deliberate
edit to that list. Two host-side tests in
capos-config/src/manifest.rs (system_manifest_rejects_unknown_root_field
and system_manifest_accepts_only_known_root_fields) pin both the
rejection path and the positive case so a regression is caught by
cargo test-config before any QEMU run. The Cap’n Proto schema for
SystemManifest is closed by construction, so the strictness check
only needs to live at the CUE/JSON boundary; capnp decode paths
remain unchanged. The slice-3 inventory above guarantees that every
owned focused-proof manifest already projects only those four fields
at the document root, so the rule does not break any migrated
manifest. docs/configuration.md records the operator-facing
behavior of the new error.
Slice 2 is intentionally minimal so that any breakage shows up on the
default make run / make run-smoke path immediately, rather than
hidden behind a fan-out of converted variants.
Slice 4 deliberately does not make CueValue universal. CueValue remains the
project-defined generic tree used inside SystemManifest.initConfig.
The general converter has a different contract:
mkmanifest cue-to-capnp \
[--package capos] [--tag key=value ...] \
[--import-path schema ...] [--no-standard-import] \
input.cue schema/example.capnp Example output.bin
input.cue is exported as JSON, then the pinned Cap’n Proto tool validates
that JSON against schema/example.capnp and root struct Example. This covers
normal Cap’n Proto data fields, nested structs, lists, enums, unions, defaults,
and imports according to upstream capnp convert semantics. It does not
serialize live capOS capability table entries or meaningful Cap’n Proto
interface objects; authority still travels through capOS capability transfer
mechanics, not through JSON-authored data files.
Cross-References
- Manifest and Service Startup
— describes the CUE evaluation, boot manifest build, and general
cue-to-capnphost-tool flow that this proposal extends. - Local Users, Storage, and Policy — Gate 1 manifest-seeded accounts; this proposal shapes the default manifest’s seed account to match the single-account-multi-auth invariant the backlog calls for.
- Run Targets, Init Mandate, and Default-Run Integration
— Gate D (default-
make runintegration); this proposal makes Gate D closure for the SSH milestone tractable by giving the default manifest a clean place to absorb optional services and authorized keys. - SSH Shell Gateway — consumes the host-user authorized-key surface in a future slice once OpenSSH transport gates land.
- User Identity and Policy
— defines the principal/account/session model and
ExternalIdentityBindingshape that this proposal’s single-account-multi-auth invariant relies on. Multi-operator support is tracked there. - Service Architecture
— primary consumer of the layered manifest:
initConfig.servicesand theextraServicesextension hook described above feed the authority-at-spawn service graph. The defaults package owns the base service tree (chat server, remote-session gateway, shell); overlays append operator-owned services without forking it. - Userspace Binaries
— defines the binary set the layered manifest embeds. The
binaries/extraBinariesshape covers native Rust capos-rt binaries, libcapos C-substrate binaries, the POSIX adapter binaries, and the wasm-host binary uniformly; per-language payload conventions (for example, the wasm-host’s stablewasi-payloadmanifest name) are documented there. - POSIX Adapter
— POSIX adapter smokes (
make run-posix-dns-smoke,make run-posix-pipe-smoke,make run-posix-stdio-smoke) are driven by focusedsystem-posix-*.cuemanifests that live in the same package-mode/overlay regime as the rest of the migrated manifest set. Operator-installable POSIX-ported services attach throughextraBinaries/extraServicesand inherit the same authority-at-spawn grants the default service tree uses. - WASI Host Adapter
— per-instance text grants (
initConfig.init.wasiArgs,initConfig.init.wasiEnv) are CUE-authored manifest fields that flow through this proposal’s package-mode evaluation; the manifest-decoder strictness invariant closed in Slice 3 Task 3 is the same gate that catches mistyped WASI argv/env field names before a payload boots. - System Info Capability — adjacent precedent for “rename + structural cleanup + worked Phase 2”; this proposal adopts the same status-header and cross-reference shape.
- Trusted Build Inputs —
needs entries for the new
cue/defaults/defaults.cue, thesystem.local.cueoverlay surface, theCAPOS_CUE_TAGSenvironment variable (and thetarget/-side sentinel that records it), and the host$USERvalue injected via@tag(user)— all become trusted boot-manifest inputs once slice 2 lands.
Non-Goals
- This proposal does not auto-ingest
~/.ssh/id_ed25519.pubinto the manifest. Thesystem.local.cue.exampleshows how the operator ingests their key explicitly. Auto-ingestion is a separate decision that has security implications (which keys count? how is the hex/fingerprint conversion validated?) and should not be bundled with the configuration-shape change. - This proposal does not auto-start
ssh-gatewayinsystem.cue. The SSH gateway service is added when its OpenSSH transport gates close (decomposed indocs/backlog/runtime-network-shell.md). Until then, an authorized SSH key declared insystem.local.cueis plumbing-only. - This proposal does not introduce a CUE-level imperative
“include if file exists” mechanism. CUE’s same-package unification
already provides the overlay behavior; the operator’s only action
is to drop a file with the right
package caposheader. - This proposal does not define a remote operator-extension
delivery channel (cloud-metadata, fleet config). Those are
addressed by
cloud-metadata-proposal.mdand stay separate.
Open Questions
- Whether
principalIdshould ever follow the host user. This proposal fixesprincipalIdat 32 bytes (local-operator-principal-default) so audit history is stable even if$USERchanges. A future per-user-derived principal id would need a deterministic, validated 32-byte derivation and a rollover plan; that is out of scope here. - Where
system.local.cuelives. This proposal places it at the repo root next tosystem.cue. That scopes the overlay to the samepackage caposCUE loads in package-mode export, keeps binary path resolution unchanged, and is gitignored cleanly. Focused-proof manifests are not picked up bypackage caposexport because migrated variants use separate package names and unmigrated variants declare no package directive — so this is settled. - Whether to migrate focused proofs to the defaults package.
Slice 3 assumes yes because it removes copy-paste, but each variant
must keep its proof shape and checks. The Slice-3 inventory table
above records the migration state for every repo-root
system-*.cuemanifest. The intentionally divergentsystem-measure.cueis left for a follow-up batch keyed off the measure-mode-repair plan. - Tag injection for
run-shell/run-terminal/ focused interactive proofs. Slice 2 only wiresmake run. Ifmake run-shellshould also personalize, slice 3 adds it; if focused proofs should always useoperator, slice 3 leaves them alone.
Proposal: Cryptography and Key Management
Capability-native abstractions for cryptographic keys and key sources. Keys are capability objects; key material never crosses cap boundaries. One interface serves every consumer — volume encryption, TLS, code signing, instance identity, authenticated backups, per-service secrets.
Implementation Status
This proposal is partially implemented. schema/capos.capnp now contains the
minimal SymmetricKey, PrivateKey, and PublicKey ABI plus a RAM-only
KeyVault subset needed by the TLS/ACME precursor. capos-tls provides
host-tested RAM-only XChaCha20 plus HMAC-SHA256 authenticated encryption,
HMAC-SHA256 MAC/verify, and P-256 signing cores. A development-only software
KeySource bootstrap now mints TLS and ACME
account key handles for local proofs, labels the source as non-production, and
is rejected by production/public profiles. The implemented key surface requires
an explicit requested KeyPurpose, exports only public material (spkiDer and
P-256 JWK for ACME account JWS registration), lists non-secret vault/source
metadata, and has no raw symmetric or private-key export surface. There is
still no runtime key service, persistence,
hardware/cloud custody, symmetric-key derivation or wrapping, ACME protocol,
TLS server handshake, or production KeySource.
The first implementation chain is the narrow TLS/ACME precursor owned by Certificates / TLS:
crypto-privatekey-publickey-ram-signing-local-proof– done 2026-06-04: minimalPrivateKey/PublicKeyschema and RAM signing proof for TLS server keys and ACME account JWS keys.crypto-keyvault-ram-privatekey-custody-local-proof– done 2026-06-05: RAM-onlyKeyVaulthandles for those private keys, with generation, open/list/destroy, purpose separation, and stale-handle failure.crypto-development-keysource-tls-acme-bootstrap-local-proof– done 2026-06-05: development-only softwareKeySourcebootstrap for local TLS/ACME proofs, rejected for production/public profiles.
That precursor intentionally excludes persistent storage, TPM, cloud KMS, passphrase/passkey unlock, raw private-key import, ACME protocol, and TLS server handshakes. It also tightens the TLS/ACME invariant: raw private-key material is not written to manifests, boot images, logs, task records, or evidence.
The capability-infrastructure reconciliation
(cap-infra-crypto-key-caps-phase1-reconcile-local-proof, done 2026-06-06)
added the minimal RAM-only SymmetricKey ABI and local proof for XChaCha20 plus
HMAC-SHA256 authenticated encryption and HMAC-SHA256 MAC/verify. It follows the
same RAM-only rule for symmetric key bytes and adds no key export, persistence,
wrapping, or production custody.
Problem
Nearly every forthcoming capOS subsystem wants cryptography. A partial list:
- Volume encryption at rest (Volume Encryption).
- TLS termination in the web text shell gateway (Boot to Shell).
- Inter-service mTLS on a multi-host capability graph (Networking).
- Instance identity tokens (signed JWTs) produced from cloud hypervisor metadata (Cloud Metadata).
- WebAuthn/passkey public-key verification for login.
- Signed audit logs (System Monitoring).
- Signed boot manifests and measured boot (Storage and Naming Open Question #5).
- Cloud KMS integration (envelope encryption for volumes and object stores).
- Future: signed release artifacts, encrypted swap, session tokens.
Without a shared abstraction each of these invents its own key
interface, its own “where does the key live” story, and its own audit
trail. That is how Linux ended up with dm-crypt, fscrypt,
keyctl, PKCS#11, ssh-agent, gpg-agent, systemd-creds, TPM
tools, and cloud-specific SDKs as mutually-unaware silos. capOS is
young enough to avoid that.
Design Principle: Keys Are Capabilities
In every Unix-lineage system, a key is a byte string — a secret stored somewhere (keyring, file, memory, HSM handle), protected by a mechanism orthogonal to the system’s main abstractions (syscalls + files + processes). Every new subsystem therefore invents a new protection mechanism.
In capOS, a key is a capability object. Holding a SymmetricKey or
PrivateKey cap means “you may compute with this key.” It does not
mean “you may see this key.” Key material lives in the address space
of the service that implements the cap; callers reach it by invoking
methods.
Consequences:
- Attenuation falls out of the capability model. A decrypt-only
SymmetricKeyis a wrapper CapObject that rejectsencrypt. A key bound to a single AAD domain is a wrapper that fixes theaadargument. A sign-onlyPrivateKeyis a wrapper that rejectsdecrypt. No new kernel mechanism is needed. - Revocation is a cap drop. Drop the cap, the key is gone from that holder’s reach. Other holders are unaffected.
- Audit is intrinsic. Every method invocation can flow through an
audit cap. A malicious service granted
decryptauthority generates audit records for every use; it cannot exfiltrate the raw key material silently. - Hardware isolation composes cleanly. A TPM-backed key service
implements the same
PrivateKeyinterface as an in-process software key service; callers cannot distinguish, and should not need to.
A service granted a SymmetricKey with both encrypt and decrypt
can still run arbitrary oracle queries against the key. That is
weaker than “the key material never leaves an HSM” and stronger than
“the key is a byte string in the process heap.” When stronger
containment is required, the key service is a thin process sitting on
top of a hardware primitive (TPM, Secure Enclave, cloud KMS).
Schemas
Symmetric keys
interface SymmetricKey {
# Authenticated encryption. The Phase-1 RAM implementation supports
# `xchacha20HmacSha256` only: XChaCha20 stream encryption with HMAC-SHA256
# authentication. It generates a fresh nonce internally and returns
# ciphertext plus tag separately so callers cannot choose nonce reuse.
encrypt @0 (plaintext :Data, aad :Data, purpose :KeyPurpose)
-> (ciphertext :Data, nonce :Data, tag :Data);
# Authenticated decryption. `aad`, `nonce`, and `tag` must match the values
# from `encrypt`; failures return an application error, not plaintext.
decrypt @1 (ciphertext :Data,
nonce :Data,
tag :Data,
aad :Data,
purpose :KeyPurpose)
-> (plaintext :Data);
# MAC-only modes for keys with `KeyPurpose.integrity`.
mac @2 (message :Data, purpose :KeyPurpose) -> (tag :Data);
verify @3 (message :Data, tag :Data, purpose :KeyPurpose) -> (ok :Bool);
info @4 () -> (algorithm :SymmetricAlgorithm,
purpose :KeyPurpose,
identifier :Data);
}
enum SymmetricAlgorithm {
aes256Gcm @0;
aes256GcmSiv @1;
xchacha20Poly1305 @2;
aes256Xts @3; # block-device only; no authentication
hmacSha256 @4; # mac/verify only
hmacSha384 @5;
hmacSha512 @6;
xchacha20HmacSha256 @7; # landed local proof construction
}
Subkey derivation and key wrap/unwrap remain outside the landed Phase 1 ABI.
Later slices that add them must allocate new method ordinals after info @4
instead of reusing the Phase 1 slots.
Asymmetric keys
interface PublicKey {
# Verify only for the requested purpose. A public key derived from a
# TLS certificate key rejects an ACME account verification request, and
# vice versa.
verify @0 (message :Data,
signature :Data,
scheme :SignatureScheme,
purpose :KeyPurpose)
-> (ok :Bool);
# Export raw public material (SPKI DER, JWK, OpenSSH, PGP) for
# callers that need to distribute it. Public material is freely
# shareable; the cap itself is an authority only to invoke
# methods, not to "own" the public key.
export @1 (format :PublicKeyFormat) -> (encoded :Data);
info @2 () -> (algorithm :AsymmetricAlgorithm,
purpose :KeyPurpose,
identifier :Data);
}
interface PrivateKey {
# Sign only for the requested purpose. The first implementation accepts
# P-256 with `default` / `ecdsaSha256` and rejects other schemes.
sign @0 (message :Data,
scheme :SignatureScheme,
purpose :KeyPurpose)
-> (signature :Data);
public @1 () -> (pk :PublicKey);
info @2 () -> (algorithm :AsymmetricAlgorithm,
purpose :KeyPurpose,
identifier :Data);
}
enum AsymmetricAlgorithm {
ed25519 @0;
x25519 @1;
p256 @2;
p384 @3;
rsa2048 @4;
rsa3072 @5;
rsa4096 @6;
# Post-quantum placeholders; added as capOS ships them.
mlKem768 @7; # ML-KEM (Kyber) for KEM
mlDsa65 @8; # ML-DSA (Dilithium) for signatures
}
enum SignatureScheme {
default @0; # algorithm's natural default (Ed25519 pure, RSA-PSS, etc.)
ecdsaSha256 @1;
ecdsaSha384 @2;
rsaPssSha256 @3;
rsaPssSha512 @4;
rsaPkcs1Sha256 @5; # for compatibility only
}
enum PublicKeyFormat {
spkiDer @0;
jwk @1;
opensshWire @2;
pgpPacket @3;
}
Shared metadata
enum KeyPurpose {
generic @0;
blockVolume @1;
objectStore @2;
envelope @3; # KEK — only wraps/unwraps
integrity @4; # MAC-only
tls @5;
codeSigning @6;
instanceIdentity @7;
authToken @8; # session tokens, JWTs
webauthn @9;
audit @10;
oauthClientAssertion @11; # RFC 7523 private_key_jwt client auth
oidcIdToken @12; # IdP-side ID token signing (LocalIdentityProvider)
dpopBinding @13; # RFC 9449 proof-of-possession keypairs
acmeAccount @14; # RFC 8555 account JWS signing
}
identifier (bytes in info()) is an opaque, stable handle usable
for logging, correlating audit records, and looking up the key in a
KeyVault. It is not a secret. It is not a cryptographic hash of the
key (that would let an attacker confirm a guessed key); it is a
random ID chosen at key creation.
Key sources
A KeySource produces keys given some unlock context. Different
implementations realize different trust models.
interface KeySource {
# Produce a key given an unlock context (passphrase bytes, a
# passkey assertion, a sealed blob, an attestation report, empty
# for sources that hold keys directly).
unlockSymmetric @0 (context :Data, purpose :KeyPurpose)
-> (key :SymmetricKey);
unlockPrivate @1 (context :Data, purpose :KeyPurpose)
-> (key :PrivateKey);
# Seal a key under this source's policy. The returned blob can be
# stored in the clear; unlock will refuse to produce the key
# unless its policy is satisfied.
sealSymmetric @2 (key :SymmetricKey, policy :SealPolicy)
-> (blob :Data);
sealPrivate @3 (key :PrivateKey, policy :SealPolicy)
-> (blob :Data);
# Rewrap: unseal under current policy, reseal under new policy.
# Used for KEK rotation without touching the underlying key.
rewrap @4 (blob :Data, newPolicy :SealPolicy) -> (newBlob :Data);
info @5 () -> (kind :KeySourceKind, identifier :Data);
}
enum KeySourceKind {
manifestEmbedded @0; # dev/CI only
passphrase @1;
passkeyPrf @2; # WebAuthn PRF extension
tpm2 @3;
secureEnclave @4;
cloudKms @5;
attestation @6; # SEV-SNP / TDX / Nitro
network @7; # Tang/Clevis-style
softwareStored @8; # encrypted-at-rest in a KeyVault
oidcFederated @9; # OIDC AccessToken -> KMS / remote unlock, no baked creds
}
struct SealPolicy {
union {
none @0 :Void;
pcr @1 :PcrPolicy;
kms @2 :KmsPolicy;
attested @3 :AttestationPolicy;
composite @4 :List(SealPolicy); # AND of sub-policies
tokenExchange @5 :TokenExchangePolicy; # OIDC/OAuth2-gated unlock
}
}
struct TokenExchangePolicy {
# The OIDC issuer whose tokens satisfy this policy.
issuer @0 :Text;
# Required token audience (the KMS / STS endpoint).
audience @1 :Text;
# Required subject predicate. Union allows exact or pattern matches
# without growing this struct; see oidc-and-oauth2-proposal for the
# full pattern grammar.
subjectPattern @2 :Text;
# Additional required claims (e.g. `groups`, tenant ID, attestation
# fields). Values are JSON-encoded bytes.
requiredClaims @3 :List(NamedClaim);
# Acceptable LoA levels mapped from `acr`/`amr`.
minAuthStrength @4 :UInt8;
}
struct NamedClaim {
name @0 :Text;
value @1 :Data;
}
struct PcrPolicy {
pcrMask @0 :UInt32; # bitmap of PCR indices
pcrDigest @1 :Data; # expected composite digest
bank @2 :TpmHashBank;
}
struct KmsPolicy {
provider @0 :Text; # "aws", "gcp", "azure", "vault", ...
keyId @1 :Text;
grantTokens @2 :List(Text);
}
struct AttestationPolicy {
platform @0 :AttestationPlatform;
measurement @1 :Data;
signerPublicKey @2 :Data;
allowedVariant @3 :List(Data); # e.g. permitted firmware versions
}
enum AttestationPlatform {
sevSnp @0;
tdx @1;
nitro @2;
}
Key lifecycle — the KeyVault
A KeyVault is a stateful service that stores key material, issues
key handles, handles rotation, and emits audit events. It is distinct
from KeySource: a KeySource is a factory producing keys; a
KeyVault is a registry tracking the keys a deployment knows
about. The schema below is the landed RAM-only TLS/ACME subset. Future
symmetric-key, import, seal-policy, unlock, persistence, and rotation methods
append to this interface; they do not renumber the landed methods.
enum KeyMaterialSource {
ramGenerated @0;
imported @1;
keySource @2;
}
interface KeyVault {
generatePrivate @0 (
algorithm :AsymmetricAlgorithm,
purpose :KeyPurpose,
createdAtEpochSeconds :UInt64,
auditLabel :Text
) -> (handle :KeyHandle, key :PrivateKey);
openPrivate @1 (handle :KeyHandle) -> (key :PrivateKey);
list @2 (filter :KeyFilter) -> (entries :List(KeyEntry));
destroy @3 (handle :KeyHandle, reason :Text) -> ();
}
struct KeyHandle {
identifier @0 :Data;
generation @1 :UInt64;
}
struct KeyEntry {
handle @0 :KeyHandle;
algorithm @1 :AsymmetricAlgorithm;
purpose @2 :KeyPurpose;
createdAtEpochSeconds @3 :UInt64;
lastUsedEpochSeconds @4 :UInt64;
source @5 :KeyMaterialSource;
auditLabel @6 :Text;
}
struct KeyFilter {
purposes @0 :List(KeyPurpose); # OR
algorithms @1 :List(AsymmetricAlgorithm); # OR
}
Concrete Key Sources
Not all of these ship on day one. Phases below give a sequence.
ManifestEmbeddedKeySource — development and CI only
Key material baked into SystemManifest. Unsealable. Boot-time
validation refuses to build a production-profile image against this
source. Used for QEMU smoke tests and hermetic CI.
Do not use manifest-embedded raw private keys for the TLS/ACME precursor chain. Those local proofs use a development-only software source that generates key handles at boot instead, so private key material does not enter manifests, images, logs, task records, or evidence.
PassphraseKeySource — interactive unlock
Consumes a passphrase from the console login flow (Boot to Shell), runs Argon2id with per-source parameters, derives a KEK, unwraps sealed blobs. No persistent state beyond the salt and KDF parameters (which are public).
PasskeyPrfKeySource — session unlock from WebAuthn
Consumes a WebAuthn assertion whose hmac-secret / PRF extension
yields a per-credential symmetric secret. Derives a KEK from the PRF
output; KEK unwraps the user’s sealed DEK. Key material never leaves
the authenticator; the PRF output never leaves the key service
process.
Tpm2KeySource — hardware-bound, measured-boot-gated
A TPM 2.0 driver service holds the TPM; this source wraps it. Seal policies bind keys to PCR digests; unseal succeeds only if the running boot chain matches. Enables unattended boot while keeping the key off the disk.
SecureEnclaveKeySource — platform key stores
Analog for Apple Secure Enclave, Android StrongBox, Intel CSE. Same
interface shape as Tpm2KeySource; different backing primitive.
CloudKmsKeySource — cloud envelope encryption
Wraps a cloud KMS (AWS KMS, GCP KMS, Azure Key Vault, HashiCorp
Vault, KMIP). Unlock calls the KMS Decrypt operation with a wrapped
DEK and returns the plaintext DEK as a SymmetricKey cap. Seal calls
KMS Encrypt under a named KEK.
Authentication to KMS uses the InstanceIdentity cap from
Cloud Metadata; no
long-lived credentials live in the capOS image.
Properties the system gets by following the envelope pattern:
- Free KEK rotation (rewrap the DEK; volume data is untouched).
- Revocation by disabling the KMS key or revoking the IAM grant.
- Cross-account / cross-region access via KMS grants.
- Every unwrap appears in the cloud provider’s audit log — observability comes for free.
AttestationKeySource — confidential computing
Consumes SEV-SNP, TDX, or Nitro attestation reports. unlock submits
the report to a remote verifier (often cloud KMS with attestation
policy) which returns the unwrapped DEK only if the report matches
an approved measurement. Enables “only this specific capOS image,
running on genuine attested hardware, can decrypt this volume.”
NetworkKeySource — Tang / Clevis-style
Unlock derives a key by interacting with one or more remote servers; no single server sees the plaintext key (when combined with secret sharing). Supports the “revoke access by taking the server offline” model without physical-access requirements.
SoftwareStoredKeySource — encrypted on disk, under another source
The recursive case: a source whose seal policy points at another source. Used to compose, e.g., a file-backed key store encrypted under a TPM-sealed master key. The outer source provides integrity (TPM seal); the inner source provides convenience (named key lookup).
OidcFederatedKeySource — token-exchange-gated unlock
Derives a key from a short-lived OIDC/OAuth2 access token. The source
holds an OAuthClient or WorkloadIdentityFederation cap (from
OIDC and OAuth2). unlock
obtains a fresh token for the configured audience — either by
exchanging a local InstanceIdentity JWT, a Kubernetes projected
service-account token, or a user session’s access token — then
presents it to a remote KMS / STS / custom key service which
returns the wrapped DEK.
Two common shapes:
- Cloud KMS with workload identity federation. Audience is the cloud STS; after token exchange the resulting cloud credential calls KMS Decrypt. Replaces every baked long-lived cloud IAM credential in the image.
- Per-user volume. Audience is a capOS-internal key service;
the user’s
AccessTokencap proves the caller is Alice; the key service enforcesTokenExchangePolicyand returns Alice’s DEK.
Properties the envelope + token-exchange pattern gets the system:
- No long-lived credentials in any capOS image.
- Per-principal KMS audit (the token
subappears in every KMS decrypt log). - Revocation by IdP account disable, token revocation, or KMS grant removal.
- Step-up authentication gating: a
TokenExchangePolicyrequiringminAuthStrength >= loa3means Alice must have MFA-backedacr/amrclaims before her volume unlocks.
Consumers
A non-exhaustive list of how this interface is meant to be used. Each consumer either exists as a proposal or is called out as future work.
| Consumer | Interface | Key source |
|---|---|---|
EncryptedBlockDevice | symmetric | any |
EncryptedNamespace | symmetric | passphrase / passkeyPrf / KMS |
| TLS termination (web gateway) | both | passphrase / KMS / cloud certs |
| SSH host key signing | private | KeyVault / softwareStored / KMS |
| SSH public-key login | public | CredentialStore / authorized key store |
| mTLS between services | both | KeyVault with KMS seal |
| Instance identity JWT signing | private | cloudKms / softwareStored |
| Signed audit logs | private | KeyVault, append-only policy |
| WebAuthn verification | public | CredentialStore (public keys) |
| Signed boot manifests | public | public key baked into firmware |
| Encrypted swap | symmetric | per-boot ephemeral (in-RAM) |
| Encrypted backups | symmetric | dedicated KMS key |
| Session tokens (HMAC) | symmetric | KeyVault, rotated frequently |
Relationship to CredentialStore
The CredentialStore in
Boot to Shell stores
verifiers — WebAuthn public keys, password hashes, recovery codes.
Its job is authentication: matching a claim from a user against a
stored verifier.
The KeyVault proposed here stores keys — symmetric DEKs,
signing private keys, KEKs. Its job is cryptography: producing keys
for use by capOS services.
Overlap happens at passkey unlock: the CredentialStore verifies the
WebAuthn assertion; the resulting PRF output feeds a
PasskeyPrfKeySource that produces a SymmetricKey usable by
EncryptedNamespace. Two services, one flow.
Keeping these distinct matters because their audit, retention, and
exposure models differ. A CredentialStore can expose every stored
entry as metadata (public keys are public) without leaking secrets; a
KeyVault cannot. A deployment may want different replication,
backup, and recovery policies for authenticators vs. encryption keys.
Threat Model
Separate from the consumer-specific threat models, the crypto/key management service itself has these:
- Memory scraping of a live key service. The service holds
plaintext keys in RAM. Mitigation: small trusted-computing-base
(one crate, audited),
mlockthe heap (no swap leakage), zeroize on drop, no panic-induced core dumps, cap-scoped access so only callers with aKeycap can trigger operations. Against a kernel exploit, no defense; that is a separate threat. - Oracle abuse. A malicious service granted a
SymmetricKeycap uses it as a decryption oracle. Mitigation: granting callers attenuated caps (decrypt-only,aad-pinned). Audit records make abuse detectable. - Side-channel leakage. Timing, cache, power. Mitigation: use
constant-time implementations (
aescrate’s hardware backend;chacha20poly1305crate is constant-time), prefer AEAD modes that resist nonce-reuse gracefully (GCM-SIV), avoid bespoke crypto. - Downgrade attacks on algorithm selection. A caller requests a
weak algorithm on a key that supports stronger modes. Mitigation:
info()records the canonical algorithm;KeyPurposeconstrains the method set; algorithm negotiation is the caller’s job, not a feature of the key cap. - Key persistence in unintended places. Kernel DMA buffers, swap, crash dumps, core files. Mitigations are deployment-level (no swap, or encrypted swap with a per-boot key; disable core dumps for the key service process; measure the boot chain so a tampered kernel is detectable).
Phases
Phases align with the subsystems that need keys. Crypto primitives come first; consumers follow their own proposals’ phases.
Future asymmetric-key methods such as public-key encryption, private-key decryption, and key agreement append after this implemented subset in later slices.
Phase 1 — Interfaces and RAM-only implementation
- Landed first increment: minimal
PrivateKey/PublicKeyinterfaces plusAsymmetricAlgorithm,SignatureScheme,PublicKeyFormat, andKeyPurposeinschema/capos.capnp, backed by host-tested RAM-only P-256 signing incapos-tls. This proves TLS-vs-ACME purpose separation and public export without raw private-key export. - Landed second increment: RAM-only
KeyVaultgeneration/open/list/destroy,KeyHandle, source metadata, audit labels, and stale-handle fail-closed behavior for TLS and ACME local proofs. - Landed third increment: development-only software
KeySourcebootstrap that mints TLS and ACME account keys into the RAMKeyVaultwithout manifest or evidence private-key bytes, and rejects production/public profiles. - Landed fourth increment: minimal RAM-only
SymmetricKeyABI plus XChaCha20 stream encryption with HMAC-SHA256 authentication and HMAC-SHA256 MAC/verify cores. The local QEMU proof covers encrypt/decrypt, tag failure, MAC verification, purpose failure, and operation denial without logging raw key material or generated metadata. - Remaining Phase 1 surface: production/runtime
KeySourceservices, symmetric-key derivation and wrapping, and any broader enum/struct metadata those services need. - Implement a RAM-only key service using vetted Rust crates
(
aes-gcm-siv,chacha20poly1305,ed25519-dalek,x25519-dalek,p256,rsa,hmac,hkdf). No persistence. Pure interface exercise. ManifestEmbeddedKeySourcefor dev/CI.- Host tests: AEAD round-trips, signature round-trips, key agreement, fuzz the decrypt/verify paths.
Phase 2 — KeyVault with in-memory storage
- Landed local-proof subset: RAM-only key generation, handle-based lookup, metadata listing, destroy, and stale-handle refusal.
- Remaining production-oriented surface: sealed blob storage.
rotateSealimplementation (metadata-only KEK rotation).- Policy enforcement for seal/unseal.
- Audit cap integration (System Monitoring).
Phase 3 — Persistent KeyVault over the Store
- Sealed blobs live in a Store or Namespace.
- Access control:
KeyVaultcap is itself attenuable (read-only, purpose-filtered). - Cross-reboot survival requires the Store, which requires persistent
storage tracked in
docs/roadmap.md.
Phase 4 — PassphraseKeySource and PasskeyPrfKeySource
- Passphrase flow wires into console login.
- PasskeyPRF flow wires into WebAuthn assertions from the web text shell gateway.
- Per-user
EncryptedNamespacebecomes implementable end-to-end.
Phase 5 — Tpm2KeySource
- TPM 2.0 driver as a userspace service (separate crate; talks to the TPM over x86 platform TIS or a virtio passthrough in cloud VMs).
- Seal policies bound to PCR digests.
- Measured-boot chain definition (firmware → bootloader → kernel → init → key service). PCR composition documented.
Phase 6 — CloudKmsKeySource
- AWS KMS first; GCP KMS, Azure Key Vault, HashiCorp Vault, KMIP follow.
- Depends on
InstanceIdentityfrom cloud-metadata and a functioning network stack. - Cross-region / cross-account grant handling documented.
Phase 6b — OidcFederatedKeySource
- Depends on
OAuthClientandWorkloadIdentityFederationfrom OIDC and OAuth2. - Workload identity federation to cloud KMS (no baked long-lived
IAM credentials). Subject token sources:
InstanceIdentity, attestation report envelope, Kubernetes projected token, GitHub Actions OIDC. - Per-user volume unlock via user
AccessTokenagainst a capOS-internal key service honoringSealPolicy.tokenExchange. TokenExchangePolicyenforcement for seal/unseal.
Phase 7 — AttestationKeySource
- SEV-SNP, TDX, or Nitro — whichever the first target cloud environment requires.
- Verifier can be cloud KMS with attestation policy or a standalone service.
Phase 8 — Post-quantum migration
- Add ML-KEM and ML-DSA to the algorithm enums when capOS picks its
PQ stack. Primarily a schema evolution and an added
sign/agreepath; no change to the interface shape.
Relationship to Other Proposals
volume-encryption-proposal.md— primary first consumer.EncryptedBlockDeviceFactory.open(raw, key, format)andEncryptedNamespaceboth take aSymmetricKeycap defined here (KeyPurpose.blockVolume/objectStore, typicallyaes256GcmSiv/aes256Xts/xchacha20Poly1305). Per-user session unlock invokesPasskeyPrfKeySource.unlockSymmetric(Phase 4) to mint the user DEK; system volumes unwrap a DEK throughTpm2KeySourceorCloudKmsKeySource(Phases 5–6).KeyVaultowns the sealed DEK blob and appliesSealPolicyon every unlock;rotateSealis how that proposal achieves KEK rotation without rewriting volume data.boot-to-shell-proposal.md—CredentialStorestores authenticator verifiers;PasskeyPrfKeySourcehere produces keys from assertions that passCredentialStoreverification.networking-proposal.md— TLS and mTLS needPrivateKey/PublicKey; instance mTLS bootstraps from aCloudKmsKeySourceorKeyVault-issued service identity key.ssh-shell-proposal.md— SSH host keys are sign-onlyPrivateKeywrappers backed byKeyVault; accepted OpenSSH-format public keys are verifier material that map to sessions but never grant shell authority directly.certificates-and-tls-proposal.md— layers X.509, trust stores, CT, OCSP, pinning, ACME, and TLS config on top of the keys defined here.TlsServerConfig.key()andTlsClientConfig.clientAuth()return aPrivateKeycap minted by this proposal, typically generated byKeyVault.generatePrivate( algorithm, KeyPurpose.tls, policy). ACME account JWS signing uses a purpose-separatedKeyPurpose.acmeAccountkey; ACME enrollment (AcmeClient.requestCertificate(orderId, certKey, ...)) consumes the TLS certificatePrivateKeyfrom the sameKeyVault. CA private keys live inKeyVaultunder a strictSealPolicy(typicallypcror composite KMS + attestation). Public material flows throughPublicKey.export(PublicKeyFormat.spkiDer)into that proposal’s certificate chain and trust-store structures, so this proposal’s cap boundary is the only place TLS private material is reachable.oidc-and-oauth2-proposal.md— OIDC/OAuth2 client, token, JWKS, JWT wrapper, DPoP, and workload identity federation caps compose with the keys defined here.OidcFederatedKeySourceandSealPolicy.tokenExchange(withTokenExchangePolicy/NamedClaim/minAuthStrength) live in this proposal because they are key-source shapes; the token protocol frame, discovery, JWKS handling, grant types, and verifier live there.JwtSignerandJwtVerifierare thin wrappers defined there that hold aPrivateKey/PublicKeyfrom here and bind it to a fixed(issuer, audience, claim_constraints)tuple before emitting compact-serialized JWTs.KeyPurpose.oauthClientAssertiontags the key thatClientAuthMethod.privateKeyJwtandlocalPrivateKeyJwtsign with (RFC 7523 §2.2 client assertion against the token endpoint or a local STS).KeyPurpose.oidcIdTokentags the IdP-side signing key held byLocalIdentityProviderand published in itsJwksrotation set.KeyPurpose.dpopBindingtags the per-client DPoP keypair surfaced asDpopKeysoAccessTokenresults stayjkt-bound (RFC 9449). Token-exchange-gated unlock flows in Phase 6b consumeAccessTokenandWorkloadIdentityFederationcaps from that proposal and feed the cloud KMS or capOS-internal key service named inTokenExchangePolicy.audience.cloud-metadata-proposal.md—InstanceIdentitycap consumed byCloudKmsKeySourceandAttestationKeySource.user-identity-and-policy-proposal.md— per-user keys are bound to session identity; the same cap chain that says “you are Alice” yields Alice’sSymmetricKeyviaPasskeyPrfKeySource.cloud-deployment-proposal.md— hardware abstraction for self-encrypting drives sets up a futureSelfEncryptingBlockDevicecap with hardware-held keys, a distinct trust model from software-crypto keys here.security-and-verification-proposal.md— crypto is a top target for tiered tooling: constant-time linting, AEAD fuzzing, Loom models of the unlock state machine, Kani-style proofs of nonce-uniqueness.system-monitoring-proposal.md— everyKeymethod call, everyKeyVaultoperation, and everyKeySource.unlockshould flow through the audit cap. Schema for audit events is defined there; key-management produces a specific event family.hardware-audit-persistence-proposal.md— the DDF audit step 1 schema (SegmentHeaderand durable-pathHardwareAuditRecordfields, landed inschema/capos.capnp) can useSymmetricKey.mac(HMAC,KeyPurpose.integrity) andPrivateKey.sign(asymmetric signing) to seal each audit segment.KeyPurpose.auditis the intended tag for signing keys held by the audit log service. Phase 1 of this proposal (RAM-only key service) is the minimum prerequisite for that signing path to become functional.formal-mac-mic-proposal.md— includes GOST-style modeling. GOST symmetric (Kuznyechik, Magma) and asymmetric (Streebog-signed schemes) algorithms can be added to the enums when a deployment requires them.storage-and-naming-proposal.md— Open Question #5 (manifest trust, secure boot) is a prerequisite forTpm2KeySourceto be meaningful.../design-risks-register.md— R14 (durable identity / session liveness) lists this proposal among its owners: per-userEncryptedNamespaceunlock, session-token HMAC keys, andLocalIdentityProviderID-token signing keys all live behindKeyVaultandKeySourcehere, so durable identity work cannot land before persistentKeyVault(Phase 3) plusPassphraseKeySource/PasskeyPrfKeySource(Phase 4) do.
Open Questions
- Canonical algorithm set for v1. Overshooting the enum invites
implementation sprawl; undershooting forces schema evolution
early. Proposed minimum:
aes256GcmSiv,xchacha20Poly1305,hmacSha256,ed25519,x25519. Addrsa*,p256, post-quantum as real consumers arrive. - Does
SymmetricKeyexpose raw encrypt-without-AAD? AEAD with empty AAD is trivially expressible, but some callers may want explicit guarantees that non-AEAD modes are unavailable. Decide whether the interface permitsaad == Data()universally or whetherKeyPurposeconstrains it. - Public key distribution.
PublicKeyis a cap, but public material is public — should there be a “public key is freely-shareable bytes” escape hatch outside the cap system? Probably yes;export()exists for exactly that reason. How does a caller obtain aPublicKeycap from raw bytes? Via aPublicKeyImporterfactory that verifies format, or directly inKeyVault.importPublic? - Revocation of in-flight caps. If a
SymmetricKeycap is granted to 10 services and the key is compromised, can the issuer revoke it? capOS cap revocation is generally “drop at each holder”; this might warrant aKeyVault.revoke(handle)that breaks the server-side object so everyencrypt/decryptreturns an error. Worth designing explicitly rather than leaving implicit. - Audit record granularity. Logging every
encryptcall for a high-throughput volume is noisy; logging only unseal events misses oracle abuse. Probably: unseal and policy-violation events are always logged; per-operation logging is a per-KeyVaultpolicy, off by default. - Key-use quotas. Rate-limit
decryptoperations per cap-holder to contain oracle abuse? Nice to have; not clear whether it belongs at theKeyinterface or at aKeyVaultpolicy. - HSM integration.
PKCS#11is the de facto standard for HSM access. Does capOS grow aPkcs11KeySource, or does each HSM vendor ship a capability-native driver? The cap-native path is cleaner but depends on vendor cooperation. - Backwards compatibility with stored blobs.
SealPolicy, algorithm IDs, and seal blob formats will evolve. Define a versioned envelope around every sealed blob from day one, so rolling upgrades are possible. - Side-channel guarantees per implementation. Document the
expectation for each
KeyAlgorithm(e.g. “constant-time required foraes*; use theaescrate’s hardware backend on x86_64 and bit-sliced implementation elsewhere”). Without this, the security posture varies silently across builds. - GOST and other jurisdiction-mandated algorithms. The
formal-mac-mic-proposal.mdcarves out a GOST-style track. Adding Kuznyechik, Magma, and Streebog-signed schemes is an additive extension; what matters is that the enums stay forward- compatible so a GOST-capable build does not require a schema fork.
Proposal: Certificates, TLS, and Certificate Transparency
Capability-native abstractions for X.509 certificates, trust stores, chain verification, Certificate Transparency (CT), revocation, pinning, automated issuance (ACME), and the TLS contexts built from all of these.
Implementation Status
The schemas and Phase 1-9 ordering below are design beyond the landed Phase 1
subset: vendored WebPKI roots, capos-tls host verifier logic, and the
Certificate / CertificateChain / TrustStore / CertVerifier schema
surface. The remaining near-term work is decomposed into a bounded slice chain
owned by Certificates / TLS and the
Certificates / TLS track in docs/tasks/README.md. The cut lands the lowest-risk real
logic first. The Phase 2 client local proof landed on 2026-06-08: a userspace
TLS 1.3 client completes one handshake over a userspace-served TcpSocket cap
with a vendored embedded-tls state machine while validating the peer chain
with capos-tls. The
key-management proposal now has
the minimal PrivateKey / PublicKey ABI, RAM signing core, and RAM-only
KeyVault custody plus a development-only software KeySource for local
TLS/ACME proofs, but no production custody source yet. Production/public
server-side TLS remains blocked on reviewed custody and a server cert source:
- Phase 1 deps [DONE 2026-06-03]. vendor
rustls-webpki+webpki-rootsas no_std+alloc snapshots with provenance:cloud-tls-vendor-rustls-webpki-roots-no-std-provenance. - Phase 1 [DONE 2026-06-03].
Certificate/CertificateChain/TrustStore/CertVerifierschema + host-tested verify logic over a RAM-only webpki-roots store:cloud-tls-cert-truststore-certverifier-phase1-host-proof. - Phase 2 (client) [DONE 2026-06-08]. One userspace TLS client handshake
over the Phase C userspace
TcpSocketcap, validating the peer chain with the Phase 1 verifier and a vendoredembedded-tlsTLS 1.3 state machine:cloud-tls-client-handshake-over-tcpsocket-local-proof. - Phase 2 (server consumer) – capOS-terminated TLS for the self-hosted Web
UI (the direct-termination successor to the provider-terminated bootstrap
below, not the closeout path for the first public proof), blocked additionally
on a sealed
PrivateKeycap and a server cert source:cloud-tls-self-hosted-webui-terminated-endpoint. - Minimal TLS/ACME key custody [DONE for local proofs]. The TLS server key and ACME
account key need a
PrivateKey/KeyVault/KeySourcesubset. The minimalPrivateKey/PublicKeyABI and RAM signing proof landed 2026-06-04; RAMKeyVaultcustody landed 2026-06-05; development-only softwareKeySourcebootstrap landed 2026-06-05. - Phase 3 (ACME successor chain) [PARTIAL]. The local ACME account/order
core landed on 2026-06-08:
capos-tlssigns ES256 JWS requests through anAcmeAccountPrivateKeycap, submits a CSR signed by a TLS-purpose key cap, and parses a returned local test certificate chain. Remaining Phase 3 work is scopedhttp-01challenge solving,CertificateStore.watchrenewal/rotation, and then a public GCE capOS-terminated direct-termination proof. These are successor tasks after the provider-managed first public proof, not replacements for it:cloud-tls-acme-account-order-local-proof[DONE 2026-06-08],cloud-tls-acme-http01-challenge-solver-local-proof,cloud-tls-acme-renewal-certstore-rotation-local-proof, andcloud-gce-public-webui-letsencrypt-direct-termination-proof.
Phases 4-9 (OCSP, CT, pinning, CRL, private CA) remain undecomposed design.
Why a Separate Proposal
Keys and certificates are related but different concerns. Keys are secret material whose contract is “compute with me.” Certificates are public assertions whose contract is “believe this identity, if the chain and CT/revocation evidence pass policy.” The two failure modes (key compromise vs. mis-issuance, revocation vs. renewal, HSM custody vs. CA trust) barely overlap.
Cryptography and Key Management
already covers SymmetricKey, PrivateKey, PublicKey, KeySource,
and KeyVault. This proposal covers everything on top: certificates,
trust anchors, CT logs, OCSP, CRLs, pinning, ACME, and TLS
configuration. A TLS server is composed from a PrivateKey cap (from
the key proposal) plus the certificate/verification/revocation caps
defined here.
Two adjacent proposals draw their own trust boundaries instead of extending this one:
- OIDC and OAuth2 tokens are not X.509.
OIDC and OAuth2 covers
short-lived bearer tokens (ID tokens, access tokens, DPoP proofs,
client assertions) signed by JWKS-published keys, not by X.509
trust chains. Where an OIDC issuer’s
private_key_jwtclient assertion or workload-identity federation flow does need an X.509 cert, the signing key is aPrivateKeycap from the key proposal and the cert is aCertificatecap from this one. The token capability objects, JWKS verifier, and DPoP machinery live in the OIDC proposal; this proposal only supplies the verifier when an OIDC flow happens to land on an X.509 binding. - SSH host keys are not X.509 certs.
SSH Shell Gateway uses raw SSH
host-key signatures (
SshHostKey.signExchangeHash) and TOFU/authorized-key trust, not WebPKI chains. The host key is a narrow wrapper around aPrivateKeycap from the key proposal, constrained to SSH host-key signing; this proposal’sCertificate,TrustStore,CertVerifier, and ACME flow are not consumed by the SSH transport. SSH and TLS/mTLS are intentional siblings — SSH for raw operator/agent access without a CA, TLS for PKI-integrated services.
Problem
capOS will need certificate and TLS infrastructure for:
- TLS termination in the web text shell gateway (Boot to Shell).
- mTLS between services on a multi-host capability graph
(Networking). TLS wraps the
TcpSocketcap defined there; in Phase A-B that socket state is kernel-resident smoltcp, and TLS sees it through the same cap boundary after Phase C migrates the stack to userspace. - WebAuthn attestation statement verification (Boot to Shell).
- Code signing verification for binaries, boot manifests, update bundles (Storage and Naming Open Question #5).
- Cloud KMS HTTPS API clients
(Cryptography and Key Management
CloudKmsKeySource). - Attestation report verification chains
(Cryptography and Key Management
AttestationKeySource). - Any outbound HTTPS client invoked from a service.
Without a shared abstraction each consumer invents its own “where do
trust anchors live”, its own CT policy (or skips CT silently), its
own revocation story (or skips revocation silently), and its own
config surface for rustls. That is how the Linux ecosystem ended up
with /etc/ssl/certs, NSS, GnuTLS’ own store, OpenSSL’s SSL_CTX,
update-ca-certificates, and per-language HTTPS clients with
divergent trust policies. capOS is young enough to avoid that.
Design Principle: Certificates Are Typed Capabilities
A certificate in capOS is a Certificate CapObject, not an opaque
byte blob flowing between services. Trust evaluation, CT and
revocation policy, and TLS configuration are expressed as cap
compositions — never as well-known paths (/etc/ssl/certs) or
library singletons (rustls::RootCertStore::load_native_certs()).
Consequences mirror the key-cap case:
- Attenuation by scope. A service that only needs to verify one
signer receives a
TrustStorecap containing that one anchor, not the full Mozilla root bundle. A service that must not bypass CT receives aCertVerifierwhose policy hasminScts >= 2; no method on that cap lets the caller lower the bar. - Revocation is a cap drop. A compromised anchor is removed from
the
TrustStoreit lives in; holders of a stale restricted view that still trusts it keep trusting it until they pick up the new version. No library’s “just reload the roots” ambient step. - Audit is intrinsic. Every
verifyChain, everyaddAnchor, every OCSP query flows through the audit cap. A service that bypasses revocation shows up in the audit log as a service that stopped callingOcspResponder.status. - Rotation without restart. A TLS server holds a
CertificateStore.watchsubscription; when an ACME renewal lands a fresh chain under the server’s handle, the TLS stack swaps chains on the next handshake. No filesystem signaling, no SIGHUP, no “reloaded 0 of 1 certs” log lines. - Composition, not configuration. A
TlsServerConfigis a cap that encapsulates the key, chain source, stapler, client-auth verifier, and cipher policy. Building a TLS server means acquiring those caps and composing them, not filling in a struct with raw bytes.
Schemas
Certificates and chains
interface Certificate {
# Raw DER encoding — for logging, CT submission, export.
der @0 () -> (encoded :Data);
# Structured fields — callers should prefer these over re-parsing.
subject @1 () -> (name :DistinguishedName);
issuer @2 () -> (name :DistinguishedName);
serial @3 () -> (bytes :Data);
notBefore @4 () -> (epochSeconds :Int64);
notAfter @5 () -> (epochSeconds :Int64);
subjectAltNames @6 () -> (names :List(GeneralName));
# Public key as a cap — callers verify signatures through this.
publicKey @7 () -> (pk :PublicKey);
# Extensions the platform cares about. Returning typed views
# forces the implementation to parse once.
keyUsage @8 () -> (usage :KeyUsageFlags);
extendedKeyUsage @9 () -> (ekus :List(ExtendedKeyUsage));
basicConstraints @10 () -> (ca :Bool, pathLenConstraint :Int32);
nameConstraints @11 () -> (constraints :NameConstraints);
# Embedded SCTs (RFC 6962 §3.3). Callers that only allow
# CT-qualified certs filter on this.
embeddedScts @12 () -> (scts :List(SignedCertificateTimestamp));
# Must-staple marker (RFC 7633).
mustStaple @13 () -> (required :Bool);
# Fingerprint used for pinning, logging, and human display.
fingerprint @14 (hash :HashAlgorithm) -> (digest :Data);
info @15 () -> (kind :CertificateKind,
algorithm :AsymmetricAlgorithm);
}
interface CertificateChain {
# Leaf first, root (or closest-to-root) last. Length-one chains are
# permitted (self-signed leaf).
certificates @0 () -> (chain :List(Certificate));
leaf @1 () -> (cert :Certificate);
# Convenience: verify this chain against a trust store using a
# given verifier. Shortcuts the CertVerifier flow for simple cases.
verify @2 (against :TrustStore,
verifier :CertVerifier,
atEpochSeconds :Int64,
hostname :Text)
-> (outcome :VerificationOutcome);
}
enum CertificateKind {
endEntity @0;
intermediate @1;
trustAnchor @2;
crossSigned @3;
}
GeneralName, DistinguishedName, KeyUsageFlags,
ExtendedKeyUsage, and NameConstraints are plain struct/enum
definitions mirroring RFC 5280 (omitted here for brevity).
Trust stores
interface TrustStore {
# List anchors as WebPKI trust-anchor records. Mozilla/WebPKI roots may not
# be representable as full Certificate caps.
anchors @0 () -> (anchors :List(TrustAnchorInfo));
# Attenuate to a subset (e.g. only WebPKI roots, only corporate
# CAs, only a specific CA). The resulting cap is a fresh
# TrustStore that no longer references anchors outside the filter.
restrict @1 (filter :TrustFilter) -> (subset :TrustStore);
# Add a trusted anchor. Only holders with write authority succeed;
# read-only TrustStore caps reject this method.
addAnchor @2 (cert :Certificate, pin :AnchorPin) -> ();
# Remove an anchor. Matches either fingerprint or subject DN.
removeAnchor @3 (selector :AnchorSelector) -> ();
# Monotonic version bumped on every mutation; consumers cache by
# version to avoid revalidating unchanged trust chains.
version @4 () -> (n :UInt64);
}
struct TrustFilter {
purposes @0 :List(CertPurpose); # Only anchors usable for these
fingerprints @1 :List(Data); # Allow-list by SHA-256
subjects @2 :List(Data); # Allow-list by subject DN
excludeFingerprints @3 :List(Data); # Deny-list
}
struct AnchorPin {
spkiHash @0 :Data; # SHA-256 of SPKI
hashAlgorithm @1 :HashAlgorithm;
}
struct AnchorSelector {
union {
fingerprint @0 :Data;
subject @1 :DistinguishedName;
}
}
enum CertPurpose {
tlsServerAuth @0;
tlsClientAuth @1;
codeSigning @2;
emailSmime @3;
clientIdentity @4;
ctLog @5; # TrustStore of CT log public keys
ocspSigning @6;
webauthnRoot @7; # FIDO metadata / attestation roots
}
Verifier
interface CertVerifier {
verifyChain @0 (chain :CertificateChain,
trust :TrustStore,
purpose :CertPurpose,
atEpochSeconds :Int64,
hostname :Text)
-> (outcome :VerificationOutcome);
# Thin wrapper over a single signature check against a cert's
# public key. Useful for WebAuthn attestation, signed manifests,
# signed audit records.
verifySignature @1 (cert :Certificate,
message :Data,
signature :Data,
scheme :SignatureScheme)
-> (ok :Bool);
policy @2 () -> (policy :VerificationPolicy);
}
struct VerificationPolicy {
minScts @0 :UInt8;
ctLogs @1 :TrustStore; # which logs count
allowedAlgorithms @2 :List(AsymmetricAlgorithm);
allowedSignatureSchemes @3 :List(SignatureScheme);
requireOcsp @4 :Bool;
maxChainLength @5 :UInt8;
permitNameConstraints @6 :Bool;
clockSkewSeconds @7 :UInt32;
# When set, certificates not carrying the must-staple extension
# are still required to deliver a stapled OCSP response.
staplingRequired @8 :Bool;
}
struct VerificationOutcome {
union {
valid @0 :ValidChain;
invalid @1 :VerificationFailure;
}
}
struct ValidChain {
anchor @0 :TrustAnchorInfo;
sctCount @1 :UInt8;
ocspStatus @2 :OcspStatus;
notAfter @3 :Int64; # min notAfter across the verified path
}
struct VerificationFailure {
reason @0 :FailureReason;
detail @1 :Text;
}
enum FailureReason {
unknownAnchor @0;
expired @1;
notYetValid @2;
signatureMismatch @3;
nameMismatch @4;
insufficientScts @5;
revoked @6;
ocspUnavailable @7;
weakAlgorithm @8;
policyViolation @9;
badEku @10;
chainTooLong @11;
nameConstraintViolation @12;
mustStapleMissing @13;
pinMismatch @14;
}
Default VerificationPolicy presets:
webPkiStrict—minScts = 2,requireOcsp = true, allowed algorithms and schemes drawn from Mozilla’s “modern” profile.webPkiLenient—minScts = 0,requireOcsp = false. Used by low-value clients where misrouting is acceptable.privateMtls—minScts = 0,requireOcsp = true,maxChainLength = 3. Used between capOS services holding CA-issued identity certs.codeSigning—minScts = 0, longnotAftertolerances, narrow allowed EKU set.
Certificate Transparency
capOS treats CT as a first-class verification input, not an add-on.
Consumers that need WebPKI trust configure a CertVerifier with
minScts >= 2 and a ctLogs trust store; verification fails closed
if the leaf lacks that many valid SCTs signed by logs the policy
accepts.
struct SignedCertificateTimestamp {
logId @0 :Data; # SHA-256 of the log's public key
timestamp @1 :UInt64; # ms since epoch
extensions @2 :Data;
signature @3 :Data;
hashAlgorithm @4 :HashAlgorithm;
signatureAlgorithm @5 :SignatureScheme;
origin @6 :SctOrigin;
}
enum SctOrigin {
embedded @0; # X.509 extension (RFC 6962 §3.3)
ocspStapled @1; # OCSP response extension
tlsExtension @2; # TLS handshake extension
}
interface CtLog {
# Submission — used by ACME responders and capOS-internal CAs to
# obtain SCTs before serving newly issued certs.
addChain @0 (chain :CertificateChain)
-> (sct :SignedCertificateTimestamp);
addPreChain @1 (precert :CertificateChain)
-> (sct :SignedCertificateTimestamp);
# Monitoring — STH, entries, consistency proofs.
signedTreeHead @2 () -> (sth :SignedTreeHead);
entries @3 (start :UInt64, count :UInt32)
-> (entries :List(LogEntry));
consistencyProof @4 (first :UInt64, second :UInt64)
-> (proof :List(Data));
info @5 () -> (name :Text,
publicKey :PublicKey,
url :Text);
}
interface CtMonitor {
# Watch for certificates issued under a subject-name pattern (for
# phishing / mis-issuance detection). Events flow to the audit cap.
watchSubject @0 (pattern :Text) -> (subscription :CtSubscription);
listWatched @1 () -> (subscriptions :List(CtSubscription));
}
interface CtSubscription {
events @0 () -> (events :List(CtEvent)); # since last call
cancel @1 () -> ();
}
struct SignedTreeHead {
treeSize @0 :UInt64;
timestamp @1 :UInt64;
rootHash @2 :Data;
signature @3 :Data;
}
struct LogEntry {
index @0 :UInt64;
timestamp @1 :UInt64;
entryType @2 :CtEntryType;
certificate @3 :Data; # ASN.1 TimestampedEntry payload
}
enum CtEntryType {
x509Entry @0;
precertEntry @1;
}
struct CtEvent {
union {
observed @0 :CtObservation;
error @1 :CtWatchError;
}
}
struct CtObservation {
log @0 :Text; # log name or URL
index @1 :UInt64;
certificate @2 :Certificate;
matched @3 :Text; # matched pattern
}
CT integration depends on networking and audit being available. A
capOS build without networking falls back to minScts = 0 and skips
monitoring. The CtMonitor service is optional — its absence means
capOS does not detect mis-issuance against its own domains but does
not affect leaf verification, which uses only embeddedScts and any
SCTs delivered in the TLS handshake.
The log trust store (the ctLogs field of VerificationPolicy) is
itself a TrustStore cap, populated from Chrome’s CT log list with
the same bundling and signing approach used for WebPKI roots. CT logs
are rotated regularly; the log list is the first place a deployment
without fresh updates starts failing in a visible way, which is the
intended failure mode.
Revocation
interface OcspResponder {
# Query an OCSP responder for status. `issuer` supplies the cert
# used to verify the responder signature chain back to a trust
# anchor.
status @0 (cert :Certificate,
issuer :Certificate,
atEpochSeconds :Int64)
-> (response :OcspResponse);
}
interface OcspStapler {
# TLS server side: fetch and cache an OCSP response for the
# server's own certificate. The TLS stack staples the cached
# response into every handshake.
currentResponse @0 () -> (response :OcspResponse);
refresh @1 () -> ();
setCertificate @2 (chain :CertificateChain,
responder :OcspResponder) -> ();
}
interface CrlStore {
# Look up a CRL for a given issuer DN; fallback when OCSP is
# unavailable. Discouraged; CRLs do not scale.
crlFor @0 (issuer :DistinguishedName) -> (crl :Data);
contains @1 (issuer :DistinguishedName, serial :Data)
-> (revoked :Bool);
}
struct OcspResponse {
der @0 :Data; # RFC 6960 DER-encoded response
status @1 :OcspStatus;
thisUpdate @2 :Int64;
nextUpdate @3 :Int64;
}
enum OcspStatus {
good @0;
revoked @1;
unknown @2;
stapledAbsent @3; # handshake carried no stapled response
responderUnreachable @4;
}
Policy choices capOS bakes into the defaults:
VerificationPolicy.requireOcsp = truemeans OCSP-unreachable is a hard verification failure. Default forCertPurpose.tlsClientAuthon services facing untrusted networks; soft-fail otherwise.- A certificate carrying the
id-pe-tlsfeaturemust-staple extension fails verification if no stapled response is present, regardless ofrequireOcsp. VerificationPolicy.staplingRequired = trueextends must-staple behavior to all certs checked under that verifier, not only the ones that set the extension.- CRL support exists for legacy compatibility and explicit code-signing fallback. Services that can choose prefer OCSP stapling, which pulls revocation latency to handshake time without leaking the client’s identity to the responder.
Pinning
interface PinSet {
# A pin set is a list of (SPKI-hash, algorithm) pairs. Verification
# succeeds only if at least one cert in the chain has an SPKI hash
# matching a pin.
pins @0 () -> (entries :List(Pin));
enforce @1 (chain :CertificateChain) -> (outcome :VerificationOutcome);
addPin @2 (pin :Pin) -> ();
removePin @3 (pin :Pin) -> ();
info @4 () -> (mode :PinMode, expires :Int64);
}
struct Pin {
spkiHash @0 :Data;
hashAlgorithm @1 :HashAlgorithm;
}
enum PinMode {
enforce @0; # fail closed on mismatch
reportOnly @1; # succeed; emit audit event
}
A PinSet restricts an already-trusted chain; it does not add trust.
Composition is intersection: trust + CT + OCSP + pins must all pass
for verification to succeed. Pin sets are per-consumer; the web shell
gateway’s client-side ACME challenge fetches do not share a pin set
with the fleet mTLS layer.
Issuance and renewal
ACME is the only supported issuance protocol for v1. Challenge solvers
are caps so the ACME client has no ambient authority over DNS or the
HTTP server. Self-signing and internal-CA use cases are covered by a
separate CertificateAuthority cap (future work, see Open Questions).
interface AcmeClient {
# Register or rediscover an account using an account key cap.
register @0 (accountKey :PrivateKey, contact :List(Text))
-> (account :AcmeAccount);
# Order a certificate for a list of identifiers.
order @1 (account :AcmeAccount,
identifiers :List(AcmeIdentifier),
certKey :PrivateKey,
solver :ChallengeSolver)
-> (chain :CertificateChain);
# Renew a previously-issued chain when notAfter is near.
renew @2 (chain :CertificateChain,
certKey :PrivateKey,
solver :ChallengeSolver)
-> (chain :CertificateChain);
# Revoke a cert.
revoke @3 (cert :Certificate, reason :RevocationReason) -> ();
directory @4 () -> (url :Text, meta :AcmeDirectoryMeta);
}
interface ChallengeSolver {
# Publish a challenge token and wait for the ACME server to
# validate. The solver owns whatever authority is required —
# DNS record write, HTTP server handler registration, TLS-ALPN
# responder slot — and nothing more.
solve @0 (challenge :AcmeChallenge) -> (ok :Bool);
cleanup @1 (challenge :AcmeChallenge) -> ();
supports @2 () -> (types :List(AcmeChallengeType));
}
enum AcmeChallengeType {
http01 @0;
dns01 @1;
tlsAlpn01 @2;
}
struct AcmeIdentifier {
type @0 :Text; # "dns", "ip", ...
value @1 :Text;
}
interface CertificateStore {
# Store a certificate chain under a stable handle; used by TLS
# servers to retrieve the current chain on handshake.
put @0 (handle :Text, chain :CertificateChain) -> ();
get @1 (handle :Text) -> (chain :CertificateChain);
list @2 () -> (handles :List(Text));
delete @3 (handle :Text) -> ();
watch @4 (handle :Text) -> (subscription :CertSubscription);
}
interface CertSubscription {
events @0 () -> (events :List(CertRotationEvent));
cancel @1 () -> ();
}
struct CertRotationEvent {
handle @0 :Text;
newChain @1 :CertificateChain;
rotatedAt @2 :Int64;
}
The CertificateStore.watch subscription is the point at which an
ACME renewal service notifies a TLS server to rotate its chain. The
TLS server does not poll files, no filesystem signaling is involved,
and rotation is atomic from a handshake’s perspective.
TLS configuration
interface TlsServerConfig {
key @0 () -> (k :PrivateKey);
chainSource @1 () -> (store :CertificateStore, handle :Text);
stapler @2 () -> (s :OcspStapler);
# Optional: require client auth against these verifier + trust
# caps. If unset, the server accepts any client or no client.
clientVerifier @3 () -> (v :CertVerifier, trust :TrustStore);
alpn @4 () -> (protocols :List(Text));
minVersion @5 () -> (v :TlsVersion);
cipherPolicy @6 () -> (policy :CipherPolicy);
}
interface TlsClientConfig {
verifier @0 () -> (v :CertVerifier);
trust @1 () -> (t :TrustStore);
pins @2 () -> (p :PinSet); # null for no pinning
clientAuth @3 () -> (k :PrivateKey, chain :CertificateChain);
alpn @4 () -> (protocols :List(Text));
minVersion @5 () -> (v :TlsVersion);
serverNameOverride @6 () -> (host :Text);
}
enum TlsVersion {
tls12 @0;
tls13 @1;
}
enum CipherPolicy {
modern @0; # TLS 1.3 + AEAD only; Mozilla "modern"
intermediate @1; # TLS 1.2 + 1.3; Mozilla "intermediate"
legacy @2; # Explicit opt-in for ancient peers
}
The TLS stack consumes a TlsServerConfig or TlsClientConfig cap plus a raw
TcpSocket and produces a TlsSocket. The first landed local client proof uses
embedded-tls directly over a userspace-served TcpSocket; the broader
config-cap service surface remains the Phase 2 TLS-service design. The
TlsSocket draft interface lives in the
“TLS Layering” section of
Networking; this proposal only
defines the configuration surface. While TcpSocket state remains
kernel-resident through Phase A-B of the networking proposal, the
TLS stack itself is a userspace consumer of that cap and does not
move into the kernel — the certificate parser, path builder, and
TLS state machine all run in the userspace TLS service.
Trust Anchor Bootstrap
The v1 trust anchor bundle is Mozilla’s NSS store, synthesized from
the webpki-roots crate data embedded in the boot manifest. Rationale:
the bundle is well-curated, auditable (Mozilla’s CA Certificate
Program publishes policy and meeting minutes), and already the de
facto default for every Rust TLS stack. capOS does not invent a new
root program.
CT log lists follow the same pattern, drawn from Chrome’s published CT log list.
Update policy:
- Root-store bundles are versioned and signed.
addAnchoron the systemTrustStoreis restricted to the trust-admin service, which accepts bundles whose signature chains to a build-time key embedded in the boot manifest. - Deployment overrides (corporate CAs, explicit Mozilla-root removal)
compose with the Mozilla bundle via
TrustStore.restrictandaddAnchoron an override store. Overrides are themselves signed and manifest-addressable. - Replacement ships as a manifest update (see Storage and Naming Open Question #5 on manifest signing).
The manifest-embedded root store has no background network update
path by design. A compromised root requires a new signed manifest,
which requires the measured-boot chain. Root updates are a deliberate
operational event, not a silent refresh. This is a deliberate
trade-off against the Linux-style ca-certificates package that
updates on every apt run.
Bootstrap TLS for the First Public GCE Web UI
The schemas above are no longer entirely pre-implementation design: the Phase 1
verifier, Phase 2 client handshake over a userspace TcpSocket, local
key-custody precursors, and the local ACME account/order/finalize core have
landed. The server-side TlsServerConfig / TlsSocket consumer, scoped
http-01, CertificateStore.watch renewal, production key custody, and later
CT/OCSP/pinning surfaces remain future or blocked work. The first time the
self-served capOS Web UI (remote-session-web-ui) is exposed to a public
operator browser on GCE, capOS therefore still does not terminate TLS
itself. The reviewed first ingress terminates HTTPS at the GCP external load
balancer’s Google front end against a provider-managed certificate; capOS serves
only plain HTTP/1.1 on a backend port reachable solely from the load balancer
and health-check source ranges. The full posture (firewall scope, browser
session rules, evidence, teardown) is recorded in the “Public Web UI Ingress
Policy” section of
Cloud Deployment and the on-hold
public Web UI ingress task;
this note records only where TLS terminates and who holds the key.
Bootstrap consequences specific to this proposal:
- No capOS private-key custody in the first proof. The TLS private
key stays on the provider side. No
PrivateKeycap,KeyVault, orKeySourcefrom Cryptography and Key Management is consumed for the first public Web UI endpoint, and no key material is written into the disk image, manifest, or evidence directory. - No capability-native verification on the public hop. Because the
Google front end performs TLS, the
Certificate,TrustStore,CertVerifier,OcspStapler, and ACME flows defined here are not exercised by the first public Web UI proof. Provider-managed certificate lifecycle (issuance, renewal, revocation) is the provider’s, not capOS’s. - Successor path is the direct-termination shape. When this
proposal’s
TlsServerConfigplus anAcmeClient/ChallengeSolver(Phases 2-3) ship over the userspace TLS stack, a direct-external-IP, capOS-terminated ingress becomes a separately reviewed second option. At that point the certificate is aCertificateChaincap, the key is a sealedPrivateKeycap,CertificateStore.watchdrives rotation, and the load-balancer-terminated path becomes one deployment choice rather than the only buildable one. The bootstrap step does not foreclose the capability-native model; it precedes it. - Let’s Encrypt is successor-only until the remaining prerequisites land.
The
Certificates / TLS backlog now names the
landed local key-custody precursor (
PrivateKey/KeyVault/ developmentKeySource), landed TLS client over the userspaceTcpSocket, and landed local ACME account/order/finalize core. Remaining successor prerequisites are the capOS-terminated Web UI TLS endpoint, scopedhttp-01solver,CertificateStore.watchrenewal and rotation, and then the on-hold Let’s Encrypt direct-termination GCE proof. Local ACME proofs use a local Let’s Encrypt-compatible directory. A real GCE or Let’s Encrypt staging/production run additionally needs a controlled public DNS name and explicit billable/public-ingress and CA authorization. Raw key material must not be written to manifests, images, logs, or evidence.
This mirrors the trust-anchor bootstrap above: capOS ships a pragmatic, reviewed interim posture (here, provider-terminated TLS) and migrates to the capability-native model as the implementing subsystems land, rather than blocking the first public proof on the full stack.
Consumers
| Consumer | Uses |
|---|---|
| Web text shell gateway | TlsServerConfig + OcspStapler; cert from AcmeClient |
| Inter-service mTLS | TlsServerConfig + TlsClientConfig with private-PKI TrustStore |
| Outbound HTTPS clients (KMS, IMDS) | TlsClientConfig with WebPKI-strict verifier |
| WebAuthn attestation verification | CertVerifier.verifySignature with FIDO MDS TrustStore |
| Code signing verification | CertVerifier with codeSigning trust store + OCSP |
| Signed manifest verification | CertVerifier.verifySignature + pinned build-time root |
| CT mis-issuance monitoring | CtMonitor.watchSubject on capOS-owned domains |
Threat Model
Specific to this subsystem, independent of the crypto/key threat model:
- Bogus CA in the trust store. Compromise of any CA in the
trust store compromises every cert the verifier accepts.
Mitigations: restrict the trust store as narrowly as each consumer
permits (private-PKI services use a private-PKI-only store, not
WebPKI); require CT for
tlsServerAuth; enableCtMonitorfor capOS-owned subject patterns. - CT log compromise or collusion. A log signs a non-existent
certificate. Mitigations: require SCTs from multiple independent
logs (
minScts >= 2); enforce log list freshness (policy rejects SCTs from retired or disqualified logs); monitor STH inclusion proofs for capOS-issued certs. - OCSP responder compromise. The responder signs “good” for a
revoked cert. Mitigations: OCSP response signature chains back to
a trust anchor via the OCSP-signing EKU; short
nextUpdatewindows limit stale “good” responses; fail-closed whenrequireOcspis set. - Stapling stripping. A MITM strips OCSP staples between a
compliant server and the client. Mitigations: must-staple
extension on the server cert forces closed-fail; client-side
staplingRequiredpolicy extends this to all certs. - Name-constraint bypass. An intermediate CA issues for names
outside its constrained scope. Mitigations:
permitNameConstraintsalways on; verifier enforces name constraints before reporting success. - Pin brittleness. A pin prevents legitimate rotation,
locking out users. Mitigations: short pin expiries,
reportOnlymode for rollout, pins bound to SPKI (not to full certificates). - ACME challenge hijack. A challenge solver with excessive authority forges validation tokens. Mitigations: each solver is a scoped cap (one DNS zone, one HTTP path prefix, one ALPN slot); solvers are per-consumer, not shared.
- Revocation denial-of-service. An attacker saturates the OCSP responder, forcing soft-fail everywhere. Mitigations: OCSP stapling (server-side caching takes the responder off the hot path); CRL fallback under deployment policy only.
- Clock skew attacks. A client with a wrong clock accepts
expired certs or rejects valid ones. Mitigations:
clockSkewSecondshas a tight default; consumers requiring hard-fail use an attested time source (Cloud Metadata).
Phases
Phases follow the consumers that need this infrastructure.
Phase 1 — Certificate, CertificateChain, TrustStore, CertVerifier
- Add the schemas above to
schema/capos.capnp. - Implement a RAM-only trust store seeded from
webpki-roots. - Implement a
CertVerifierusingrustls-webpkifor path building and signature verification. - Host tests: chain verification against known-good and known-bad samples, name constraints, algorithm gating.
Phase 2 — TLS server and client configs
- Add
TlsServerConfigandTlsClientConfigschemas. - Wire a userspace TLS state machine into the networking stack as a
TlsSocketoverTcpSocket; defined in Networking. CertificateStorewith in-memory backing for the web shell gateway.
Phase 3 — ACME client and challenge solvers
AcmeClientcore speaking the local RFC 8555 account/order/finalize flow has landed incapos-tls; served capability wiring and public CA transport remain future.ChallengeSolverimplementations forhttp-01(against the web shell gateway’s HTTP listener) andtls-alpn-01.dns-01follows once a DNS cap exists.CertificateStore.watchsubscription drives TLS rotation without gateway restart.
Phase 4 — OCSP stapling
OcspResponder+OcspStaplerservices.- Must-staple enforcement in
CertVerifier. - Cached stapled responses refresh in the background.
Phase 5 — Certificate Transparency (submission + verification)
- SCT verification in
CertVerifier(both embedded and TLS-extension SCTs). CtLogclient for submission; ACME flows submit precertificates to required logs before handing the cert to the caller.- Chrome CT log list bundled and signed like the WebPKI bundle.
Phase 6 — CT monitoring
CtMonitorservice withwatchSubjectsubscriptions.- Observations flow to the audit cap (System Monitoring).
- Proof verification: STH signatures, inclusion proofs for capOS-issued certs, consistency proofs across STHs.
Phase 7 — Pinning
PinSetservice with enforce and report-only modes.- Per-consumer pin policy plumbing.
- Audit records on mismatch.
Phase 8 — CRL fallback and legacy compat
CrlStoreimplementation for code-signing flows that require CRL.- Policy knob to enable CRL fallback for OCSP-unreachable cases.
Phase 9 — Private CA
CertificateAuthoritycap for capOS-internal issuance (mTLS fleet bootstrapping without an external ACME dependency).- CA keys live in
KeyVaultwith strict seal policy. - Internal CT log (optional) for mis-issuance detection within a private fleet.
Relationship to Other Proposals
- Cryptography and Key Management
— supplies the key primitives this proposal consumes. Its minimal
PrivateKey/PublicKeyABI, RAM signing core, RAM-onlyKeyVaulthandle custody, and development-only softwareKeySourcebootstrap exist for the local TLS/ACME precursor. Persistence and production custody remain future. A TLS server’s key cap, an ACME account key, and an internal CA signing key all live in aKeyVaultsealed under aKeySource(typical choices:Tpm2KeySourcefor fleet mTLS identities,PasskeyPrfKeySourceorPassphraseKeySourcefor operator client-auth,CloudKmsKeySourcefor cloud-anchored CAs, development-only software sources for local ACME accounts). TLS certificate keys and ACME account JWS keys remain purpose-separated; the key proposal names that split asKeyPurpose.tlsandKeyPurpose.acmeAccount. - Networking — defines the
TcpSocketthis proposal wraps and the draftTlsSocketinterface that consumesTlsServerConfig/TlsClientConfig. In the proposal’s Phase A-B the socket state is kernel-resident smoltcp; the TLS stack consumes that cap from userspace and does not move into the kernel even before Phase C. mTLS between services uses this proposal’s verifier and trust store on top of that same cap. - OIDC and OAuth2 —
separate trust model (JWKS-signed bearer tokens, not X.509
chains). The two proposals meet only at the corners where OIDC
flows do bind to an X.509 cert:
private_key_jwtclient assertions andtls_client_auth/self_signed_tls_client_authOAuth2 client authentication consume aPrivateKeyfrom the key proposal plus aCertificate/CertificateChainfrom this one; workload-identity federation (RFC 8693) and outbound HTTPS to IdP/JWKS endpoints consume aTlsClientConfigwith awebPkiStrictverifier. Inbound bearer-token verification stays in the OIDC proposal. - SSH Shell Gateway — explicitly a
non-consumer. SSH uses raw host-key signatures and
TOFU/authorized-key trust, not WebPKI; the host key wraps a
PrivateKeyfrom the key proposal directly, not aCertificatefrom this one. SSH and TLS/mTLS coexist as the two operator-facing remote-shell paths: SSH for CA-free operator/agent access, TLS/mTLS (the web text shell gateway plus future Telnet-over-TLS paths) for PKI-integrated environments. - Boot to Shell — web
text shell gateway consumes
TlsServerConfig; ACME via this proposal provides the cert. WebAuthn attestation verification usesCertVerifier.verifySignature. - Cloud Metadata —
InstanceIdentityis often expressed as a signed JWT or X.509 certificate; verifying attestation statements usesCertVerifier. - Storage and Naming — Open Question #5 (manifest trust, secure boot) is the source of the build-time key that signs root-store and CT-log bundles.
- System Monitoring
— every
verifyChain,addAnchor, OCSP query, CT observation, and pin mismatch flows through the audit cap. - Security and Verification
— the certificate parser, path builder, and policy engine are top
targets for fuzzing and property testing. The landed client path uses
embedded-tlsplusrustls-webpki-backedcapos-tlsverification; capOS-specific policy glue (CT, stapling, pinning) gets its own tier of tooling. - User Identity and Policy
— client-auth certs and per-user mTLS identity consume
TlsClientConfigwith a per-sessionPrivateKeycap.
Open Questions
- Canonical default policy. Should
webPkiStrictrequire `minScts= 2` from day one, or is that too aggressive before CT log list curation ships? Chrome requires 2; Apple requires varying counts by cert lifetime. Probably match Chrome initially and revisit.
- CRL scope. Is CRL support worth the footprint at all, or should capOS ship OCSP-only and refuse to verify against CRL-only CAs? Leaning “CRL for code signing only”, not for TLS.
- Private CA surface. A
CertificateAuthoritycap withissue,revoke, andlistIssuedmethods is straightforward, but the policy for issuance (SAN constraints, lifetime caps) deserves its own schema pass. Deferred to Phase 9. - Trust-store delta signing. Signing every bundle replacement is expensive. A delta format (add/remove anchors with signed manifest patches) would be lighter; worth it only once bundle churn becomes a real operational cost.
- OCSP nonce support. Nonces prevent replay but most responders do not honor them. Ship without and revisit if a deployment needs replay-resistance.
webpki-rootscrate churn. The crate publishes a new version on Mozilla NSS changes, which is frequent. capOS needs a clean bump story — probably “new release triggers a trust-store bundle rebuild”, automated in CI.- Stapling cache persistence. Must the
OcspStaplercache survive reboot? Surviving reboot avoids a refresh storm at startup but risks serving very stale responses. Probably: cache is per-boot, with a short pre-refresh window beforenextUpdate. - Client-cert private key reuse. If a client uses one mTLS
identity across many outbound connections, does each
TlsClientConfighold its ownPrivateKeycap (wasteful) or share one (safe, since the cap’ssignmethod is the only surface)? Probably share by default; make duplication explicit if needed. - Integration with
CredentialStore. Some WebAuthn authenticators return attestation certs that must be chain-verified against FIDO MDS. The verification usesCertVerifier; the MDS trust store is aTrustStoremaintained separately from the WebPKI bundle. How does MDS update cadence fit the no-background-update policy? Probably: MDS updates ride manifest updates, same as root bundles. - GOST trust chains. The
Formal MAC/MIC GOST
track implies GOST-signed certificate chains. The
CertVerifieralgorithm enum is already open-ended; the work is algorithm implementation, not schema evolution.
Proposal: OIDC and OAuth2
Capability-native abstractions for OpenID Connect identity providers, OAuth 2.0 clients, issued tokens, workload identity federation, and the authentication/authorization flows that every modern cloud and enterprise deployment depends on.
Why a Separate Proposal
OIDC and OAuth2 are related but distinct from certificates and keys. Keys (Cryptography and Key Management) are secret material. Certificates (Certificates and TLS) are public assertions of identity binding validated against a PKI trust store. OIDC/OAuth2 is a delegated authority protocol family: tokens are short-lived bearer credentials or proof-of-possession handles issued by an identity provider after authenticating a subject, scoped by a set of permissions, and consumed by a relying party or resource server.
The failure modes barely overlap. Key compromise vs. IdP compromise vs. CA mis-issuance require different detection and recovery stories. Revoking a TLS certificate, revoking an access token, and revoking a KEK are three different operations with three different operational tempos (manifest update / IdP admin action / KMS grant edit).
Putting this in a separate proposal also matches the cross-cutting
nature of the feature: OIDC/OAuth2 shows up in login
(Boot to Shell), session
state (User Identity and Policy),
key unlock (Volume Encryption),
cloud KMS access (Cryptography and Key Management
CloudKmsKeySource), and service-to-service authentication
(Networking). Threading it
through every touchpoint without a shared definition would be a
silo-per-consumer repeat of the Linux ssh-agent/gpg-agent/keyctl
story.
Problem
capOS needs to:
- Accept federated authentication for console and web-terminal login (corporate IdP, Google, GitHub, Okta, Azure AD, Keycloak, Dex) so sessions do not depend on capOS storing or managing primary user credentials.
- Run as a cloud workload without baked-in long-lived IAM credentials, using the modern workload identity federation pattern (RFC 8693 token exchange) to obtain short-lived cloud provider credentials from an attested instance identity or a local keypair.
- Authenticate service-to-service calls using OAuth2 client
credentials,
private_key_jwt, DPoP, or mTLS, chosen by policy rather than hard-wired per consumer. - Consume OAuth2 access tokens from external clients (an HTTP API running under capOS verifies bearer tokens against an issuer’s JWKS) without every service writing its own JWT parser.
- Expose scopes and OIDC claims as policy input to
AuthorityBroker/PolicyEnginewithout letting them act as ambient authority. - Map external subjects to local principals, accounts, sessions, policy profiles, and resource profiles through explicit admission configuration rather than treating provider claims as local roles.
Without a shared abstraction each consumer would invent its own JWT parser, JWKS cache, issuer list, discovery-document fetcher, refresh scheduler, and token storage. That is roughly the OAuth/OIDC mess already visible in most operating systems and app stacks.
Scope
In scope:
- Relying-party (RP) role for OIDC. capOS is the RP; the IdP is external.
- Client role for OAuth2. capOS acts as confidential client, public client (PKCE), or federated workload client.
- Resource-server role for OAuth2. A capOS service can validate inbound bearer tokens.
- Token capability objects for ID tokens, access tokens, refresh tokens, DPoP proofs, and client assertions.
- Integration with
SessionManager,CredentialStore,AuthorityBroker,CloudKmsKeySource,EncryptedBlockDeviceunlock flows.
Out of scope for v1:
- OAuth2 authorization server / OIDC provider role. capOS does not
issue tokens to third parties in the first iteration. A
LocalIdentityProviderthat issues tokens to other capOS services is possible later work; it sits on top of the same primitives. - SAML. The modern direction is OIDC; a deployment that needs SAML
can add a second
SessionManager.loginadapter without reshaping the model. - OAuth 1.0a. Dead.
- UMA 2.0. Out until a concrete consumer appears.
- CIBA (Client-Initiated Backchannel Authentication). Useful for step-up on mobile devices; revisit once the web shell gateway ships.
Design Principle: Tokens Are Typed Capabilities
In the OAuth/JWT world a token is a byte string. Possession equals authority. Every library re-parses, re-validates, and re-caches the token; every log line risks leaking it; every service that needs to forward it must hand it over unattenuated. That is the same architectural failure mode as “a key is a byte string” — the protection mechanism (TLS in transit, DPoP, audience binding) is orthogonal to the system’s main abstraction.
In capOS, a token is a capability object. Holding an AccessToken cap
means “you may present this token for one outbound request, read a
bounded subset of its claims, or exchange it for a more specific
token.” The raw token bytes live in the address space of the
OAuth/OIDC service; callers reach them by invoking typed methods.
Consequences mirror the key-cap case:
- Attenuation by scope. A caller that only needs to read
subreceives aTokenClaimsfacet that does not expose the raw JWT. A caller that only needs to call one resource server receives aBoundTokenthat rejects use against other audiences. - Revocation is a cap drop plus server-side revocation. Dropping the cap prevents that holder from using the token; revoking the token at the IdP prevents any holder from using it. Both paths exist; neither requires a kernel mechanism.
- Audit is intrinsic. Every token present, every refresh, every token exchange flows through the audit cap. Bearer-token leakage to logs becomes harder because the raw string is never returned.
- Composition, not configuration. An OAuth client is a cap that encapsulates client_id, client authentication method, allowed grants, scopes, and target IdP. Building one means composing caps, not stringifying URLs into config files.
- Per-consumer issuance by default. When a service needs a token
for a downstream call, it asks its
OAuthClientcap; the OAuth service issues a per-consumer down-scoped token. Transfer between processes is possible but explicit.
This also gives capOS a natural story for the agent shell’s
“agent never sees secrets” rule: the agent holds an ApprovalGrant
cap that internally holds the access token; the agent invokes the
wrapped resource server cap; the token never appears as data in
the model’s prompt window.
Schemas
Identity provider
interface OidcIdentityProvider {
# The stable issuer URL (RFC 8414 / OIDC Discovery).
issuer @0 () -> (url :Text);
# Discovery document (cached; refreshed on a schedule or on
# signature-verification failure). Returning the parsed metadata,
# not raw JSON, forces the implementation to validate it once.
metadata @1 () -> (meta :OidcProviderMetadata);
# JWKS fetched from the provider's `jwks_uri`, exposed as a set of
# PublicKey caps keyed by `kid`. Key rotation is invisible to
# callers: they ask for a `kid`, they get a current PublicKey or
# an error.
jwks @2 () -> (set :Jwks);
# Verify an ID token fully — signature against jwks, issuer match,
# audience match against the registered client, nonce, exp/nbf/iat
# with clockSkewSeconds from policy, and required `acr`/`amr`
# predicates if set by the OAuthClient.
verifyIdToken @3 (jwt :Data,
policy :IdTokenPolicy)
-> (claims :IdTokenClaims);
}
struct OidcProviderMetadata {
issuer @0 :Text;
authorizationEndpoint @1 :Text;
tokenEndpoint @2 :Text;
userinfoEndpoint @3 :Text;
jwksUri @4 :Text;
endSessionEndpoint @5 :Text;
deviceAuthorizationEndpoint @6 :Text;
revocationEndpoint @7 :Text;
introspectionEndpoint @8 :Text;
responseTypesSupported @9 :List(Text);
grantTypesSupported @10 :List(Text);
tokenEndpointAuthMethodsSupported @11 :List(Text);
scopesSupported @12 :List(Text);
idTokenSigningAlgValuesSupported @13 :List(Text);
codeChallengeMethodsSupported @14 :List(Text);
dpopSigningAlgValuesSupported @15 :List(Text);
requestObjectSigningAlgValuesSupported @16 :List(Text);
# Present when the IdP advertises OAuth 2.0 Token Exchange (RFC 8693).
tokenExchangeSupported @17 :Bool;
}
struct IdTokenPolicy {
expectedAudience @0 :Text; # registered client_id
expectedAzp @1 :Text; # authorized party, if set
requiredAcr @2 :List(Text); # any-of
requiredAmr @3 :List(Text); # any-of
maxAgeSeconds @4 :UInt32; # per OIDC core `max_age`
clockSkewSeconds @5 :UInt32;
nonceMustMatch @6 :Data; # empty = no nonce check
requireAtHashMatch @7 :Bool; # if accessToken present
requireCHashMatch @8 :Bool; # if code present
}
struct IdTokenClaims {
issuer @0 :Text;
subject @1 :Text;
audience @2 :List(Text);
issuedAt @3 :Int64;
expiresAt @4 :Int64;
notBefore @5 :Int64;
nonce @6 :Data;
acr @7 :Text;
amr @8 :List(Text);
azp @9 :Text;
authTime @10 :Int64;
email @11 :Text;
emailVerified @12 :Bool;
preferredUsername @13 :Text;
name @14 :Text;
groups @15 :List(Text);
# Opaque claim map for everything else (profile-specific fields,
# custom claims). Values are JSON-encoded bytes.
additional @16 :List(NamedBlob);
}
struct NamedBlob {
name @0 :Text;
value @1 :Data;
}
IdTokenClaims is intentionally read-only metadata. Possessing a
serialized copy must not grant authority. The durable external subject key is
subjectHash = hash(providerKind, issuer, tenant, subject); admission policy
maps it to a local principal, account, and profiles before a UserSession is
minted.
External identity admission
OIDC authentication produces verified claims. It does not by itself create a
local account, select local roles, or grant capabilities. After
verifyIdToken succeeds, SessionManager resolves the external subject
through one of the identity proposal’s admission sources:
- a manifest-seeded external admission rule for bootstrap, recovery, or early console login before durable storage exists;
- a local account-store
ExternalIdentityBindingthat mapshash(providerKind, issuer, tenant, subject)to an existing local principal; or - an explicit auto-creation rule that creates a pseudonymous or tenant-scoped account with named policy and resource profiles.
The binding shape belongs with the account model, but OIDC consumers depend on its semantics:
struct ExternalIdentityBinding {
bindingId @0 :Data;
provider @1 :Text; # OIDC issuer or configured provider name
subjectHash @2 :Data; # hash(provider kind, issuer, tenant, subject)
principalId @3 :Data; # local or pseudonymous principal
tenant @4 :Text;
acceptedClaims @5 :List(Text);
expiresAtMs @6 :UInt64;
policyProfile @7 :ProfileRef;
resourceProfile @8 :ProfileRef;
schemaVersion @9 :UInt32;
storeEpoch @10 :UInt64;
recordVersion @11 :UInt64;
policyEpoch @12 :UInt64;
previousHash @13 :Data;
contentHash @14 :Data;
}
OIDC groups, roles, acr, amr, tenant IDs, device posture, source
network, and token age are normalized ABAC inputs. A binding rule may map a
provider group to a local role only for a named provider/tenant, expiry, and
policy version. Imported claims are discarded or refreshed when stale, and
roles selected from them remain broker inputs rather than authority.
An external session receives durable storage only when a binding or auto-creation rule maps it to a local principal and a resource profile. Without that mapping, the session is guest, anonymous, or one-shot pseudonymous policy with narrow temporary resources.
OAuth client
interface OAuthClient {
# Bound configuration.
info @0 () -> (meta :OAuthClientMetadata);
# Authorization Code + PKCE (OAuth 2.1 default). The caller owns
# the redirect side-channel; this cap drives the token exchange
# after the code returns.
startAuthCode @1 (requested :TokenRequest)
-> (authUrl :Text, state :AuthCodeState);
completeAuthCode @2 (state :AuthCodeState,
code :Data)
-> (bundle :TokenBundle);
# Device Authorization Grant (RFC 8628). Appropriate for serial
# consoles, embedded displays, and TVs — anywhere the capOS
# process has no browser.
startDeviceCode @3 (requested :TokenRequest)
-> (userCode :Text,
verificationUri :Text,
verificationUriComplete :Text,
expiresIn :UInt32,
interval :UInt32,
state :DeviceCodeState);
pollDeviceCode @4 (state :DeviceCodeState)
-> (outcome :DeviceCodePoll);
# Client Credentials (RFC 6749 §4.4). Backend-to-backend; the
# caller principal is the client itself.
clientCredentials @5 (requested :TokenRequest)
-> (bundle :TokenBundle);
# Refresh an existing bundle (RFC 6749 §6). Fails if the stored
# refresh token is expired or revoked.
refresh @6 (token :RefreshToken,
requested :TokenRequest)
-> (bundle :TokenBundle);
# JWT Bearer (RFC 7523). Caller presents a signed assertion about
# a subject; issuer returns a token for that subject. Used for
# service delegation and some IdP federation flows.
jwtBearer @7 (assertion :Data,
requested :TokenRequest)
-> (bundle :TokenBundle);
# Token Exchange (RFC 8693). Foundation of modern workload
# identity federation: exchange a subject token (e.g. a signed
# instance-identity JWT or an attestation report envelope) for
# an access token at the remote issuer.
tokenExchange @8 (subjectToken :Data,
subjectTokenType :Text, # RFC 8693 §2.1
actorToken :Data,
actorTokenType :Text,
requested :TokenRequest)
-> (bundle :TokenBundle);
# Revoke (RFC 7009). Best-effort; not all IdPs honor it.
revoke @9 (token :TokenRef, reason :Text) -> ();
}
struct OAuthClientMetadata {
clientId @0 :Text;
issuer @1 :Text;
authMethod @2 :ClientAuthMethod;
defaultScopes @3 :List(Text);
defaultAudience @4 :Text;
redirectUris @5 :List(Text);
dpopRequired @6 :Bool;
pkceRequired @7 :Bool; # true for public clients
}
enum ClientAuthMethod {
none @0; # public client with PKCE
clientSecretBasic @1; # HTTP Basic; confidential clients
clientSecretPost @2; # form-encoded; legacy
privateKeyJwt @3; # RFC 7523 §2.2 — JwtSigner over a PrivateKey with KeyPurpose.oauthClientAssertion;
# when the IdP requires a bound X.509 cert the signer carries a Certificate cap
# from certificates-and-tls-proposal.md
tlsClientAuth @4; # RFC 8705 — TlsClientConfig with a client Certificate + PrivateKey from
# certificates-and-tls-proposal.md (PKI-rooted) and key-management
selfSignedTlsClientAuth @5; # RFC 8705 §2.2 — same shape with a self-signed Certificate published in
# OAuthClientMetadata, no PKI chain required
}
struct TokenRequest {
scopes @0 :List(Text);
audience @1 :Text;
resource @2 :List(Text); # RFC 8707
acrValues @3 :List(Text);
maxAgeSeconds @4 :UInt32;
prompt @5 :List(Text); # "login", "consent", "select_account", "none"
loginHint @6 :Text;
nonce @7 :Data; # empty = generate fresh
requestedExpirySeconds @8 :UInt32; # hint; IdP has final say
dpopKey @9 :PrivateKey; # optional DPoP binding
extraParams @10 :List(NamedBlob);
}
struct AuthCodeState {
opaque @0 :Data; # server-held state; PKCE verifier, nonce, etc.
}
struct DeviceCodeState {
opaque @0 :Data;
}
struct DeviceCodePoll {
union {
pending @0 :Void;
slowDown @1 :Void;
expired @2 :Void;
denied @3 :Void;
granted @4 :TokenBundle;
}
}
Tokens
interface AccessToken {
# Claims view (parsed once; opaque tokens return empty claims and
# let the IdP's introspection endpoint be the source of truth).
claims @0 () -> (claims :TokenClaims);
# Present the token for a single outbound HTTP request. The token
# service inserts the Authorization / DPoP headers into the
# outbound request built by the caller. Raw bytes do not leave
# the token service through this path.
authorize @1 (request :OutboundHttpRequest)
-> (prepared :OutboundHttpRequest);
# Down-scope to a narrower scope set or audience. Fails if the
# requested scopes are not a subset of the current token's scopes.
# Implementation can either return a wrapper cap that performs
# client-side attenuation (for simple bearer tokens) or call the
# IdP's token-exchange endpoint (for cross-audience narrowing).
attenuate @2 (scopes :List(Text),
audience :Text)
-> (narrower :AccessToken);
# Explicit export for rare cases where the caller truly needs the
# raw token (e.g. a capOS HTTP client that has no token-aware
# stack). Emits an audit event naming the reason. Excluded by
# default from attenuated caps returned by `attenuate`.
exportRaw @3 (reason :Text) -> (bytes :Data);
# Token reference for revocation, introspection, or logging. Always
# a hash or opaque ID; never the raw token.
reference @4 () -> (ref :TokenRef);
# Expiry information for client-side backoff and pre-refresh.
expiry @5 () -> (notBefore :Int64, notAfter :Int64);
}
interface RefreshToken {
# Refresh tokens are always longer-lived secrets. They do not
# expose an `authorize` path — their only use is through
# OAuthClient.refresh. `reference` and `expiry` match AccessToken.
reference @0 () -> (ref :TokenRef);
expiry @1 () -> (notBefore :Int64, notAfter :Int64);
# Export is available but guarded and audited; used for migration
# between token stores, not for ordinary operation.
exportRaw @2 (reason :Text) -> (bytes :Data);
}
interface IdToken {
claims @0 () -> (claims :IdTokenClaims);
raw @1 (reason :Text) -> (bytes :Data);
}
struct TokenBundle {
access @0 :AccessToken;
refresh @1 :RefreshToken; # may be null
id @2 :IdToken; # may be null
expiresIn @3 :UInt32;
tokenType @4 :Text; # "Bearer" or "DPoP"
scopes @5 :List(Text);
resource @6 :List(Text);
}
struct TokenClaims {
issuer @0 :Text;
subject @1 :Text;
audience @2 :List(Text);
scope @3 :List(Text);
clientId @4 :Text;
issuedAt @5 :Int64;
expiresAt @6 :Int64;
notBefore @7 :Int64;
jwtId @8 :Text;
# Confirmation (cnf, RFC 7800) for proof-of-possession tokens.
confirmation @9 :TokenConfirmation;
additional @10 :List(NamedBlob);
}
struct TokenConfirmation {
union {
none @0 :Void;
jkt @1 :Data; # DPoP: thumbprint of holder public key
x5tS256 @2 :Data; # mTLS client cert thumbprint
}
}
struct TokenRef {
kind @0 :TokenRefKind;
value @1 :Data; # hash, jti, or opaque server ref
}
enum TokenRefKind {
jti @0; # JWT `jti`
sha256 @1; # SHA-256 of the raw token
serverId @2; # opaque IdP-side identifier
}
struct OutboundHttpRequest {
method @0 :Text;
url @1 :Text;
headers @2 :List(NamedBlob);
body @3 :Data;
}
Token verifier (resource-server side)
interface TokenVerifier {
# Validate an inbound bearer token: signature against the
# provider's JWKS (or introspection endpoint for opaque tokens),
# issuer, audience, expiry, required scopes, confirmation claim
# (DPoP proof or mTLS peer cert).
verifyAccess @0 (token :Data,
policy :TokenVerificationPolicy,
proof :VerificationProof)
-> (outcome :TokenVerificationOutcome);
}
struct TokenVerificationPolicy {
expectedIssuer @0 :Text;
expectedAudience @1 :Text;
requiredScopes @2 :List(Text); # all-of
anyRequiredScopes @3 :List(Text); # any-of if non-empty
clockSkewSeconds @4 :UInt32;
requireConfirmation @5 :ConfirmationKind; # none / dpop / mtls
allowedAlgorithms @6 :List(SignatureScheme);
allowIntrospection @7 :Bool; # opaque token support
}
enum ConfirmationKind {
none @0;
dpop @1;
mtls @2;
}
struct VerificationProof {
union {
none @0 :Void;
dpopProof @1 :DpopProof;
peerCert @2 :Certificate;
}
}
struct DpopProof {
jwt @0 :Data; # the DPoP header value
httpMethod @1 :Text;
httpUrl @2 :Text;
nonce @3 :Data; # server-issued DPoP nonce (RFC 9449 §8)
}
struct TokenVerificationOutcome {
union {
valid @0 :ValidToken;
invalid @1 :TokenVerificationFailure;
}
}
struct ValidToken {
claims @0 :TokenClaims;
algorithm @1 :SignatureScheme;
keyId @2 :Text;
}
struct TokenVerificationFailure {
reason @0 :TokenFailureReason;
detail @1 :Text;
}
enum TokenFailureReason {
badSignature @0;
unknownKeyId @1;
unexpectedIssuer @2;
audienceMismatch @3;
expired @4;
notYetValid @5;
insufficientScopes @6;
missingConfirmation @7;
dpopMismatch @8;
mtlsThumbprintMismatch @9;
revoked @10;
malformed @11;
introspectionDenied @12;
weakAlgorithm @13;
}
JWKS
interface Jwks {
# Public keys exposed as PublicKey caps keyed by `kid`. Key
# rotation is invisible to callers: they ask for a `kid`, they
# get a current PublicKey or `unknownKeyId`.
keyById @0 (kid :Text) -> (key :PublicKey);
# Enumerate keys (for diagnostics and admin). Returns metadata
# only; PublicKey caps come from keyById.
listKeyMeta @1 () -> (keys :List(JwkMeta));
# Monotonic version bumped on every refresh that changes the key
# set; consumers cache validated signatures by (kid, version).
version @2 () -> (n :UInt64);
# Force a refresh. Audited. Called automatically on
# verification-time `unknownKeyId`; exposed for admin use.
refresh @3 () -> ();
}
struct JwkMeta {
kid @0 :Text;
algorithm @1 :AsymmetricAlgorithm;
use @2 :Text; # "sig" or "enc"
scheme @3 :SignatureScheme;
createdAt @4 :Int64;
}
Workload identity federation
interface WorkloadIdentityFederation {
# Produce a fresh remote access token by exchanging a local
# subject token (instance-identity JWT, attestation report,
# Kubernetes projected token, GitHub Actions OIDC token, ...)
# at the remote issuer per RFC 8693.
exchange @0 (requested :TokenRequest)
-> (bundle :TokenBundle);
info @1 () -> (meta :WorkloadFederationMeta);
}
struct WorkloadFederationMeta {
remoteIssuer @0 :Text;
audience @1 :Text;
subjectSource @2 :SubjectSource;
allowedScopes @3 :List(Text);
minRefreshInterval @4 :UInt32;
}
enum SubjectSource {
instanceIdentityJwt @0; # from CloudMetadata.InstanceIdentity
attestationReport @1; # SEV-SNP / TDX / Nitro
projectedServiceAccount @2; # e.g. Kubernetes projected token
githubActionsOidc @3; # ci/cd trust anchor
localPrivateKeyJwt @4; # private_key_jwt against remote STS
}
DPoP
DPoP (RFC 9449) binds a token to a client-held key. The binding is
expressed as a TokenConfirmation.jkt on the access token plus a
short-lived proof JWT per outbound request.
interface DpopSigner {
# Returns a fresh DPoP proof JWT for an outbound request. The
# signer holds the PrivateKey; the access token is `jkt`-bound to
# that key's thumbprint.
newProof @0 (method :Text,
url :Text,
accessTokenHash :Data,
serverNonce :Data)
-> (proofJwt :Data);
publicKey @1 () -> (pk :PublicKey);
}
A deployment requiring DPoP composes AccessToken + DpopSigner in
a wrapper cap so ordinary callers just invoke authorize and receive
a request with both Authorization: DPoP ... and DPoP: headers set.
JWT wrappers over key-management primitives
These live here rather than in
Cryptography and Key Management
because JWT is a protocol frame, not a crypto primitive. JwtSigner and
JwtVerifier are thin adapters over PrivateKey / PublicKey caps issued by
a KeyVault sealed under a KeySource from that proposal. The signing key’s
KeyPurpose is oauthClientAssertion for private_key_jwt client
authentication or oidcIdToken for a LocalIdentityProvider mint path;
KeyVault refuses to bind a key whose declared purpose set does not include
the JWT use, so a key minted for TLS server identity cannot silently sign
client assertions.
interface JwtSigner {
# Sign a compact-serialized JWT. The key lives in a KeyVault;
# JwtSigner is the schema-aware wrapper.
sign @0 (header :JwtHeader, claims :Data) -> (jwt :Data);
publicKey @1 () -> (pk :PublicKey);
keyId @2 () -> (kid :Text);
}
interface JwtVerifier {
verify @0 (jwt :Data, policy :JwtVerifyPolicy)
-> (outcome :JwtVerifyOutcome);
}
struct JwtHeader {
algorithm @0 :SignatureScheme;
keyId @1 :Text;
type @2 :Text; # "JWT", "at+jwt", "dpop+jwt"
contentType @3 :Text;
}
struct JwtVerifyPolicy {
expectedIssuer @0 :Text;
expectedAudience @1 :Text;
expectedType @2 :Text;
allowedAlgorithms @3 :List(SignatureScheme);
clockSkewSeconds @4 :UInt32;
jwksSource @5 :Jwks; # preferred
staticKey @6 :PublicKey; # for single-signer cases
}
struct JwtVerifyOutcome {
union {
valid @0 :JwtValid;
invalid @1 :JwtInvalid;
}
}
struct JwtValid {
header @0 :JwtHeader;
claims @1 :Data;
}
struct JwtInvalid {
reason @0 :TokenFailureReason;
detail @1 :Text;
}
SignatureScheme and AsymmetricAlgorithm are reused from
Cryptography and Key Management;
adding ps256/ps384/ps512 and es256/es384/es512 aliases there
covers the JWT algorithm registry without duplicating the enum here. Token
verifiers receive the allow-list as a List(SignatureScheme) from that
proposal so a deployment can refuse HS* family algorithms uniformly across
JWT, TLS, and signing without a parallel OIDC-specific configuration knob.
Grant Types in Detail
Authorization Code + PKCE
Used by the web text shell gateway and any capOS native app with a
browser. PKCE is mandatory (OAuth 2.1); code_challenge_method = S256.
The AuthCodeState.opaque blob carries the PKCE verifier, nonce, and
original TokenRequest; callers do not see or store these
values. Redirect URIs are validated against OAuthClientMetadata
exactly, no partial matching.
Device Authorization Grant (RFC 8628)
Used on serial consoles and other no-browser surfaces. The console
prints verification_uri and user_code; the user completes the
flow in a separate device’s browser; the console polls the token
endpoint. pollDeviceCode honors the slow_down response and caps
its polling rate at the IdP-advertised interval. Expiration is a
hard fail; the console must restart the flow.
This is the primary OIDC path for boot-to-shell on headless hosts and for interactive cloud-VM serial consoles.
Client Credentials
Used for backend-to-backend service identity. The calling service
holds an OAuthClient cap configured with privateKeyJwt or
tlsClientAuth. No user is involved; the subject of the issued token
is the client itself.
Refresh
Used to rotate a short-lived access token without re-authenticating
the user. Refresh tokens are long-lived secrets and live in the
RefreshToken cap; they never appear as bytes in session state.
Rotated refresh tokens (IdPs that issue a new refresh token on every
refresh) are installed into the same cap transparently.
JWT Bearer (RFC 7523)
Used for federation between systems that trust a common signing key.
A capOS service holding a JwtSigner can mint an assertion identifying
a subject and exchange it at the IdP for a token acting on behalf of
that subject. Used sparingly; the delegation implication is strong.
Token Exchange (RFC 8693)
The foundation of modern workload identity federation. Described in
its own subsystem (WorkloadIdentityFederation) because the
subject-token source is platform-specific. Concrete mappings:
- AWS IRSA / IAM OIDC provider: subject token is a Kubernetes
projected service-account JWT; IdP is
sts.amazonaws.com;AssumeRoleWithWebIdentityreturns AWS-scoped credentials. - GCP Workload Identity Federation: subject token is an
InstanceIdentityJWT from the GCE metadata service or a Kubernetes projected token; IdP issts.googleapis.com; returned token is usable against GCP APIs including Cloud KMS. - Azure federated identity credentials: subject token is an OIDC token from a trusted IdP (GitHub, GitLab, Kubernetes, another capOS instance); IdP is Azure AD; returned token is a standard AAD access token.
- SEV-SNP / TDX / Nitro attestation: subject token is the attestation report envelope; IdP is the cloud KMS or a standalone attestation verifier; returned token authorizes a KMS Decrypt against an attestation-policy-gated KEK.
In every case the capOS image contains no long-lived credentials. The
boot path produces a local InstanceIdentity cap, passes it to a
WorkloadIdentityFederation configured for the target cloud, and
receives short-lived tokens that KMS and other services accept.
Trust Bootstrap
An OidcIdentityProvider cap is created by a trusted service (the
OAuth service) from a provider configuration record. The record
includes:
- The canonical issuer URL.
- One of:
- a fixed JWKS snapshot baked into the manifest (for air-gapped or hermetic deployments), or
- the discovery URL plus one or more pinned root certs / SPKI
hashes for the TLS connection that fetches discovery and JWKS
(via
TlsClientConfigandPinSetfrom the certificates proposal).
- Acceptable algorithms (
allowedAlgorithmsin policy). - Minimum token lifetime and maximum clock skew.
- Whether the IdP advertises token exchange, DPoP, and PAR (pushed authorization requests).
- Client registrations allowed to use this IdP.
The trust root for OIDC verification is ultimately the TLS trust
chain back to a certificate authority plus the discovery document’s
signing policy. OidcIdentityProvider therefore depends on a
TlsClientConfig from
Certificates and TLS, not on
raw sockets, and IdP pinning composes with a PinSet cap from the same
proposal so that issuer-specific roots and SPKI hashes never share state with
the ambient WebPKI trust store. Issuer URL mismatches and JWKS failures are
hard errors; neither falls back to unauthenticated HTTP.
For enterprise IdPs that rotate signing keys frequently, Jwks
caches keys with a short TTL and refreshes on verification-time
unknownKeyId. A deployment that wants to forbid automatic refresh
(to pin a specific key set) configures Jwks with refresh disabled;
key rotation then requires a manifest update.
Authentication Strength Mapping
SessionInfo.authStrength from user-identity uses X.1254 LoA tiers.
OIDC acr/amr claims map as follows (deployment-configurable):
loa1— self-asserted,amrincludespwdonly,acrabsent or low.loa2— single-factor,amrincludespwd,pin, oremail-link style.loa3— multi-factor with a hardware-backed credential:amrcontainshwk,swkwith device attestation,face+pwd,fpt+pwd, or equivalent;acrtypically names a named MFA policy.loa4— high-assurance, typically requires identity proofing plus tamper-resistant hardware:amrcontainspopwith attested device,hwkplusface, in-person proofing claims, or vendor- specific high-assuranceacrvalues.
The mapping table lives in OAuthClientMetadata policy, not hard-
coded. loa0 (anonymous) is capOS-specific and has no matching OIDC
claim; anonymous sessions do not use OIDC.
Consumers
| Consumer | Uses |
|---|---|
| Console login (device code) | OAuthClient.startDeviceCode + verifyIdToken |
| Web text shell login | OAuthClient.startAuthCode + verifyIdToken; TLS from certs |
| Cloud KMS access (no baked creds) | WorkloadIdentityFederation.exchange → AccessToken.authorize |
CloudKmsKeySource unlock | Wraps AccessToken.authorize; no ambient cloud credentials |
| Service-to-service outbound HTTP | OAuthClient.clientCredentials + AccessToken.authorize |
| Inbound API token validation | TokenVerifier.verifyAccess |
Per-user EncryptedNamespace | OidcFederatedKeySource derives KEK from user’s AccessToken |
| Audit / telemetry export | Service identity via client credentials + DPoP |
| CI/CD runtime trust | WorkloadIdentityFederation from GitHub Actions OIDC |
Threat Model
Specific to this subsystem:
- Token leakage via logs. The classic OAuth failure mode. Raw
tokens never leave the OAuth service through claims, references,
or audit records.
exportRawis the only escape hatch and is audited with a mandatory reason string. - Refresh token theft. Refresh tokens are long-lived secrets. Mitigations: storage in the same service that holds the cap (not in session state, not in cookies readable by shells), optional rotation on refresh, revocation on logout.
- Replay of bearer tokens. A stolen bearer token is usable
until expiry. Mitigations: short TTLs; require DPoP
(
ConfirmationKind.dpop) for sensitive resources; mTLS for service-to-service. Nonce-bound DPoP proofs (RFC 9449 §8) with server-issued nonces where the resource server supports them. - Mixed-IdP confusion. A token issued by IdP A is presented to
a verifier expecting IdP B. Mitigations: strict
expectedIssuermatch; audience binding; IdP-specificOAuthClientcaps so services cannot confuse two OIDC providers; RFC 9207issparameter on authorization responses verified before the token exchange. - Discovery-document tampering. An attacker on the TLS path returns a forged discovery document or JWKS. Mitigations: pinned TLS roots or SPKI hashes per IdP; JWKS fetched over the same pinned TLS client; signature algorithm allow-list rejects downgrades; manifest-defined acceptable discovery URL prevents runtime redirect to attacker IdP.
- PKCE downgrade. A public client accepts a token without
proving possession of the code verifier. Mitigations: PKCE is
mandatory (
pkceRequired = trueis not a bit the caller can clear from a derived cap);code_challenge_method = S256only. - Authorization code replay. A leaked code is redeemed by an attacker. Mitigations: PKCE binds the code to the verifier the browser holds; codes are single-use; redirect URI exact match.
- Open redirector via redirect URI. Mitigations:
exact-match redirect URIs per registration; no substring
matching; validated at both
startAuthCodeandcompleteAuthCode. - Cross-site request forgery on the authorization request.
Mitigations:
stateparameter generated fromEntropySource, stored inAuthCodeState.opaque, checked on completion; PKCE adds a second CSRF-resistant binding. - OIDC
nonceomission. Missingnonceon the ID token allows replay of an ID token from another session. Mitigations:IdTokenPolicy.nonceMustMatchis mandatory for interactive logins;verifyIdTokenrefuses an ID token whosenoncedoes not match the one baked intoAuthCodeState.opaque. - Mis-issued
subclaim. An IdP reuses asubacross tenants or rebinds it. Mitigations: the external subject key includes provider kind, issuer, normalized tenant, and subject; it is neversubalone. Tenant-scoped IdPs (Azure AD per-tenant, Google Workspace) still record the tenant explicitly before hashing. - JWKS flooding. An attacker forces repeated
unknownKeyIdfailures to trigger JWKS refreshes. Mitigations: refresh rate- limited perJwkscap; audit events recorded; repeated failures fail closed rather than refresh-in-loop. - Token exchange policy evasion. An attacker with a narrow
subject token exchanges it for a broader one. Mitigations: the
remote issuer enforces its own policy on token exchange; capOS
cannot prevent a misconfigured STS. Defense is to pin
WorkloadFederationMeta.remoteIssuerand inspect returned scopes againstallowedScopes. - Clock skew attacks. Old tokens accepted, new tokens
rejected. Mitigations:
clockSkewSecondsis small by default; consumers that require hard bounds use an attested time source.
Security Verification in ITU-T and GOST Terms
OIDC and OAuth2 are IETF/OpenID protocols, but capOS’s broader security vocabulary is ITU-T/ISO-IEC plus, where a deployment requires it, GOST. The same verification surface this proposal defines (signature checks on JWTs, discovery and JWKS integrity, token binding, claim validation, scope enforcement, auth strength, logout semantics) maps onto those frameworks cleanly; it does not require a parallel vocabulary.
ITU-T X.805 security dimensions
X.805 (“Security architecture for systems providing end-to-end communications”) decomposes security into eight dimensions. The relevant mapping for the OIDC subsystem:
| X.805 dimension | Where it lives |
|---|---|
| Access control | AuthorityBroker + CapObject::call (ADF/AEF per X.812); scopes/claims as ABAC input |
| Authentication | OidcIdentityProvider.verifyIdToken; TokenVerifier.verifyAccess |
| Non-repudiation | Signed JWTs + audit records of verifyIdToken, issuance, exchange, revocation |
| Data confidentiality | TLS for every IdP call; encrypted audit payloads where applicable |
| Communication security | TlsClientConfig from the certificates proposal; issuer-pinned roots; HSTS-equivalent pins |
| Data integrity | JWT signature verification against Jwks; kid rotation handled without trust decay |
| Availability | Failure semantics defined for JWKS refresh, device-code expiry, introspection outage |
| Privacy | IdTokenClaims exposes only claims the client actually requires; scope minimization |
Each cell is a discrete thing to verify, test, and review. Keeping
the dimensions explicit makes gaps visible: for example, “what is our
availability story if the IdP’s JWKS endpoint is down for 2 hours”
is a concrete X.805 question; the answer is Jwks cache TTL +
refresh audit + fail-closed behavior on unknown kid.
ITU-T X.812 ADF/AEF
Already inherited from user-identity-and-policy-proposal.md. The
OIDC-specific instances:
- AEF (enforcement point):
CapObject::calland wrapper caps (AccessToken.authorize,TokenVerifier.verifyAccess). Bypass requires subverting the cap graph, not forging a claim. - ADF (decision point): the OAuth service when issuing a token,
AuthorityBrokerwhen returning scoped caps,TokenVerifierwhen accepting a bearer. A decision returns a capability (or denial); it does not return a boolean that downstream code might ignore.
ITU-T X.1254 / ISO/IEC 29115 LoA
Already built into the mapping: IdTokenClaims.acr/amr →
AuthStrength enum (loa1..loa4). The mapping table lives in
OAuthClientMetadata so each deployment can specify which IdP
acr/amr values count as each tier. SealPolicy.tokenExchange
carries minAuthStrength so unlock policy can say “require LoA 3+”
without knowing any specific IdP’s acr taxonomy.
ITU-T X.1252 identity management terms
X.1252 defines identity, credential, entity, enrolment,
identity provider, relying party, and identity assurance. The
proposal’s entities map directly:
| X.1252 term | capOS realization |
|---|---|
| Identity provider | OidcIdentityProvider |
| Relying party | Any service holding an OAuthClient cap |
| Credential | Bearer, DPoP-bound, or mTLS-bound AccessToken; IdToken; RefreshToken |
| Enrolment | CredentialStore bootstrap of IdP trust records + subject allow-list |
| Assurance level | AuthStrength (= X.1254 LoA) |
| Attribute authority | IdP via claims; optionally PolicyEngine for derived ABAC attributes |
| Identity binding | Canonical subjectHash mapped to a local PrincipalInfo.id; never sub alone |
ITU-T X.1255 and federated-identity discovery
X.1255 is the discovery framework for federated identity. The closest
IETF analog is OIDC Discovery (RFC 8414 + OpenID Discovery 1.0).
OidcIdentityProvider.metadata is the capOS surface for both. The
manifest-declared discovery URL + pinned TLS root closes the
“federated discovery must be trustworthy” requirement that X.1255
leaves to deployment.
ITU-T X.813 / ISO/IEC 10181-4 non-repudiation
Non-repudiation of authentication comes from the IdP’s signed ID
token. Non-repudiation of authorization decisions comes from
AuditLog records that include the decision inputs (claim
summaries, policy IDs, outcome). The framework deliberately does
not promise non-repudiation of shell commands — the agent shell
is a planner, not a signer of operator intent.
IETF OAuth security BCPs
For completeness, the proposal tracks:
- RFC 6819 — OAuth 2.0 Threat Model and Security Considerations. Covered by the threat-model section above.
- RFC 9700 — OAuth 2.0 Security Best Current Practice. The
“PKCE mandatory”, “exact redirect URI match”, “mix-up defense
via
issparameter”, “no implicit grant”, and “rotate refresh tokens” items are all baked in rather than opt-in. - FAPI 2.0 (OpenID Foundation) — financial-grade API profile.
Useful as a pre-packaged high-assurance profile: DPoP or mTLS
sender-constrained tokens, PAR, signed authorization requests,
strict algorithms.
OAuthClientMetadatais deliberately shaped so a “FAPI profile” is a set of required fields, not a separate interface.
GOST MAC/MIC (ГОСТ Р 59383-2021, ГОСТ Р 59453.1-2021)
capOS’s mandatory-access-control and mandatory-integrity-control story is described at two levels in User Identity and Policy and Formal MAC/MIC:
- a pragmatic level where userspace brokers and wrapper caps enforce labels at grant paths, and
- a formal level where an abstract automaton (subjects, objects, containers, hold edges, rights, accesses, information flows) carries explicit safety predicates and proof obligations in the shape ГОСТ Р 59453.1-2021 requires.
OIDC integration must fit both levels without introducing a second authority channel. Concretely:
Federated principals and subjects
An OIDC-authenticated UserSession creates a subject in the
formal automaton with:
subjectHash = hash(providerKind, iss, tenant, sub)— the durable external subject key, not reusable across IdPs or tenants. The local session principal may be a pseudonymous principal created for this external key or the local principal named by anExternalIdentityBinding.confidentiality_labelandintegrity_labelresolved byLabelAuthorityfrom the policy profile plus optional claim-derived refinement (e.g.groups = ["ops"]narrows to a specific compartment). Claims influence labels at mint time; they are not authority downstream.authStrengthfromacr/amr, already folded into LoA tiers.
The create_session transition in the formal automaton therefore
has one additional precondition when the login method is OIDC:
create_session(principal, policy_profile, resource_profile, oidc_proof):
pre:
verify_id_token(oidc_proof) succeeds with IdTokenPolicy
IdP trust record in CredentialStore permits (subjectHash, policy_profile)
manifest seed or AccountStore admission permits the binding
acr/amr satisfy policy_profile's minimum AuthStrength
subject allow-list or ExternalIdentityBinding admits subjectHash
effect:
new subject s with labels derived from policy_profile + claims
Hold(s, session_bundle) per AuthorityBroker(
session,
policy_profile,
resource_profile,
)
This is the same precondition shape as password and passkey login —
the safety proof does not branch on authentication method. It only
requires that verify_id_token is modeled as a trusted verifier
that rejects inputs failing the IdP’s published policy.
Integrity labels on IdP trust
IdP trust records carry an integrity level. An IdP configured as
the corporate operator IdP can mint sessions with higher integrity
than an IdP configured for guest/partner access. LabelAuthority
encodes this in the trust-record metadata; SessionManager refuses
to mint a session whose policy profile claims higher integrity than the
admitting IdP’s integrity level.
The formal invariant:
integrity(session) <= integrity(admitting_IdP_trust_record)
This closes the federated analog of “any trusted login path can mint a maximally trusted session” — a gap that is easy to introduce by accident when enterprises add a second, looser IdP.
Flow classes for token capabilities
Each method on the typed token interfaces needs a flow class per
the formal-mac-mic-proposal.md table. Proposed classifications:
OidcIdentityProvider.verifyIdToken ReadLike + NoFlow (pure verification)
OidcIdentityProvider.metadata ObserveLike
Jwks.keyById ObserveLike
Jwks.refresh ControlLike (on the Jwks object)
OAuthClient.startAuthCode ObserveLike (emits a URL; no subject-bearing data crosses)
OAuthClient.completeAuthCode TransferLike (materializes new authority as a token cap)
OAuthClient.startDeviceCode ObserveLike
OAuthClient.pollDeviceCode TransferLike (on "granted")
OAuthClient.clientCredentials TransferLike
OAuthClient.refresh TransferLike
OAuthClient.jwtBearer TransferLike + ControlLike (delegation)
OAuthClient.tokenExchange TransferLike (see narrowing below)
OAuthClient.revoke ControlLike
AccessToken.claims ReadLike (claims are metadata of the token object)
AccessToken.authorize WriteLike (outbound side-effect under the token's authority)
AccessToken.attenuate TransferLike (narrower cap minted)
AccessToken.exportRaw Declassify (trusted, audited, restricted)
AccessToken.reference/expiry ObserveLike
RefreshToken.* as above; exportRaw is Declassify
IdToken.raw Declassify
IdToken.claims ReadLike
TokenVerifier.verifyAccess ReadLike + NoFlow
DpopSigner.newProof WriteLike (produces a short-lived authenticator bound to a request)
WorkloadIdentityFederation.exchange TransferLike
The key formal-level consequences:
AccessToken.exportRaw,RefreshToken.exportRaw, andIdToken.rawareDeclassifytransitions. They must be modeled as trusted transitions with explicit audit. Excluding them by default from attenuated caps is consistent with the formal model’s requirement that declassification go through explicit trusted subjects.OAuthClient.tokenExchangeandAccessToken.attenuateareTransferLike; they cannot widen authority. The safety predicate is “issued-token scope ⊆ input-token scope ∩ policy permits” — exactly the wrapper-narrowing rule fromuser-identity-and-policy-proposal.md. The proof obligation is a scope-monotonicity lemma on the server side; capOS verifies the result by comparingTokenClaimsbefore accepting the returned cap.AccessToken.authorizeisWriteLikeagainst the external resource. In the formal model this is an outbound information flow from the subject to an object whose label is the label of the downstream service the broker wired into the request. Deployments needing a MIC proof must ensure the broker refuses to bind a low-integrity session’s token into a request against a high-integrity service — theintegrity(src) >= integrity(dst)rule applied through the broker.
Token attenuation as the wrapper-cap discipline
ГОСТ Р 59453.1 requires that every transfer either preserves or
narrows rights; capability attenuation is the capOS mechanism for
that. OIDC’s scope is a list of strings; treating scope narrowing
as cap attenuation means the verifier at issuance time must reject
any attenuate / tokenExchange result whose claimed scope is
not a subset of the source token’s scope. This is already the
spec’s behavior — the point is that the capOS implementation must
enforce it locally as well, because a misbehaving STS would
otherwise be a covert widening channel.
Subject-controls-subject and delegation
OAuthClient.jwtBearer (RFC 7523) lets a client speak on behalf of
another principal. That is a ControlLike transition in the
formal model: the invoking subject is exercising control over the
minted subject. The safety predicate is:
supervise_allowed(invoker, delegated):
integrity(invoker) >= integrity(delegated)
and invoker holds a delegation capability for the target IdP
and confidentiality/compartment labels are compatible
This is the formal reason a jwtBearer cap is not a default
session authority — it must come from a broker that checks the
control relation.
Endpoint declarations
formal-mac-mic-proposal.md requires every endpoint to declare
its flow policy. For OIDC-facing services that is:
OidcIdentityProviderendpoints declareObserveLikeon metadata calls andReadLikeon verification.OAuthClientendpoints declare the flow classes above and bind the output token’s label tomin(session.label, target_audience.label).TokenVerifierendpoints declareReadLikeand bind the verified claims to the caller’s object label (the claims flow into an object owned by the calling service).
Declaring these up-front lets the formal-mac-mic-proposal.md
review gate apply without a separate OIDC-specific checker.
ГОСТ Р 58833-2020 — identification and authentication
Beyond MAC/MIC, ГОСТ Р 58833-2020 defines organizational and technical requirements for identification and authentication. OIDC integration satisfies its technical baseline:
- Identifiers use
subjectHash = hash(providerKind, issuer, tenant, subject); subject reuse across IdPs or tenants is disallowed by construction. - Credentials (tokens, refresh tokens, DPoP keys) are held inside the OAuth service; raw material does not reach the model, the shell, or audit.
- Issuance and revocation (
OAuthClient.startAuthCode/ startDeviceCode/clientCredentials/...,revoke,SessionManager.logout) are audited. - Credential-strength policy is selectable per resource via
minAuthStrengthon seal policies and broker decisions, aligned to X.1254 / ISO/IEC 29115 LoA.
Organizational measures (credential lifecycle, incident response, operator training) remain a deployment responsibility the OS cannot enforce alone.
Proof-obligation checklist
A deployment aiming for a GOST-style MAC/MIC claim with OIDC
federation must add these obligations to the
formal-mac-mic-proposal.md proof. The checklist is explicit so
reviewers can point at individual items, and so each obligation
maps to one of the tools listed in the next subsection.
verify_id_tokentotality and policy soundness. Modeled as a trusted total function. Accepts only well-formed tokens under the configuredIdTokenPolicy(issuer, audience,acr/amr,exp/nbf/iatwith bounded skew,nonce,at_hash/c_hashwhen applicable). ReturnsIdTokenClaimsor a failure reason; never silently downgrades.- PKCE binding. No
completeAuthCode(state, code)succeeds unlessstatewas produced by a priorstartAuthCodeand the PKCE verifier stored inAuthCodeState.opaquehashes to thecode_challengethe IdP recorded forcode. - Nonce binding.
verifyIdTokenaccepts an ID token only ifclaims.nonceequals the nonce stored inAuthCodeState.opaquefor the matchingstate. Missing-nonce ID tokens on interactive logins are rejected. statebinding. The authorization response’sstatematches the one minted fromEntropySourceatstartAuthCode.- Scope monotonicity. Every
TransferLiketoken transition (AccessToken.attenuate,OAuthClient.tokenExchange,OAuthClient.refresh,OAuthClient.jwtBearer) produces a result whose scope is a subset of the input scope intersected with the broker/IdP-permitted set. No transition widens scope. - JWKS live-set invariant. A token signed under
kid = kis accepted iffkwas present inJwksat some timetwithiat - clockSkew ≤ t ≤ now. Rotation that removeskdoes not retroactively invalidate tokens already verified under it; rotation that addskdoes not accept tokens older than its introduction. - Device-code polling discipline.
pollDeviceCodehonors the IdP-issuedinterval;slow_downresponses monotonically increase the local backoff;expiredanddeniedare terminal. - Refresh rotation invariant. A successful
OAuthClient.refreshthat rotates the refresh token marks the priorRefreshTokencap broken; any subsequent use returnsrevoked(no parallel use of two generations of the same refresh-token family). - Session-creation MAC/MIC predicate.
create_sessionwith an OIDC proof establishesintegrity(session) ≤ integrity(admitting_IdP_trust_record)andconfidentiality(session) ⊑ confidentiality_ceiling(policy_profile, claims). - Broker-outbound MAC/MIC predicate. When the broker binds
an
AccessTokento an outbound request, the call site label satisfiesintegrity(src) ≥ integrity(dst)and the confidentiality flow is permitted.jwtBearerdelegations additionally satisfysupervise_allowed(invoker, delegated).
Additional implicit obligations:
Declassifytransitions (AccessToken.exportRaw,RefreshToken.exportRaw,IdToken.raw) are restricted to trusted subjects and produce audit records with the mandatoryreasonargument.- Endpoint flow declarations for OIDC services cover every method in the schemas above; adding a new method without a declaration is a review failure.
These are additive to the obligations in
formal-mac-mic-proposal.md. None require a new kernel mechanism;
they extend the same wrapper-cap / endpoint-flow-declaration
discipline to OIDC-backed subjects and token-typed capabilities.
Tool assignment
| Obligation | Primary tool | Notes |
|---|---|---|
1. verify_id_token totality | TLA+ + Kani | TLA+ models the trusted function; Kani proves the Rust impl is total |
| 2. PKCE binding | TLA+ | 3-state machine (started/completed/failed); invariant on state |
| 3. Nonce binding | TLA+ | Joint state with PKCE, same module |
4. state binding | TLA+ | Joint with 2/3; plus Alloy for EntropySource uniqueness |
| 5. Scope monotonicity | Alloy + Prusti | Alloy for the attenuate/exchange graph; Prusti as post-condition |
| 6. JWKS live-set invariant | TLA+ | Temporal property; Apalache if TLC state-space explodes |
| 7. Device-code polling discipline | Z.100 SDL + TLA+ | SDL for state + timer structure; TLA+ for liveness |
| 8. Refresh rotation invariant | TLA+ | Safety + single-generation liveness |
| 9. Session-creation MAC/MIC predicate | Alloy | Extends the hold-edge graph model from formal-mac-mic |
| 10. Broker-outbound MAC/MIC predicate | Alloy | Same model; predicate over outbound endpoint declarations |
| Declassify auditing | Kani | Rust-level: every exportRaw path writes an audit record |
| Endpoint flow declarations | Review gate + Alloy | Enumerate methods, check coverage relationally |
Supporting artifacts (useful even before the full proof lands):
- Z.120 MSC sequence charts for the three primary flows
(authorization code + PKCE, device code, token exchange). MSC
traces from a running capOS are already shaped like the sequence
dumps
tools/ccsproduces for capability rings, which makes property-checking “no RETURN without a matching CALL” a straightforward analog to “no token issuance without a matching authorization event.” - Proptest/fuzz harnesses for the JWT parser, claim validator,
PKCE verifier hash, DPoP proof parser, and discovery-document
parser. These are not formal proofs but are the first line of
defense for obligations 1 and 2. Tracked under the existing
security-and-verification-proposal.mdtiered tooling plan. - Loom model for the concurrent
Jwksrefresh path: multiple verification requests racing with a refresh triggered byunknownKeyId. Obligation 6’s live-set invariant is the correctness condition.
Out of scope for the formal track
- The external IdP’s own correctness. The model treats the IdP as a trusted oracle that emits signed tokens matching its published policy; bugs in the IdP itself are not capOS-provable.
- Network-layer adversaries. TLS authentication of the IdP and the token endpoint is assumed; that proof lives in the certificates proposal’s track.
- Timing and microarchitectural side channels on signature verification and DPoP checks. Treated as deployment-level mitigations (constant-time libraries, cache partitioning) rather than modeled flows.
- User behavior. Phishing, social engineering, and operator credential sharing are outside the model.
- IdP key compromise. Modeled as an assumption violation; the formal proof cannot recover from a signing-key compromise at the IdP.
Note on GOST cryptographic primitives
Separately from MAC/MIC, a deployment may also require GOST
cryptographic algorithms (GOST R 34.10-2012 signatures,
GOST R 34.11-2012 Streebog hashes, GOST R 34.12-2015 symmetric
ciphers) throughout the JWT/JWS and TLS stack. Those are additive
enum extensions in SignatureScheme, HashAlgorithm,
AsymmetricAlgorithm, and SymmetricAlgorithm across the key,
certificates, and OIDC proposals plus a certified cryptographic
library. The interface shape does not change; the MAC/MIC analysis
above is independent of the algorithm choice.
ФСТЭК threat modeling
The threat-model section above enumerates OAuth/OIDC-specific
attacks (leakage, replay, mixed-IdP confusion, discovery
tampering, PKCE downgrade, code replay, redirect hijack, CSRF,
nonce omission, sub confusion, JWKS flooding, token-exchange
evasion, clock skew). Mapping that enumeration to ФСТЭК’s
“Методика оценки угроз” taxonomy is a deployment-specific
documentation exercise; the raw facts are already here.
How this combines
A capOS deployment choosing a high-assurance profile selects:
- X.805 dimensions to audit explicitly (all eight for a regulated service).
- X.1254 LoA floor per resource (via
minAuthStrengthon seal policies and broker bundles). - A label lattice (confidentiality + integrity) and which IdP trust records can mint sessions at which labels.
- Which token transitions are modeled as
Declassify/Transfer/Controlin the MAC/MIC automaton. - A concrete IdP trust bootstrap (manifest-pinned JWKS snapshot vs. discovery with pinned TLS root).
- A concrete audit redaction and retention policy consistent with applicable regulation (ITU-T X.816, ФСТЭК guidance, GDPR, or sector-specific rules).
No kernel change is required to land any of these. Each choice
narrows the behavior of userspace services — OAuthClient,
OidcIdentityProvider, TokenVerifier, CredentialStore,
SessionManager, AuthorityBroker, LabelAuthority,
AuditLog — inside the same capability model.
Interaction with capOS Authority Model
OIDC and OAuth2 decide which external subject was authenticated and which
scopes apply to this call. Admission policy decides which local principal,
account, policy profile, and resource profile that external subject maps to.
They do not decide which caps exist in the process. That remains the job of
AuthorityBroker.
Practical flow:
- User authenticates to capOS via OIDC.
SessionManager.loginverifies the ID token and computessubjectHash = hash(providerKind, iss, tenant, sub). SessionManagerresolvessubjectHashthrough manifest seed admission, a local account-storeExternalIdentityBinding, or an explicit auto-creation rule. The result is a local or pseudonymous principal plus selected policy and resource profiles.SessionManagermints aUserSessionwhosePrincipalInfo.idis the resolved principal and whoseauthStrengthderives fromacr/amr.AuthorityBroker.requestreceives the session and any relevant access token. Scopes and OIDC claims are inputs to the RBAC/ABAC/MAC decision. They are never sufficient authority on their own.- The broker returns a capability bundle (or denial). The access
token is delivered inside an
ApprovalGrantor a wrapper cap when the caller needs to invoke an external service; the raw bytes remain inside the OAuth service. - For outbound calls to an OAuth-protected resource, the capOS
service holds an
AccessTokencap; it does not see the token string. - For inbound calls, a capOS service configured as an OAuth2
resource server holds a
TokenVerifiercap plus itsAuthorityBrokercap; verification yields claims, and the broker converts claims into narrower caps for the call.
This is the same “decision returns a capability” pattern the user-identity proposal already uses for Cedar/OPA. OIDC just provides one more input shape.
Phases
Phases follow the consumers.
Phase 1 — IdP and client schemas, JWT verification
- Add the schemas above to
schema/capos.capnp. - Implement a RAM-only IdP cache that can load a discovery document and JWKS from a static test fixture and verify a sample ID token.
- Implement
JwtVerifieroverPublicKeyprimitives from the key proposal using a vetted Rust crate (jsonwebtoken,biscuit, or a purpose-built verifier on top ofrsa/ed25519-dalek/p256). - Host tests: signature verification across RS256/ES256/EdDSA, issuer/audience/exp checks, clock skew, algorithm allow-list.
Phase 2 — OAuth client and device code
OAuthClientwithclientCredentials,refresh, anddeviceCodegrants.- Outbound HTTPS via the networking and certificate stacks (requires those to be real).
- Console OIDC login proof: QEMU serial starts
startDeviceCode, an operator completes the flow out-of-band,pollDeviceCodereturns a bundle,verifyIdTokensucceeds, a manifest-seeded external admission rule selects policy/resource profiles, and aUserSessionis minted.
Phase 3 — Authorization code + PKCE
- Web text shell gateway redirects to the IdP and consumes the returned code.
startAuthCode/completeAuthCodeintegrated with the gateway’s HTTP listener.- Per-session
nonce,state, and PKCE verifier all live inAuthCodeState.opaque.
Phase 4 — Resource server verification
TokenVerifier.verifyAccesswith JWKS refresh and introspection-endpoint fallback for opaque tokens.- Policy enforcement: required scopes, audience binding, cnf confirmation (DPoP or mTLS).
Phase 5 — Workload identity federation
WorkloadIdentityFederationwith subject sources for GCP and AWS.- Depends on
InstanceIdentityfrom cloud-metadata and a working outbound TLS client. CloudKmsKeySourcegains a no-baked-credentials unlock path.
Phase 6 — Private key client auth and DPoP
ClientAuthMethod.privateKeyJwtusingJwtSigner.DpopSigner+ConfirmationKind.dpopinTokenVerifier.- RFC 9449 nonces when the resource server supports them.
Phase 7 — mTLS-bound tokens and extended federation
ClientAuthMethod.tlsClientAuthper RFC 8705.- Attestation-report-backed federation
(
SubjectSource.attestationReport) for confidential computing. - CIBA grant (RFC 9126 + OpenID CIBA) if a deployment needs it for step-up on mobile devices.
Phase 8 — Token exchange as a first-class broker input
AuthorityBrokeraccepts anAccessTokenorIdTokenplus scopes as policy input; decisions can return narrowed access tokens alongside narrower caps.- Account-store-backed
ExternalIdentityBindingrecords replace manifest-only external admission for ordinary federated logins. Unknown external subjects are denied unless an explicit auto-creation rule names policy and resource profiles. - Per-user
EncryptedNamespaceunlock viaOidcFederatedKeySource(defined in the key-management proposal) using the user’s current access token as unlock context.
Phase 9 — Local IdP (optional, deferred)
- A
LocalIdentityProvidercap that issues tokens to other capOS services on the same host or fleet, signed by aJwtSignerbacked by aKeyVault-storedPrivateKey. Useful for air-gapped deployments and for bootstrapping workload federation between two capOS instances. Not in v1.
Relationship to Other Proposals
- Cryptography and Key Management
— supplies
PrivateKey/PublicKey/KeyVault/SignatureScheme/AsymmetricAlgorithm, theKeySourcefamily this proposal’sJwtSignerbinds to, and theKeyPurpose.oauthClientAssertion(RFC 7523private_key_jwt) andKeyPurpose.oidcIdToken(LocalIdentityProvidermint path) values that constrain how those keys may be used. That proposal also definesKeySourceKind.oidcFederatedand the correspondingOidcFederatedKeySource(Phase 6b there), plusSealPolicy.tokenExchange, which together letEncryptedNamespaceand other sealed payloads unlock against anAccessTokenminted here instead of a baked credential. - Certificates and TLS
— supplies the
TlsClientConfigconsumed by OIDC discovery, JWKS, token, introspection, revocation, and IdP admin endpoints, and thePinSetcomposed in per-IdP trust records so issuer roots and SPKI hashes stay isolated from the ambient WebPKI store. The two proposals meet at three X.509 corners:ClientAuthMethod.privateKeyJwtcarries an X.509Certificatecap when the IdP requires a cert-bound assertion;ClientAuthMethod.tlsClientAuth(RFC 8705) consumes a PKI-rooted clientCertificateplusPrivateKey; andClientAuthMethod.selfSignedTlsClientAuth(RFC 8705 §2.2) uses a self-signedCertificatepublished inOAuthClientMetadata. The certificates proposal handles X.509 verification; this proposal owns the resulting token-typed capabilities. - Boot to Shell —
device code and authorization code grants are
SessionManager.loginmethods.CredentialStorestores IdP trust records (issuer URL, JWKS, allowed audiences) alongside password verifiers and passkey public credentials. - Shell — the authority
broker consumes access tokens as ABAC input; the agent shell
holds
ApprovalGrantwrappers, not raw tokens. - User Identity and Policy
— owns the canonical external subject key
subjectHash = hash(providerKind, issuer, tenant, subject), theExternalIdentityBindingrecord (mapped fromsubjectHashto a local or pseudonymousPrincipalInfo.idplus named policy and resource profiles), and the three admission sourcesSessionManagerconsults afterverifyIdTokensucceeds: manifest seed admission, local account-store bindings, and explicit pseudonymous auto-creation rules. OAuth scopes and OIDC claims (acr,amr,groups, tenant) are normalized ABAC attributes fed toAuthorityBroker/PolicyEngine, never authority on their own;AuthStrengthderives fromacr/amrthrough the deployment-configured mapping inOAuthClientMetadata. - Volume Encryption
— OIDC-gated KMS unlock replaces baked IAM credentials; per-user
EncryptedNamespaceunlock usesOidcFederatedKeySource. - Cloud Metadata —
InstanceIdentityis the primary subject token source for workload identity federation. The current proposal’s ownSubjectSource.instanceIdentityJwtis implemented by that cap. - Networking — outbound OAuth calls use a userspace HTTP/TLS client built over the networking stack. Service-to-service OAuth coexists with mTLS as two delegation patterns rather than competing ones.
- System Monitoring
— every
verifyIdToken, token issuance, refresh, exchange, andverifyAccessflows through the audit cap. Redaction rules from the boot-to-shell proposal apply: claim summaries and token references, never raw tokens. - Security and Verification — JWT/JWS/JWE parsers are classic fuzz targets; PKCE and device-code state machines are Loom candidates; token-exchange policy evaluation is a Kani candidate.
- Live Upgrade — the OAuth service holds sensitive live state (refresh tokens, DPoP private keys, PKCE verifiers). Live upgrade needs a state-transfer path that does not leak tokens through shared memory.
Open Questions
- Do we ship our own OIDC RP implementation or wrap an existing
Rust crate?
openidconnect-rs,oauth2-rs, andbiscuitare candidates. The schema boundary is independent; the implementation choice affects TCB size and audit surface. - Opaque access token handling. Some IdPs issue opaque tokens validated only by introspection (RFC 7662). Latency and load on the introspection endpoint are operational concerns; caching introspection responses is fiddly (when is the cache allowed to serve stale “active”?). Probably: support introspection with short cache TTL and per-policy opt-in.
- PKCE-less legacy clients. A deployment against an old IdP that cannot do PKCE. Do we allow a config escape hatch, or do we refuse to boot? Leaning “refuse” given OAuth 2.1 guidance.
- DPoP nonce plumbing. Server-issued nonces (RFC 9449 §8)
require the caller to retry after the first 401 with the returned
nonce. Fits naturally in a wrapper cap around
AccessToken, but the retry policy on non-idempotent methods needs a clear rule. - Device code on air-gapped consoles. Device code presumes the user has another device with a browser. Pure-air-gapped hosts must fall back to password + passkey; what about console-only OIDC without internet? Probably: no-op; offline OIDC is an oxymoron, use local auth.
- How do tokens transfer across capability boundaries? Per-
consumer down-scoped issuance is the default. Should
AccessToken.attenuatebe a kernel-level badge, a userspace wrapper cap, or both depending on whether attenuation is server-side (token exchange) or client-side (scope subset)? - Logout semantics. OIDC end-session endpoints are
optional and frequently inconsistent across IdPs. When
UserSession.logoutfires, what is the best-effort expectation: local session drop + IdP revoke + RP-initiated logout redirect? Document a clear failure mode for each step. - Default audiences for
AuthorityBrokerdecisions. When the broker down-scopes an access token, what audience does the narrower token target — always the resource server the broker just returned a cap for? Or a list, for broker decisions that return a compound bundle? Probably: one audience perCapRequest, bundles emitted as multiple broker responses. - External auto-creation policy. Which OIDC providers may create pseudonymous local accounts, which policy/resource profiles may they name, and what rollback/recovery record proves the mapping was not replayed from stale account-store state?
- Support for JAR / PAR / JARM. Pushed Authorization Requests and JWT-Secured Authorization Response Mode are increasingly expected by enterprise IdPs. Phase 3 should support PAR; JAR and JARM can follow.
- Clock source. OIDC verification depends on a reliable clock.
Before the Timer capability and a cloud attested-time source
exist,
verifyIdTokenmust either fail closed or consume a bootstrap clock from the manifest. Document the first-boot behavior. - Key binding for user sessions. Should a
UserSessionbe bound to a DPoP key by default (so a leaked session ID is useless without the key), or is that overkill for console sessions? Probably: yes for web gateway sessions; no for direct local console sessions where session state never leaves the host. - GOST / jurisdictional OIDC. Some deployments mandate
GOST-signed JWTs (GOST R 34.10-2012 on the JWT signature).
Adding the algorithms to
SignatureSchemeis schema-level; validating a GOST-signed discovery document requires matching trust-store support in the certificates proposal. Track, do not block.
Proposal: Volume Encryption
Encrypting system and user volumes in a capability OS where storage is already a stack of typed capabilities and keys can be first-class capability objects.
Problem
capOS currently has no persistent storage, no crypto, no TPM driver, and no block-device drivers. That is the right moment to decide what encryption-at-rest looks like, before storage interfaces and service graphs harden around plaintext assumptions.
Traditional OSes bolt encryption on as a kernel subsystem
(dm-crypt/LUKS, BitLocker, FileVault, fscrypt). That choice follows
from those kernels’ architecture: the kernel owns block I/O, the
filesystem, the keyring, and the trust domain between processes, so
encryption logically lives there too. capOS has made the opposite bet —
the kernel is a capability router, block I/O lives in userspace
services, filesystems are userspace services, and there is no ambient
keyring because there is no ambient anything.
Putting crypto in the kernel would contradict Design Principle 5 (“the kernel is becoming a capnp-rpc router”) and Principle 7 (“pragmatic reuse” — let userspace crates do what they already do well). Putting it nowhere leaves the system unable to protect data at rest. The proposal below places encryption in userspace services expressed as capabilities, with no new kernel mechanism.
Threat Model
Four attackers worth distinguishing up front, because the defenses differ:
- Offline disk theft. Attacker has the storage medium, no live system, no running key service, possibly no hardware attestation. Ciphertext must reveal nothing about plaintext beyond length and block boundaries.
- Ciphertext tampering at rest. Attacker can write to the medium and hopes to flip ciphertext bits to produce attacker-chosen plaintext changes (classic XTS malleability). Modification must be detected, not merely scrambled.
- Peer userspace service holding the raw
BlockDevicecap. The virtio-blk driver, a backup agent, a telemetry exporter, or any service that is on the physical I/O path. They hold authority to read sectors but must not see plaintext for volumes whose key they do not hold. - Compromised session with a live key cap. Once an attacker is
inside a user’s session and holds the user’s
SymmetricKeycap, that user’s data is lost. The goal is lateral containment: no cross-user leverage, no escalation to the system volume, no access to other sessions’ keys.
Out of scope for a first pass:
- Cold-boot RAM attacks and side channels (mitigation: use TPM-bound keys when available, but physical memory reads against a running host are not defended).
- Evil-maid attacks on the unencrypted portion of the boot image (addressed separately by secure boot / measured boot — see Storage and Naming Open Question #5).
- Traffic analysis against encrypted backups or encrypted replication.
- Key escrow for legal recovery. capOS takes no position; a deployment
can add an escrow
KeySourcewithout changing the model.
Keys Are Capabilities
Key material never crosses cap boundaries. Callers hold
SymmetricKey or PrivateKey capabilities whose methods run inside
the service that holds the key; the holder gets encrypt/decrypt/sign
authority, not the bytes. Attenuation (decrypt-only, AAD-pinned,
purpose-bound) is wrapper CapObjects, the same mechanism that builds
read-only Files.
This proposal does not define those interfaces. They belong to
Cryptography and Key Management,
which covers SymmetricKey, PrivateKey/PublicKey, KeySource,
KeyVault, algorithm and purpose enums, seal policies, and the set
of concrete key sources (manifest-embedded, passphrase, passkey PRF,
TPM 2.0, cloud KMS, attestation, network, software-stored). Volume
encryption is one consumer among many.
Layer Placement
Two layers exist, and a first-class design uses both.
Layer A — EncryptedBlockDevice (LUKS analog)
A userspace service holds two caps — BlockDevice (raw) and
SymmetricKey — and exports a new BlockDevice cap that looks
identical to its input but encrypts writes and decrypts reads
transparently. Everything above the wrapper (filesystems, the Store
service, content-addressed backends) is oblivious.
Raw block device
→ virtio-blk / NVMe driver → BlockDevice cap (ciphertext)
→ EncryptedBlockDevice service holds [BlockDevice + SymmetricKey]
→ BlockDevice cap (plaintext-view)
→ FAT / ext4 / Store service
→ File / Directory / Namespace caps
→ App
Properties:
- One key per volume (or per-range, see “Key hierarchy” below).
- Granularity is a sector/block. Metadata in the filesystem layer is encrypted along with data — the shape of the directory tree is invisible to threat #3.
- Incompatible with zero-copy device DMA into user pages (see “SharedBuffer” below).
Layer A defends against threats #1, #2, and #3.
Layer B — per-user Namespace / Directory encryption (fscrypt analog)
Layered above a filesystem or Store, Layer B encrypts object contents and, optionally, object names, using a per-user key. The underlying block device may or may not also be encrypted.
BlockDevice (ciphertext or plaintext)
→ Store service → Store/Namespace caps (ciphertext objects)
→ EncryptedNamespace service holds [Namespace + UserKey]
→ Namespace cap (plaintext-view)
→ User's session services
Properties:
- One key per user (or per session, per device, per tenant).
- Metadata at the filesystem/Store layer is visible to threat #3 unless Layer A is also in place.
- Cap boundaries are naturally per-user — revocation is “drop the cap,” no filesystem rekeying.
- Compatible with shared filesystems across users (per-entry encryption).
Layer B defends primarily against #4-lateral (a compromise of user Bob’s session does not reveal user Alice’s data) and against a compromised shared filesystem service when the underlying block layer is unencrypted.
Recommendation
Use both. Layer A for the system volume and for the per-tenant block substrate in multi-tenant deployments; Layer B for per-user data on top of a shared filesystem or store. Users who run single-tenant desktops can skip B. Cloud VMs that rely on provider-side encryption of block storage (see “Cloud integration”) can skip A and keep B. The proposal does not mandate either layer; it standardizes the interface so both compose.
Volume-Specific Schemas
SymmetricKey, KeySource, KeyAlgorithm, KeyPurpose, and
SealPolicy are defined in
Cryptography and Key Management.
This proposal adds only the wrapper-factory and on-disk-format
schemas.
EncryptedBlockDevice
Exposes nothing new — it implements the existing BlockDevice
interface. The distinction is where it sits in the cap graph. A
factory cap creates it:
interface EncryptedBlockDeviceFactory {
open @0 (raw :BlockDevice, key :SymmetricKey, format :VolumeFormat)
-> (plain :BlockDevice);
format @1 (raw :BlockDevice, key :SymmetricKey, params :FormatParams)
-> (plain :BlockDevice);
}
struct VolumeFormat {
superblock @0 :Data; # read from raw device during open()
algorithm @1 :SymmetricAlgorithm; # defined in key-management proposal
sectorSize @2 :UInt32;
tagAreaLayout @3 :TagAreaLayout;
}
Cryptographic Construction
Two separate questions — block layer and object layer — with different answers.
Block layer (Layer A)
Requirement: authenticate every block. XTS alone is not enough; it defends against #1 but not #2.
Shortlist:
- AES-256-GCM-SIV with LBA-derived nonce + separate tag area. The
nonce is
HMAC(K_nonce, LBA)(deterministic, no extra storage). The tag (128 bits) is stored in a reserved tag area, either a sidecar journal (dm-integrity style) or a reserved footer per block group. Cost: ~3% storage overhead for the tag, one extra read/write to the tag area per I/O (usually absorbed by sector grouping). Defends against #1 and #2. - XChaCha20-Poly1305 with random nonce + tag. Same tag-storage problem as GCM-SIV; XChaCha’s 192-bit nonce removes nonce-reuse concerns entirely. Slower than AES on hardware that has AES-NI, faster on hardware that doesn’t (e.g. low-end ARM).
- AES-256-XTS alone. The LUKS1/LUKS2 default. Reject this as the sole defense; it fails #2. May still be useful as a building block under an external MAC (dm-integrity + dm-crypt in Linux).
- Wide-block constructions (HCTR2, Adiantum). Length-preserving, no MAC. Better diffusion than XTS but still fail #2. Useful only when storage overhead for tags is unacceptable and tamper-detection is being provided elsewhere.
Recommendation: AES-256-GCM-SIV with LBA-derived nonce and a
dedicated tag area, fallback to XChaCha20-Poly1305 on hardware without
AES-NI. Document the tag-area layout in VolumeFormat; don’t invent a
scheme per deployment.
Object layer (Layer B)
Requirement: per-object authentication; compatibility with content-addressed storage where possible.
Options, with the honest tradeoffs:
- Per-tenant keys,
hash(ciphertext)as address. Each user’s Store encrypts with their key. Dedup works within a volume, not across. Metadata (object size, access patterns) is visible to a peer holding the backingBlockDevice. This is the recommended default. - Per-tenant keys,
HMAC(K, plaintext)as address. Address derived deterministically from plaintext allows a user to look up their own objects by plaintext hash without scanning. Same cross-tenant properties as above. - Convergent encryption (key =
hash(plaintext)). Global dedup across users, but leaks equality: “user X holds the same file as user Y.” Rejected as a default; too much leakage for a capability-based OS that treats ambient authority as a bug.
All three use an AEAD (GCM-SIV or XChaCha20-Poly1305) per object with a random nonce stored with the object.
System Volume Flow
- Boot firmware loads Limine, which loads the kernel + init + boot services from an unencrypted boot partition.
- Kernel spawns init. Init spawns a minimal service graph: block
device driver, console service,
KeySourceservice (one of passphrase / TPM / cloud KMS / manifest-embedded), and theEncryptedBlockDeviceFactoryservice. - Init obtains the unlock context. For interactive boot: read a passphrase via the console login flow in Boot to Shell. For unattended boot: invoke TPM unseal, KMS decrypt, or an attestation protocol. Contexts that require networking (cloud KMS, Tang) come up after the network stack.
- Init hands
(BlockDevice, SymmetricKey)toEncryptedBlockDeviceFactory.openand receives a plaintext-viewBlockDevice. - Init hands that
BlockDeviceto the filesystem or Store service, which becomes the system storage root. - Init pivots to the services graph baked in the now-readable system
volume. Services that do not need direct I/O never see a raw
BlockDeviceand therefore never see ciphertext.
Analogous to Linux’s initramfs pattern, but with capabilities
instead of /dev paths.
User Volume Flow
- User authenticates through the login flow in
Boot to Shell. Success
yields a session and a
CredentialStoreresponse. SessionManagerinvokes the user’sKeySource— passkey PRF, password-derived, or cloud-held — yielding a userSymmetricKey.SessionManagerhands(UserNamespace, UserKey)to anEncryptedNamespaceFactory.openand receives a plaintext-viewNamespace.- The plaintext Namespace is installed in the session’s CapSet. Services in the session see only the user’s decrypted view.
- On logout, the session is torn down; the user
SymmetricKeycap is released; the key service’s in-process material is zeroized.EncryptedNamespacestops decrypting. Ciphertext remains intact on disk.
Revocation is a cap-drop, not a filesystem rekey.
SharedBuffer and DMA
SharedBuffer (docs/roadmap.md Stage 6 / MemoryObject) exists so devices can
DMA directly into app pages. Software block encryption is inherently
incompatible with that: the device writes ciphertext; the app expects
plaintext.
Three honest answers:
- Extra copy. Driver DMAs into a scratch page held by the
EncryptedBlockDeviceservice, which decrypts into the app’sSharedBuffer. One extra copy per I/O. Simple; correct; first implementation. Cost is dominated by the crypto itself, not the copy, for typical I/O sizes. - Decrypt in place. Device DMAs ciphertext into the app’s
SharedBuffer; the service decrypts it in-place before completion is posted. Saves a copy, keeps CPU crypto on the hot path, and complicates reuse of the buffer (the app sees ciphertext briefly, then plaintext). Viable once the buffer lifetime is well-specified. - Hardware inline crypto. NVMe OPAL, SED drives, Intel CSE, AES-XTS block engines on some ARM SoCs. Device sees the key; DMA paths see plaintext; software sees an unencrypted-looking device. Different trust model — the device is now in the TCB — and different key-provisioning story (IEEE 1667 / TCG Opal PSID). Note for future work; not a first-implementation target.
First implementation: #1. Revisit #2 when I/O performance matters.
Treat #3 as a separate capability shape (SelfEncryptingBlockDevice)
rather than a flag on the main interface.
Boot Order and the Unencrypted Boot Partition
By construction there must be an unencrypted partition containing at least: Limine, kernel, init, the block device driver, the key-source service(s), the encrypted block device factory, and — if the key source requires it — a minimal networking stack.
This partition is the trust root for the whole system. It does not need to be encrypted, because its contents are either integrity-protected by a measured-boot chain or considered public anyway (the capOS binaries are open source). It does need to be integrity-protected, which is secure boot / measured boot — addressed in Storage and Naming Open Question #5 and not duplicated here.
Relationship to that question: a TPM-sealed KeySource requires
measured boot to be useful. Without measurement, a tampered boot
partition can unseal the key under attacker-controlled code. A
passphrase KeySource does not require measured boot, only the
expectation that the user will notice if the boot UI looks wrong. A
cloud KMS KeySource relies on cloud-provider instance identity,
which is a parallel trust story (see below).
Cloud Integration
Cloud environments change every part of this picture: the block device is virtual, the key store is a network service, instance identity is provider-signed, object storage exists as a first-class primitive, and backups are a product, not a script. capOS should treat each of these as a capability and reuse them.
Cloud block storage (EBS, GCP Persistent Disk, Azure Disk)
These volumes are already encrypted at rest by the provider. The question is whose key performs the encryption:
| Model | Provider sees plaintext? | Customer controls key? | Customer does crypto? |
|---|---|---|---|
| Provider-managed (default) | Yes (plaintext in volume) | No | No |
| Customer-managed (CMEK) | Yes (plaintext in volume) | Yes (via KMS) | No |
| Customer-supplied (CSEK) | Briefly, during request | Yes | No |
| Client-side (Layer A) | No | Yes | Yes |
capOS’s BlockDevice cap is indifferent to which of the first three
the provider is doing. For the fourth — client-side encryption — capOS
wraps the provider’s BlockDevice cap in its own
EncryptedBlockDevice. The provider sees only ciphertext and cannot
read the volume even with a compelled-disclosure order.
Deployment guidance:
- Untrusted provider / compliance-driven: Layer A over cloud block storage. Provider-side encryption becomes a belt-and-braces redundancy.
- Trusted provider / operational simplicity: rely on CMEK, skip Layer A. Capability model still contains peer services — a compromised capOS service does not get raw block I/O unless it holds the cap.
- Confidential-computing VMs (SEV-SNP / TDX / Nitro): use Layer A
with an attestation-gated
KeySource. The attestation report proves the VM is genuine and running approved code; KMS releases the DEK only against a valid report.
Cloud KMS (AWS KMS, GCP KMS, Azure Key Vault, Vault, …)
Envelope encryption is the universal pattern: the cloud KMS holds a key-encrypting key (KEK) with tight IAM-bound access; the actual data-encrypting key (DEK) is generated by capOS, wrapped by the KEK, stored alongside the ciphertext, and unwrapped by KMS at unlock time.
Map to capabilities:
- A
CloudKmsKeySourceservice implementsKeySource.unlock(blob)sends the wrapped DEK to KMS forDecrypt, receives the plaintext DEK, constructs a localSymmetricKeycap around it, and returns it. - The service authenticates to KMS using the VM’s instance identity,
obtained from a
CloudMetadata-derivedInstanceIdentitycap (see Cloud Metadata). No long-lived credentials are baked into the image. seal(key, KmsPolicy{kmsKeyId, grant})calls KMSEncryptto wrap the key under the named KEK and returns the opaque blob.- KMS audit logs record every unwrap. This is a free observability win capOS inherits by delegation; nothing in the OS needs to log key usage separately.
Benefits of envelope encryption that capOS gets by following the pattern:
- Free KEK rotation. Rotating the KEK requires only re-wrapping
the DEK (fast, metadata-only). The DEK itself stays; the volume is
not rewritten. A
rewrapmethod onKeySourcemakes this explicit. - Revocation. Disable the KMS key or revoke the IAM grant; the
next
unlockfails. Running instances with a cached DEK continue until reboot — matches Linux behavior. - Cross-region / cross-account access. KMS grants move
ciphertext-readable capability between accounts without handing
over the key material. capOS reads that as “the receiving account
holds a
KeySourcecap whose policy the grant satisfies.”
Non-AWS KMS providers (Vault, HSM clusters, KMIP devices) fit the
same interface. The CloudKmsKeySource service name is a placeholder;
production likely wants one service per provider, or one generic
service with a provider-selection parameter.
Instance identity and attestation
Cloud VMs authenticate to KMS without baked-in credentials because the hypervisor signs identity tokens. AWS IMDSv2, GCP metadata identity tokens, and Azure IMDS all produce short-lived signed JWTs. Confidential-computing platforms extend this with hardware attestation reports (SEV-SNP, TDX, Nitro).
An InstanceIdentity capability — carved out of
Cloud Metadata — exposes
these token and attestation paths. Key-source services consume that
cap instead of pulling from an ambient metadata endpoint. Revoking a
service’s access to the metadata service becomes a cap-graph edit:
no firewall rules, no iptables on 169.254.169.254.
OIDC-gated volume unlock (workload identity federation)
InstanceIdentity is the raw material. Modern clouds consume it
through OIDC token exchange (RFC 8693) rather than a provider-
specific identity API. That pattern is defined in
OIDC and OAuth2 as
WorkloadIdentityFederation; volume encryption consumes it through
OidcFederatedKeySource (see
Cryptography and Key Management).
System-volume flow:
- Boot the key-less image.
initstarts the block driver, the metadata service, and the OAuth service, but never holds raw cloud credentials. CloudMetadatareturns anInstanceIdentitycap (a signed JWT from the hypervisor).WorkloadIdentityFederation.exchangeposts that JWT to the cloud STS withgrant_type = urn:ietf:params:oauth:grant-type:token-exchangeandsubject_token_type = urn:ietf:params:oauth:token-type:jwt. It receives a short-lived cloud access token bound to the instance’s identity.OidcFederatedKeySourceuses that access token to authenticate aDecryptcall on the wrapped DEK at the cloud KMS. The plaintext DEK returns as aSymmetricKeycap.EncryptedBlockDeviceFactory.opencomposes that key with the rawBlockDeviceand returns a plaintext-viewBlockDevice.
Per-user volume flow (Layer B):
- Alice authenticates through console or web shell OIDC; the IdP issues an ID token and an access token.
SessionManagermints herUserSession; herAccessTokencap is handed toOidcFederatedKeySourcewrapped inside the broker- returned session bundle — never as a bearer string.- The key service enforces
SealPolicy.tokenExchange { issuer, audience, subjectPattern, requiredClaims, minAuthStrength }. It verifies the access token (or an ID token it exchanges for) against its pinned IdP trust record and only then releases Alice’s DEK. EncryptedNamespaceFactory.openyields Alice’s plaintext namespace. Logout drops the cap; the in-process key material zeroizes.
Properties this adds on top of plain CloudKmsKeySource:
- No long-lived IAM credentials anywhere in the image. The historical instance-role access-key pair is gone; what remains is a short-lived access token tied to the live workload.
- Audit keyed on principal. Cloud KMS logs the OIDC
subof every Decrypt, so “Alice’s laptop unlocked her volume at 09:14” is observable without extra audit glue. - Step-up authentication on the unlock path.
TokenExchangePolicy.minAuthStrengthmaps to X.1254 LoA. A volume requiringloa3cannot be unlocked by a passwords-only session. - Revocation through IdP or KMS. Disable Alice at the IdP or revoke the IAM grant and the next unlock fails. Cached DEKs in running instances survive until reboot — identical to today’s cloud KMS semantics but explicit.
Token TTL vs. cached DEK
OIDC access tokens typically expire in minutes; DEKs typically live
for as long as a volume is mounted. OidcFederatedKeySource.unlock
is called once per mount; the DEK cap is held by the encrypted
block/namespace service until mount ends. Token expiry after unlock
does not re-lock the volume. This matches every other KMS-unwrap
pattern (CloudKmsKeySource, Tpm2KeySource), but it is worth
saying aloud: short-lived tokens give short-lived authorization
freshness, not short-lived key availability. Deployments that
want stricter revocation can:
- require periodic re-unlock (re-mount) via broker policy,
- keep the volume mounted read-only by default and require a fresh token for each write window,
- or use a confidential-computing + attestation-gated KEK that the hardware refuses to re-release on policy change.
No baked credentials policy
The capOS ISO must contain neither a long-lived cloud IAM credential
nor a long-lived bearer token. ManifestEmbeddedKeySource remains
dev/CI only. Production builds pass through one of:
Tpm2KeySource, AttestationKeySource, CloudKmsKeySource
(instance-identity flow), or OidcFederatedKeySource
(workload-federation flow). The manifest validator should refuse a
production-profile image that embeds a symmetric volume key or a
long-lived cloud credential.
Object storage (S3, GCS, Azure Blob)
Object storage is a natural backend for the capability-native
Store. The Store service holds an S3Bucket cap, serializes capnp
messages as S3 objects keyed by their content hash, and exports
Store / Namespace caps to clients.
Encryption trust tiers mirror block storage:
| Model | Provider sees plaintext? | Customer key? | Customer does crypto? |
|---|---|---|---|
| SSE-S3 | Yes | No | No |
| SSE-KMS | Yes | Yes (KMS) | No |
| SSE-C | Briefly | Yes | No |
| Client-side (Layer B in Store) | No | Yes | Yes |
Client-side is the interesting case for capOS. The content-addressed
Store can encrypt each blob with a per-tenant DEK before upload,
keying objects by hash(ciphertext) or HMAC(K, plaintext). The DEK
is wrapped by cloud KMS; the bucket can be world-readable without
leaking plaintext. This is a deployment where “the provider stores our
data” and “the provider cannot read our data” coexist.
Nonce management across objects becomes the main design question. Either:
- random 192-bit nonce per object (XChaCha), stored as an object header; or
- derived nonce from object identity (
HMAC(K_n, object_id)), requires that the same plaintext object is never uploaded twice under the same key, which is consistent with content-addressing semantics.
Backups
Backups are where encryption choices pay off or hurt:
- Block-level snapshot / cross-region replication. The provider handles it. A snapshot of a Layer-A-encrypted EBS volume is ciphertext; restoring requires the KMS key. Cross-region replication requires the key to be grant-accessible in the target region. Free; handled by the provider.
- Application-level backup service. A backup service holds a
StoreorDirectorycap, reads objects, writes them to an object-storage bucket, and records the backup manifest. If Layer B is in place, the backup bytes are already encrypted — no re-encryption needed, and the backup destination does not need the user’s key. If only Layer A is in place, the backup service sees plaintext because Layer A wraps below the Directory; the backup service must re-encrypt for the destination. - Restore to a different account / region / capOS install. The
key must be reachable in the target environment. For KMS-wrapped
DEKs: cross-account grants, multi-region KMS keys, or replicated
key material. For TPM-sealed DEKs: explicit re-seal to the target
TPM before restore. capOS does not need to implement this
directly; it needs the
KeySourceabstraction to not hide the provider-specific primitives that enable it.
A backup KeyPolicy worth documenting: “this key is usable in
regions A, B, and C, wrapped under KMS keys k_a, k_b, k_c, all
granting access to the instance identity role backup-reader.” This
is routine on AWS and routinely surprising to people who expect Linux
dm-crypt semantics.
Keys never in the image
The capOS ISO must never contain production keys. The
ManifestEmbeddedKeySource (key-management proposal) exists for
development and CI only; the manifest validator should refuse to boot
from an image that embeds a non-development key on a
production-profile manifest. The production flow is always: boot from
a key-less image, obtain identity from the cloud, fetch the wrapping
policy from the cloud, unwrap a DEK via KMS, mount the volume. Same
property as AWS’s “EBS with KMS requires no bootstrap secrets on the
instance.”
Confidential computing
SEV-SNP, TDX, and AWS Nitro Enclaves produce attestation reports that include measurements of the VM image. A KMS policy can require a matching attestation before releasing the wrapping key. In capOS:
AttestationServiceexposesattestation(nonce) -> report(the report includes the image measurement, firmware version, and VM metadata signed by the hardware root of trust).KeySourceof kindattestationcollects the report and submits it as part of the KMSDecryptrequest; KMS enforces the policy server-side.- The trust story becomes: “this capOS image, unmodified, running on genuine SEV-SNP / TDX / Nitro hardware, is the only thing that can unlock this volume.” That is materially stronger than instance-identity alone.
This composes cleanly with Layer A: the confidential VM reads ciphertext from a cloud disk, unwraps the DEK via attestation-gated KMS, and decrypts locally. The cloud provider never sees plaintext and a stolen snapshot cannot be decrypted outside the attested VM.
Phases
No implementation exists. Phases here cover only the volume-specific work; the underlying key abstractions, key sources, and KMS integration are phased in Cryptography and Key Management. Volume encryption tracks, but does not duplicate, that sequence.
Phase V1 — EncryptedBlockDevice over RAM block device
- Add
EncryptedBlockDeviceFactory,VolumeFormat,TagAreaLayout, andFormatParamstoschema/capos.capnp. - Wire the service between a RAM-backed
BlockDeviceand the Store or a toy FAT reader. Key source isManifestEmbeddedKeySourcefrom the key-management proposal’s Phase 1. - Implement AES-256-GCM-SIV with a reserved tag area; document the on-disk format (superblock, tag area layout, block size).
- Measurement: demonstrate a Store survives a ciphertext read of the raw RAM disk and fails decrypt after a flipped bit.
Phase V2 — EncryptedNamespace and user-volume path
- Add
EncryptedNamespaceFactoryschema. - Layer B over a RAM-backed Store. Depends on
PassphraseKeySource(key-management Phase 4) andPasskeyPrfKeySourceonce passkey infrastructure lands. - Revocation tests: dropping a session’s key cap renders the namespace unreadable without rebooting.
Phase V3 — Persistent storage integration
- Promote Phase V1 from RAM disk to virtio-blk.
- System volume unlock in the normal boot path. Default dev build uses a manifest-embedded key; production build requires passphrase/TPM/KMS.
- QEMU smoke: system volume encrypted with a passphrase, reboot survives, wrong passphrase fails closed.
Phase V4 — TPM-backed system volume
- Depends on
Tpm2KeySourcefrom key-management Phase 5. - Measured-boot chain: firmware, bootloader, kernel, init, key service. PCR composition for a sealed system volume documented.
Phase V5 — Cloud deployment
- Depends on
CloudKmsKeySourcefrom key-management Phase 6. - Client-side encrypted block volume over cloud block storage.
- Optional: client-side encrypted Store backend over object storage.
Phase V5b — OIDC-federated unlock
- Depends on
OidcFederatedKeySourcefrom key-management Phase 6b and onWorkloadIdentityFederationfrom OIDC and OAuth2 Phase 5. - System volume unlocks through token-exchange against the cloud STS; no long-lived IAM credentials in the image.
- Per-user
EncryptedNamespaceunlocks from a userAccessTokenunderSealPolicy.tokenExchange. - QEMU smoke against a local fake STS (e.g.
dex) proves the flow end-to-end before targeting a real cloud.
Phase V6 — Confidential computing
- Depends on
AttestationKeySourcefrom key-management Phase 7. - Attestation-gated system volume unlock on SEV-SNP / TDX / Nitro.
- QEMU SEV-SNP smoke (where toolchain supports it).
Relationship to Other Proposals
cryptography-and-key-management-proposal.md— primary dependency. Volume encryption consumes theSymmetricKey,KeySource,KeyVault,KeyAlgorithm,KeyPurpose, andSealPolicyprimitives defined there (see that proposal’s “Schemas” → “Symmetric keys” and “Key lifecycle — theKeyVault” sections) along with the concreteManifestEmbeddedKeySource,PassphraseKeySource,PasskeyPrfKeySource,Tpm2KeySource,CloudKmsKeySource,OidcFederatedKeySource, andAttestationKeySourceimplementations under “Concrete Key Sources”. This proposal adds only the volume-specific wrapper factories (EncryptedBlockDeviceFactory,EncryptedNamespaceFactory), the on-diskVolumeFormat/TagAreaLayout, and the block- and object-layer cryptographic constructions; it does not redefine any key, algorithm, purpose, seal-policy, or key-source shape.storage-and-naming-proposal.md— Open Question #5 (manifest trust and secure boot) is a prerequisite for a TPM-sealedKeySourceto be meaningful. This proposal extends the storage stack withEncryptedBlockDeviceandEncryptedNamespaceas optional wrapper services; theBlockDevice,File,Directory,Store, andNamespaceinterfaces are unchanged.boot-to-shell-proposal.md— the passphrase / passkey unlock path at the console and in the web gateway feedsKeySourceimplementations.CredentialStore,SessionManager, andAuthorityBrokeralready think about missing credentials not implying an unlocked system; this proposal extends that to “missing key source implies missing system volume, not zero-fill.”user-identity-and-policy-proposal.md— user-volume keys are bound to session identity. The cap chain that yields “you are Alice” also yields Alice’s KEK.cloud-metadata-proposal.md—CloudMetadataand theInstanceIdentitycap carved out of it are what the cloudKeySourceimplementations consume to authenticate to KMS without baked-in credentials.oidc-and-oauth2-proposal.md— theWorkloadIdentityFederationand token-exchange primitives behindOidcFederatedKeySource. Also the source of theAccessToken/IdTokencap shape used in per-user volume unlock and the policy inputs consumed bySealPolicy.tokenExchange.cloud-deployment-proposal.md— owns the cloud KMS reasoning this proposal builds on. Its “Managed Application Services” and “GCP Cloud KMS And IAM Notes For Adventure Saves” sections describe the envelope-encryption pattern (KMS holds the KEK; capOS generates the DEK and wraps it; KMSEncrypt/Decryptunwraps on demand) and the IAM/grant model thatCloudKmsKeySourceandOidcFederatedKeySourceplug into. Its NVMe phase (“Phase 5: NVMe Driver”) and SED/Opal notes set the ground for a futureSelfEncryptingBlockDevicecapability with hardware inline crypto, distinct from this proposal’s software-crypto Layer A and with a different TCB story (the device is in the TCB).security-and-verification-proposal.md— the encrypted block format is a good target for the tiered tooling plan: fuzz corrupted ciphertext at the block boundary, proptest round-trips through the wrapper, Loom-model the volume unlock state machine, Kani-prove LBA-nonce uniqueness invariants. General crypto-side invariants are tracked in the key-management proposal.system-monitoring-proposal.md— volume unlock, decrypt failure, and format-params events are audit-worthy. TheEncryptedBlockDeviceservice emits them through the audit cap. Generic key events are emitted by the key-management services.live-upgrade-proposal.md— replacing theEncryptedBlockDeviceservice must preserve in-flight I/O and the DEK. The service holds sensitive state (the key material); live upgrade needs a state-transfer path that does not touch the disk and does not leak the key through shared memory.../design-risks-register.md— the register currently carries no dedicated R-entry for volume encryption or encryption-at-rest; that is intentional, because no implementation exists yet. The closest tracked entry is Q11 (“Capability persistence model”), which already lists this proposal alongsidestorage-and-naming-proposal.mdas a tracker for the sealed/stored capability and key-material persistence path. Open a dedicated R-entry once Phase V1 lands a real on-disk format, since at that point the tag-area layout, LBA-nonce derivation, and revocation semantics become long-horizon design surfaces in their own right.
Open Questions
- Tag area layout. Sidecar journal (dm-integrity style, separate device or partition) vs. reserved footer per block group vs. derived-nonce-only-plus-separate-MAC-area. Affects write amplification, recovery, and fsync semantics. A small measurement study under QEMU would settle it.
- Key rotation at scale. Rewrap-only (KEK rotation) is cheap. Rekeying a DEK on a live volume means re-encrypting every block. Online rekey is a research problem; for capOS a controlled offline rekey service reading old-key and writing new-key is the honest first answer.
- Metadata leakage in Layer B. fscrypt-style filename encryption is fiddly (deterministic encryption to preserve directory lookups vs. randomized encryption that breaks them). Decide whether Layer B encrypts names as well as contents, and how lookups work if names are randomized.
- Backup re-encryption. A backup crossing trust boundaries needs either shared key material at both ends or an explicit re-encrypt step. Who does the re-encryption — the backup service, a dedicated re-encryption service, or a KMS-side primitive? Policy question, not a mechanism question, but worth documenting defaults.
- Hardware inline crypto as a separate capability. NVMe OPAL and
SED drives do not fit the software-AEAD model. Define
SelfEncryptingBlockDevicewith its ownopen/lock/unlockmethods and a separate trust story (the device is in the TCB). - Swap / paging. No swap yet. When added, encrypted swap with a per-boot ephemeral key is standard. The memory-pressure policy, page-eligibility rules, and swap lifecycle now live in OOM Handling and Swap.
- Firmware and boot-partition integrity. This proposal assumes
secure boot / measured boot is available when TPM-sealed keys are
in use. The actual secure-boot work is owned by
storage-and-naming-proposal.mdOpen Question #5 and is prerequisite, not in scope here.
Algorithm enum scope, side-channel hardening, post-quantum migration, GOST support, and audit granularity are answered in Cryptography and Key Management’s open-questions section rather than duplicated here.
Proposal: Cloud Instance Bootstrap
Picking up instance-specific configuration — SSH keys, hostname, network config, user-supplied payload — from cloud provider metadata sources, without porting the Canonical cloud-init stack.
Problem
A capOS ISO built once has to boot on any cloud VM and adapt to its environment: different instance IDs, different public IPs, different operator-supplied SSH keys, different user-data payloads. Without this, every instance needs a custom-baked ISO — and the content-addressed-boot story (“same hash boots identically on N machines”) devalues itself at the point where it would actually matter for operations.
The Linux convention is cloud-init: a Python daemon that reads
metadata from provider-specific sources and applies it by writing
files under /etc, invoking systemctl, creating users, and running
shell scripts. Porting it is a non-starter:
- Python, POSIX, systemd-dependent.
- Runs as root with ambient authority: parses untrusted user-data as shell scripts, mutates arbitrary system state.
- ~100k lines covering hundreds of rarely-used modules (chef, puppet, seed_random, phone_home).
- Assumes a package manager and init system that do not exist on capOS.
capOS needs the pattern — consume provider metadata, use it to bootstrap the instance — reshaped to the capability model.
Metadata Sources
All major clouds expose instance metadata through one or more of:
- HTTP IMDS.
169.254.169.254. AWS IMDSv2 requires aPUTtoken-exchange handshake; GCP and Azure accept directGET. Paths differ per provider. Needs a running network stack. - ConfigDrive. An ISO9660 filesystem attached as a block device,
containing
meta_data.json(or equivalent) and optional user-data file. OpenStack, older Azure. Needs a block driver and filesystem reader, no network. - SMBIOS / DMI. Vendor, product, serial-number, UUID fields populated by the hypervisor. Good for provider detection before networking comes up.
- NoCloud. Seed files baked into the image or on an attached FAT disk. Useful for development and bare-metal.
The bootstrap service should read from whichever source is present rather than hardcoding one. Provider detection via SMBIOS runs first (no dependencies), then the appropriate transport is initialized.
CloudMetadata Capability
A single capnp interface; one or more implementations:
interface CloudMetadata {
# Instance identity
instanceId @0 () -> (id :Text);
instanceType @1 () -> (type :Text);
hostname @2 () -> (name :Text);
region @3 () -> (region :Text);
# Network configuration (primary interface addresses, gateway, DNS)
networkConfig @4 () -> (config :NetworkConfig);
# Authentication material
sshKeys @5 () -> (keys :List(Text));
# User-supplied payload. Opaque to the metadata provider.
userData @6 () -> (data :Data, contentType :Text);
# Vendor-supplied payload. Separate from userData so the
# bootstrap policy can trust them differently.
vendorData @7 () -> (data :Data, contentType :Text);
}
struct NetworkConfig {
interfaces @0 :List(Interface);
struct Interface {
macAddress @0 :Text;
ipv4 @1 :List(IpAddress);
ipv6 @2 :List(IpAddress);
gateway @3 :Text;
dnsServers @4 :List(Text);
mtu @5 :UInt16;
}
}
Implementations:
HttpMetadata— fetches from169.254.169.254; one variant per provider because paths and auth handshakes differ (AWS IMDSv2 token, GCPMetadata-Flavor: Google, Azure API version).ConfigDriveMetadata— reads an ISO9660 seed disk.NoCloudMetadata— reads a seed blob from the initial manifest.
Detection lives in a small probe service that inspects SMBIOS
(System Manufacturer: Google, Amazon EC2, Microsoft Corporation,
…) and grants the cloud-bootstrap service the appropriate
CloudMetadata implementation as part of a manifest delta.
Bootstrap Service
A single service — cloud-bootstrap — runs once per boot:
cloud-bootstrap:
caps:
- metadata: CloudMetadata # from probe service
- manifest: ManifestUpdater # narrow authority to extend the graph
- network: NetworkConfigurator # apply interface addresses
- ssh_keys: KeyStore # target store for authorized keys
user_data_handlers:
- application/x-capos-manifest: ManifestDeltaHandler
# operator-installed handlers for other content types
Sequence:
- Gather identity and declarative config (
instanceId,hostname,networkConfig,sshKeys), apply through the narrow caps above. (data, ct) = metadata.userData()— dispatch by content type. If no handler is registered, log and skip.- Exit.
The service never holds ProcessSpawner directly. It holds
ManifestUpdater, a wrapper that accepts capnp-encoded
ManifestDelta messages and applies them through the existing init
spawn path. The decoder and apply path are shared with the build-time
pipeline (same capos-config crate, same spawn loop). The precise
shape of ManifestDelta is an open question — see “Open Questions”
below — but at minimum it covers hostname, network config, SSH keys,
and authorized application-level service additions:
struct ManifestDelta {
addServices @0 :List(ServiceEntry);
addBinaries @1 :List(NamedBlob);
setHostname @2 :Text;
setNetworkConfig @3 :NetworkConfig;
}
Relationship to the Build-Time Manifest Pipeline
The existing build-time pipeline (system.cue →
tools/mkmanifest → manifest.bin → Limine boot module →
capos-config decoder → init spawn loop) and the cloud-metadata
bootstrap path are not two parallel systems. They are the same
pipeline with different transports and different trust scopes.
See docs/proposals/system-configuration-proposal.md for the
authoring side — layered package capos CUE, the
cue/defaults/defaults.cue baseline, operator-supplied
system.local.cue overlays, @tag(user) host-user injection, and
the slice-4 mkmanifest cue-to-capnp host tool that turns
arbitrary schema-aware CUE into capnp bytes without expanding the
boot-manifest ABI. The same authoring tool, decoder, and merge
contract back the cloud path; the only delta on the cloud side is
who hands the capnp bytes to the parser and which cap applies them.
| Stage | Build-time (baked ISO) | Runtime (cloud metadata) |
|---|---|---|
| Authoring | system.cue in the repo | user-data.cue on the operator’s host |
| Compile | mkmanifest (CUE → capnp) | same tool, same output |
| Transport | Limine boot module | HTTP IMDS / ConfigDrive / NoCloud disk |
| Wire format | capnp-encoded SystemManifest | capnp-encoded ManifestDelta |
| Decoder | capos-config | capos-config |
| Apply | init spawn loop | same spawn loop, invoked via ManifestUpdater |
Three practical consequences:
- CUE is a host-side authoring convenience, not an on-wire format.
Neither kernel nor init evaluates CUE. An operator supplying
user-data writes
user-data.cue, runs `mkmanifest user-data.cueuser-data.bin
on their host, and ships the capnp bytes (base64 into–metadata [email protected]` for GCP/AWS, or as a file on a ConfigDrive ISO). - NoCloud is a Limine boot module by another name. A NoCloud
seed blob is the same bytes as a baked-in
manifest.bin, attached via a disk or bundled into the ISO instead of handed over by the bootloader. The only difference is who hands the bytes to the parser. - No new schema surface.
ManifestDeltais defined alongsideSystemManifestinschema/capos.capnp, and sharing the decoder meansManifestUpdater’s apply path is a thin merge-and-spawn on top of code that already boots the base system.
The trust model stays clean precisely because ManifestDelta is
not SystemManifest. The base manifest is inside the
content-addressed ISO hash (fully trusted, reproducible). The
runtime delta is applied by a narrowly-permitted service whose caps
define what fields of the delta can actually take effect — the
content-addressed-boot story is preserved because cloud metadata
augments the base graph, it cannot replace it.
User-Data Model
User-data on the wire is a capnp blob, not a shell script. Content
type application/x-capos-manifest identifies the canonical case:
the payload is a ManifestDelta message produced by mkmanifest
on the operator’s host and consumed directly by the bootstrap
service.
For cross-cloud-vendor compatibility, operators can install user-data dispatcher services for other content types (YAML, other capnp schemas, signed manifests, etc.). The bootstrap service holds a handler cap per content type; unknown types are logged and ignored, not executed.
Shell-script user-data — the Linux default — has nowhere to run on
capOS because there is no shell and no ambient-authority process to
execute it under. An operator who insists on this can install a
shell service and a handler that routes text/x-shellscript to it,
but that is a deliberate choice, not a default fallback.
Trust Model
The capability angle earns its keep here.
- The metadata endpoint is assumed as trustworthy as the hypervisor running the VM — the same assumption Linux cloud-init makes.
- The bootstrap service holds narrow caps (
ManifestUpdater,NetworkConfigurator,KeyStore), not ambient root. A bug or a malicious metadata response can at most spawn services theManifestUpdateraccepts, set network config theNetworkConfiguratoraccepts, and drop keys into theKeyStore. It cannot reach for arbitrary system state. vendorDataanduserDataare separated on the wire. A policy that trusts the cloud provider but not the operator (e.g., applyvendorDataas-is, routeuserDatathrough a signature check) is expressible by granting different handler caps to each.- User-data content-type dispatch is capability-mediated: the bootstrap service cannot execute a content type it wasn’t given a handler for. There is no fallback “try to run it as shell.”
Phased Implementation
Most of the manifest-handling machinery already exists from the
build-time pipeline (capos-config, mkmanifest, init’s spawn
loop). The new work is transports, provider detection, and the
ManifestDelta merge semantics. The transport and platform
prerequisites — SMBIOS decode beyond the bounded diagnostics
snapshot, ISO9660/block stack for ConfigDrive, userspace
networking for HTTP IMDS, and cloud-vendor disk-image bring-up —
all land through docs/proposals/cloud-deployment-proposal.md,
which already owns the imported-image boot proof and the
userspace-driver authority gate this proposal depends on.
ManifestDeltaschema andManifestUpdatercap. Add the delta type toschema/capos.capnpalongsideSystemManifest, extendcapos-configwith a merge routine (SystemManifest + ManifestDelta → new services to spawn), and exposeManifestUpdateras a cap in init.NoCloudMetadataseeded from a test fixture is enough to demo the apply path end-to-end without any cloud dependency.- Provider detection via SMBIOS. Kernel-side primitive or capability that reads SMBIOS DMI tables and exposes manufacturer / product strings. No network required.
- ConfigDrive support. ISO9660 reader plus
ConfigDriveMetadata. Gives a working real-transport metadata source with no dependency on userspace networking. QEMU can attach one via-drive file=configdrive.iso,if=virtiofor local testing. - HttpMetadata per provider. Requires the userspace network stack (Stage 6+). GCP first (simplest auth), then AWS (IMDSv2 token flow), then Azure.
- Cross-provider Cloud Metadata demo. Same ISO hash boots under
QEMU, GCP, AWS, and Azure; the only difference is the SMBIOS
manufacturer string, which the probe service uses to pick the
right
HttpMetadatavariant. This is the Cloud Metadata observable milestone.
Open Questions
Which fields of system.cue are runtime-modifiable?
system.cue today is a handful of service entries with kernel Console cap
grants encoded as structured source variants. That will grow. Plausible additions as capOS
matures: driver process definitions (virtio-net, virtio-blk, NVMe) with
device MMIO, interrupt, and frame allocator grants; scheduler tuning
(priority, budget, CPU pinning); filesystem driver services; memory-policy
hooks; ACPI/SMBIOS consumers.
Most of those are either fragile (kernel-adjacent; a bad value bricks
the instance), sensitive (granting kernel:frame_allocator to a
user-data-declared service is effectively root), or both. A
ManifestDelta with full SystemManifest equivalence hands every
such knob to whoever controls user-data.
The narrowing has to happen somewhere, but there are several places it could live:
- Different schema.
ManifestDeltais not structurally a subset ofSystemManifest— it omits driver entries, scheduler config, and kernel cap sources entirely. Schema-level guarantee; rigid but unambiguous. - Shared schema, policy-narrowing cap.
ManifestUpdateraccepts a full delta but validates at apply time: kernel source variants are rejected unless explicitly allow-listed by the cap’s parameters; additions that touch driver-level service entries fail. Flexible, but the narrowing logic is code that has to be audited, not a schema that is self-documenting. - Tiered deltas.
PrivilegedDelta(drivers, scheduler) andApplicationDelta(hostname, SSH keys, app services), minted by different caps. An operator supervisor holdsPrivilegedManifestUpdater;cloud-bootstrapholds onlyApplicationManifestUpdater. Compositional; matches the capability-model grain but doubles the schema surface. - Tag-based field permissions. Fields in
ServiceEntrycarry a privilege tag;ManifestUpdateris parameterized with a permitted-tag set. One schema, orthogonal policy.
Picking one prematurely would either over-constrain the cloud path
(option 1 before we know what apps legitimately need) or
under-constrain it (option 2 without clarity on what to check
against). This proposal commits only to the shared pipeline
(decoder, spawn loop, authoring tool). The shape of the public
type(s) the cap accepts is deferred until system.cue has grown
enough that the privileged vs. application split is visible in
concrete form.
Related open question: whether kernel cap sources should be expressible in
system.cue at all, or whether the build-time manifest should also declare
them through a narrower mechanism so that the same discipline that protects
cloud user-data also protects the baked-in manifest from accidental
over-grants. If they remain expressible, they should be structured enum/union
variants, not free-form strings; the associated interface TYPE_ID is only a
schema compatibility check and does not identify the authority being granted.
Non-Goals
- cloud-init compatibility. No parsing of
#cloud-configYAML, no#!/bin/bashexecution, noinclude-url, no MIME multipart handling. Operators who need these install their own dispatcher services; the base system does not. - Runtime package installation. The capOS equivalent of “install nginx on boot” is “include nginx in the manifest.” User-data can add services to the manifest; it cannot install packages (there is no package manager to install into).
- Re-running on every boot. cloud-init distinguishes
per-boot,per-instance, andper-oncemodules. The capOS bootstrap service runs once per boot; the manifest it produces is cached under the instance ID, and subsequent boots read the cache and skip the metadata round-trip. A full mode matrix is future work. - IPv6-only bring-up in the first iteration. Many clouds expose both; the schema supports both; the first implementations do whichever is easier per provider (typically IPv4).
- Automatic secret rotation. Metadata often exposes short-lived credentials (IAM role tokens on AWS, service-account tokens on GCP). Refresh logic belongs to the service that consumes the credential, not to cloud-bootstrap.
Related capOS Proposals
docs/proposals/cloud-deployment-proposal.mdowns the hardware/disk/network surface this proposal sits on: PCIe config-space access, MSI/MSI-X, ACPI/SMBIOS, virtio-net/virtio-blk, cloud-vendor disk-image bring-up, and the userspace-driver authority gate (DeviceMmio,DMAPool,Interrupt,HardwareAuditLog). The probe service’s SMBIOS read, the ConfigDrive block path, and the HTTP IMDS network path all wait on primitives tracked there.docs/proposals/system-configuration-proposal.mdowns the authoring side:package caposlayering,cue/defaults/defaults.cuebaseline,system.local.cueoverlay, host-user@tag(user)injection, the per-user~/.capos-toolscache, and the slice-4mkmanifest cue-to-capnphost tool. The cloud bootstrap service reuses the same decoder, the same merge contract, and the same authoring conventions; only the transport and the apply cap differ.docs/proposals/service-architecture-proposal.mddefines the init spawn loop andProcessSpawnerboundary theManifestUpdatercap narrows.docs/proposals/cryptography-and-key-management-proposal.mdanddocs/proposals/certificates-and-tls-proposal.mdown the trust anchors that signed-manifest user-data handlers will need once they exist.
Related External Work
- cloud-init (Canonical). The Linux reference. Huge scope, shell-script-centric, assumes root and POSIX. The capOS design intentionally takes the pattern and drops everything that depends on ambient authority.
- ignition (CoreOS/Flatcar). Runs once in initramfs, consumes a JSON spec, fails-fast if the spec can’t be applied. Closer in spirit to the capOS design — small, single-pass, declarative. Worth studying for its rollback and error-handling approach.
- AWS IMDSv2. The token-exchange handshake is the one thing the
HTTP client needs to handle that is not plain
GETs. Designing theHttpMetadatainterface without accounting for it up front leads to a rewrite later.
Proposal: Hardware Abstraction and Cloud Deployment
How capOS goes from “boots in QEMU” to “boots on a real cloud VM” (GCP, AWS, Azure). This covers the hardware abstraction infrastructure missing between the current QEMU-only kernel and real x86_64 hardware, plus the build system changes needed to produce deployable images.
Depends on: Kernel Networking Smoke Test (for PCI enumeration), Stage 5 (for timer history), Stage 7 / SMP proposal Phase C (for LAPIC timer and IPI).
Complements: Networking proposal (extends virtio-net toward cloud NICs), Storage proposal (extends local block-device work toward virtio-scsi and NVMe), SMP proposal (LAPIC timer/IPI infrastructure shared, with x2APIC tracked as a later backend).
Current State
The kernel boots via Limine UEFI, outputs to COM1 serial, has QEMU legacy PCI
enumeration for the virtio-net smoke path, and has LAPIC timer/IPI groundwork
from the SMP track. It also has an initial bounded, read-only ACPI diagnostic
parser for Limine RSDP, RSDT/XSDT table inventory, MADT summaries, and MCFG
presence/allocation summaries, plus a Q35 smoke that proves the reusable PCI
config backend can enumerate a capped PCIe ECAM function inventory from MCFG.
The x86 path exports bounded MADT I/O APIC/source-override records, maps the
I/O APIC, and programs masked legacy IRQ routes to LAPIC vectors while honoring
source overrides. PCI drivers can validate and map memory BAR subregions through
a shared kernel helper; the virtio-net modern transport uses that helper for its
common, notify, ISR, and device configuration regions. The PCI capability walk
also reports MSI/MSI-X metadata for the virtio-net function, and the QEMU net
smoke uses that metadata for a bounded kernel-owned virtio-net MSI-X
dispatch/unmask and lifecycle proof through the device MSI vector pool; the
remaining run-net fixture also covers queue setup, descriptor guards, ARP, and
ICMP. Device-autonomous virtio-net MSI-X delivery is covered by the dedicated
userspace-provider gates after the kernel L4 owner is retired.
The cloudboot image/harness slice landed in commit 02635421
(2026-05-05 06:51 UTC):
make capos-cloudboot-image builds the importable raw disk tarball and
make cloudboot-test drives the GCE upload/import/temporary-instance/serial-log
loop with teardown. The first GCP imported-image serial-console boot proof is
run 1778230874-715a (2026-05-08 09:06 UTC) against source commit
3951e275 (2026-05-08 08:50 UTC), reaching the capos kernel starting
serial landmark on a temporary no-public-IP, no-service-account/scopes
e2-small instance before teardown.
It still lacks public L4/SSH/WebShell ingress, AWS/Azure boot proofs and provider drivers, broader storage variants, high-throughput/multiqueue NIC readiness, direct-remapping DMA, production cloud-image release paths, and a cloud-ready clocksource/clockevent closeout. The GCP-first provider rollup has live serial-console operator access, selected NIC raw-frame reachability, selected NVMe Persistent Disk I/O, and gVNIC portability evidence.
The GCP-first usable cloud-instance provider rollup is closed by
docs/tasks/done/2026-06-07/cloud-usable-instance-provider-nic-storage.md.
Do not cite the cloudboot harness or the first GCP serial-console boot alone as
evidence for provider NIC/storage readiness; the closeout depends on separate
live NIC, storage, operator-access, and gVNIC evidence records. AWS/Azure,
public ingress, and production cloud-image release gates remain separate.
Trusted Build Inputs And Reproducibility Cross-Links
Cloud deployment depends on the same trusted-build-inputs inventory that
covers local builds. The consolidated supply-chain risk view – floating Rust
nightly, observed-not-pinned xorriso / qemu-system-x86_64 / OVMF, CI
publication and comparison of build-provenance records, and pinned production
runner identity – is tracked as R13 in docs/design-risks-register.md; the
detailed inventory, dependency policy, vendored-snapshot table, and the
build-provenance retention/comparison policy live in
docs/trusted-build-inputs.md. This proposal is recorded as a secondary
owner of R13 because cloud-image release paths and provider-driver bring-up
both depend on those reproducibility gates.
The implication for cloud bring-up is concrete: imported cloud images must
travel with the corresponding make build-provenance record (source commit,
toolchain identity, embedded-binary hashes, OVMF identity or explicit
absence) before any provider serial-console run is cited as production
evidence. Until the R13 gates close, cloud images remain local/CI proof
artifacts rather than third-party reproducible boot images.
What Cloud VMs Provide
GCP (n2-standard), AWS (m6i/c7i), and Azure (Dv5) all expose:
| Resource | Cloud interface | capOS status |
|---|---|---|
| Boot firmware | UEFI (all three) | Limine UEFI works |
| Serial console | COM1 0x3F8 | Works (serial.rs) |
| Boot media | Hybrid BIOS+UEFI raw disk image, packaged per provider import rules | Partial (make capos-cloudboot-image builds a GCE-importable raw disk tarball; production release packaging and non-GCP provider packaging remain future) |
| Storage | virtio-scsi or NVMe (GCP Persistent Disk), NVMe/EBS (AWS Nitro), managed disks | Partial (GCP NVMe Persistent Disk brokered READ proof landed; GCP virtio-scsi, Local SSD, AWS/Azure storage, and broader filesystem-backed cloud storage remain future) |
| NIC | virtio-net or gVNIC (GCP), ENA (AWS), MANA (Azure) | Partial (GCP legacy virtio-net raw-frame provider-nic-bound and gVNIC raw-frame / typed-Nic proofs landed; public ingress, high-throughput/multiqueue, ENA, and MANA remain future) |
| Virtio NIC | QEMU, GCP where selectable, some bare-metal | Partial (QEMU smoke; reusable/cloud path planned) |
| Timer | LAPIC timer, TSC, HPET | Partial (LAPIC timer groundwork; cloud clocksource work missing) |
| Interrupt delivery | I/O APIC, MSI/MSI-X | Partial (masked MADT-backed I/O APIC routes, MSI/MSI-X capability metadata, and bounded kernel-owned virtio-net MSI-X dispatch/lifecycle proof; I/O APIC ownership and userspace interrupt authority missing) |
| Device discovery | ACPI + PCI/PCIe | Partial (QEMU legacy PCI smoke, bounded ACPI diagnostics/routing state, reusable legacy/ECAM PCI config access, kernel BAR/MMIO validation, MSI/MSI-X metadata discovery, and bounded virtio-net MSI-X dispatch proof; broader driver authority still missing) |
| Display | None (headless) | N/A |
Cloud NIC And Storage Portability Notes
The Device Driver Foundation is not complete just because QEMU virtio-net
works. Cloud bring-up has provider-specific NIC and storage surfaces, and the
first implementation slices must keep those differences visible while still
deferring the actual provider drivers.
| Provider path | Expected device surface | capOS dependency | Current state |
|---|---|---|---|
| QEMU / constrained GCP virtio-net | Virtio PCI transport, virtqueues, MSI-X where available | Shared virtio transport helpers, DMAPool, DeviceMmio, Interrupt, and queue lifecycle proofs | QEMU virtio-net proofs and the live GCE legacy virtio-net raw-frame provider-nic-bound proof landed. This does not claim public L4 ingress, high-throughput/multiqueue readiness, or device-autonomous MSI-X completion delivery |
| GCP gVNIC | gVNIC as the modern Compute Engine NIC, replacing virtio-net on newer machine generations and required for some features | PCI BAR/MMIO binding, MSI-X routing, per-queue ring setup, image metadata declaring GVNIC, and fallback choice between virtio-net and gVNIC by machine family | Grounding plus bounded live proofs landed: the GCE gVNIC provenance map records the spec basis and authority mapping, the GCE harness can request GVNIC image/instance posture and inventory the 1ae0:0042 PCI function, the admin-queue/register proof maps BAR0 and issues one DESCRIBE_DEVICE, the raw-frame proof configures one GQI/QPL TX/RX queue pair, and the typed Nic adaptation proof exercises inline-frame Nic.transmit / Nic.receive over live gVNIC. No QEMU gVNIC model exists. This remains a separate GCE portability lane, not a blocker for the first public Web UI proof on a virtio-compatible machine type |
| AWS Nitro ENA + EBS | ENA enhanced networking plus Nitro NVMe storage | ENA queue/MSI-X driver, NVMe controller/storage path, IOMMU or bounce-buffer policy, and image import with ENA/NVMe expectations | Planned; no ENA, NVMe EBS, or AWS boot proof |
| Azure Accelerated Networking | Accelerated Networking exposes SR-IOV hardware families, with MANA as the newer Azure NIC and Mellanox mlx4/mlx5 still relevant on some hosts | Synthetic-interface fallback awareness, VF binding/revocation handling, MANA/Mellanox driver binding, MSI-X routing, and reset/revoke paths that survive VF removal | Planned; no MANA, Mellanox VF, or Azure boot proof |
These rows are planning gates, not implementation evidence. Each provider NIC
has its own queue layout, feature negotiation, MSI-X/vector conventions, reset
behavior, and driver-binding rules. Azure’s accelerated-networking path also
requires the OS and applications to tolerate dynamic SR-IOV VF revocation by
falling back to the synthetic network interface. Provider storage follows the
same rule: AWS Nitro uses NVMe for EBS, GCP can require NVMe on newer or
Confidential VM paths while retaining virtio-scsi on older paths, and Azure
uses SCSI on many older families while Azure Boost and newer NVMe-capable VM
families expose managed disks through NVMe. The shared foundation therefore
needs ACPI/PCIe discovery, BAR validation, interrupt ownership, DMAPool
accounting, IOMMU/bounce-buffer policy, and lifecycle teardown before any cloud
NIC or storage driver is treated as portable.
What Already Works
- UEFI boot – Limine ISO includes
BOOTX64.EFI. The boot path itself is cloud-compatible. - Serial output – all three clouds expose COM1.
gcloud compute instances get-serial-port-output,aws ec2 get-console-output, and Azure serial console all read from it. - x86_64 long mode – cloud VMs are KVM-based x86_64. Architecture matches.
Managed Application Services
Booting capOS on a cloud VM and using managed cloud services are separate tracks. The VM path proves hardware, disk, network, and serial behavior. Managed services can be useful earlier for application persistence, especially game profile/world state, as long as they sit behind narrow capOS service capabilities.
For a GCP-backed adventure persistence bridge:
- Cloud Run hosts a small bridge endpoint. It translates capOS save/load/append requests into provider calls and enforces request bounds before touching cloud APIs.
- Cloud KMS owns the key-encrypting keys (KEKs) for each game-world instance or shard. The bridge or game-world service gets narrow authority to wrap or unwrap data-encrypting keys (DEKs) through Cloud KMS envelope encryption. Ordinary browser clients do not receive DEKs, game-world key capabilities, KMS decrypt/unwrap grants, or provider-independent plaintext authority; provider storage objects contain ciphertext, wrapped DEKs, and metadata only.
- Firestore Native mode stores mutable profile summaries, indexes, and compare-and-set version records.
- Cloud Storage stores larger immutable snapshots, evidence blobs, exports, and content-addressed records. Object versioning and lifecycle policy are required before using it for durable game data.
- Secret Manager stores bridge-side provider credentials and rotation material. Those secrets are never granted to ordinary capOS game clients.
This does not change the storage proposal’s rule: persistence is still
application-level serialization of bounded Cap’n Proto records. The cloud bridge
is just one backing implementation for Store, Namespace, or an
app-specific AdventureSaveStore/CloudGameStore capability. Local fake-cloud
tests must enforce stale-write rejection, wrong-profile rejection, append-only
ledger behavior, and size bounds before a real GCP deployment is trusted.
A separate browser-mediated path can serve user-owned private backups. In that
model, the browser or web terminal host authenticates the user to Google, stores
encrypted save capsules in Drive appDataFolder or Firebase user documents, and
returns only opaque provider handles and encrypted capsule bytes through
explicit restore flows. DEK unwrap and plaintext validation happen in the local
capOS key domain or in the game-world service with KMS/IAM authority, not in
browser JavaScript.
This is appropriate for user profile backup, private expedition checkpoints,
and settings sync. It is not appropriate for authoritative public world state,
reward witness records, market receipts, or multiplayer outcomes. The user’s
browser holds provider tokens; capOS game services do not. For GCP-backed game
worlds, the browser transports envelope-encrypted capsules with wrapped DEKs but
does not hold game-world key capabilities, KMS decrypt/unwrap grants, DEKs, or
plaintext authority.
Firebase user-document capsule paths must make the auth binding visible in the
path template, not just in policy metadata. Use a narrow shape such as
users/{request.auth.uid}/saveCapsules/{capsule_id} so Firestore rules can
bind the user wildcard to request.auth.uid; literal profile names such as
users/alice/... are not accepted by the capOS policy model. Firestore rules
remain access control for opaque encrypted capsules only. They must not be
treated as validation for decrypted adventure semantics, and path segments must
respect Firestore ID constraints such as no ., no .., no __.*__, and the
1,500-byte collection/document ID limit.
GCP Cloud KMS And IAM Notes For Adventure Saves
GCP-backed adventure save capsules follow the same envelope-encryption model as
CloudKmsKeySource and the volume-encryption proposal: Cloud KMS holds a
key-encrypting key (KEK), the game-world service owns the capsule
data-encrypting key (DEK), and KMS Encrypt/Decrypt wraps or unwraps that
DEK rather than bulk-encrypting capsule bytes. Provision one Cloud KMS key ring
and one symmetric CryptoKey KEK per game-world instance or shard. The key ring
is an administrative grouping boundary; ordinary runtime authority should be
granted on the CryptoKey resource where possible, not at the project or key-ring
level. Do not claim key-version-scoped IAM as a design primitive for this path:
predefined Cloud KMS crypto roles have CryptoKey as their lowest grantable
resource.
Service accounts are split by operation:
- Writers that only create new ciphertext receive
roles/cloudkms.cryptoKeyEncrypteron the configured game-world CryptoKey so they can wrap a freshly generated DEK. - Restore, validation, and migration workers that must read protected capsules
receive
roles/cloudkms.cryptoKeyDecrypteron that CryptoKey so they can unwrap an existing DEK. - The narrow game-world service account receives
roles/cloudkms.cryptoKeyEncrypterDecrypteronly when the same service must both wrap and unwrap DEKs. Avoidroles/cloudkms.cryptoOperator, project-wide grants, owner/editor roles, browser OAuth identities, and service-agent roles for ordinary adventure runtime access.
The browser-vault boundary does not change. Browser JavaScript may carry
ciphertext, wrapped DEKs, capsule metadata, and opaque Drive/Firebase provider
handles. It must not receive plaintext DEKs, capOS SymmetricKey or
KeySource capabilities, Cloud KMS decrypt/unwrap grants, service account
credentials, or provider-independent plaintext. The game-world service may use
the unwrapped DEK internally as service authority, modeled as a SymmetricKey
capability, but that authority does not cross into browser JavaScript.
Possession of a Drive file id or Firebase document path is only transport
authority over opaque encrypted bytes.
Rotation creates a new primary KEK version for future DEK wrapping. It does not re-encrypt existing capsules, rewrite wrapped DEK blobs, or disable/destroy old key versions automatically. Capsule re-encryption or rewrapping is a managed game-world service operation: unwrap the old DEK while its KEK version remains enabled and authorized, decrypt and validate the capsule inside the service, then write a new capsule using a new DEK or a DEK rewrapped by the current primary KEK version. The service verifies content hashes and ledger/profile bindings before replacing capsule metadata. Old KEK versions should only be disabled or scheduled for destruction after inventory proves no accepted wrapped DEK still depends on them.
Retiring a game-world first removes IAM decrypt authority from the world service and migration workers. If the retirement is meant to make existing capsules inaccessible, disable the relevant key versions and record the expected outage and recovery procedure before doing it. Destruction is delayed by Cloud KMS’ scheduled destruction period and is irreversible once completed, so destroy key versions only after audit retention, export, and break-glass recovery decisions are recorded. Disabling or destroying a key version can make all capsules that depend on it unreadable; this is a revocation tool, not cleanup.
Phase 1: Bootable Disk Image And Serial Diagnostics
Goal: Produce a raw hybrid BIOS+UEFI disk image that can boot locally and can be packaged for cloud import, alongside the existing ISO for QEMU. The first cloud-visible proof is serial-console boot to init/diagnostics, not network shell access.
The Problem
Cloud VMs boot from disk images, not ISOs. Each cloud has provider-specific format and boot-mode rules:
| Cloud | Image format | Import method |
|---|---|---|
| GCP | disk.raw in gzip .tar.gz using old GNU tar; raw size in 1 GiB increments | gcloud compute images create --source-uri=gs://... |
| AWS | raw, VMDK, VHD/VHDX, or OVA | aws ec2 import-image with explicit boot-mode notes |
| Azure | VHD (fixed size) | az image create --source |
GCP’s manual import path documents a functional MBR partition table or a
hybrid GPT+MBR bootloader configuration for imported boot disks, plus ACPI
support. AWS VM Import/Export supports both UEFI and legacy BIOS boot modes,
but UEFI imports need a fallback EFI binary at /EFI/BOOT/BOOTX64.EFI; Nitro
instances generally expect NVMe storage and ENA networking for useful
operation. Therefore the first capOS image target should be a hybrid
BIOS+UEFI raw disk: an ESP for UEFI fallback boot and a BIOS/MBR-compatible
Limine path for import paths that still validate MBR bootability.
Disk Layout
Hybrid raw disk image (1 GiB-aligned for cloud packaging)
Protective/hybrid MBR + GPT
Partition 1: EFI System Partition (FAT32, ~32 MB)
/EFI/BOOT/BOOTX64.EFI (Limine UEFI loader)
/limine.conf (bootloader config)
/boot/kernel (capOS kernel ELF)
/boot/init (init process ELF)
Partition 2: (reserved for future use -- persistent store backing)
Build Tooling
New Makefile target make image using standard tools:
IMAGE := capos.img
IMAGE_SIZE := 1024 # MB, keeps GCP raw image packaging simple
image: kernel init $(LIMINE_DIR)
# Create raw disk image
dd if=/dev/zero of=$(IMAGE) bs=1M count=$(IMAGE_SIZE)
# Partition with GPT + ESP; keep room for hybrid/MBR boot metadata.
sgdisk -n 1:2048:+32M -t 1:ef00 $(IMAGE)
# Format ESP as FAT32, copy files
# (mtools or loop mount + mkfs.fat)
mformat -i $(IMAGE)@@1M -F -T 65536 ::
mcopy -i $(IMAGE)@@1M $(LIMINE_DIR)/BOOTX64.EFI ::/EFI/BOOT/
mcopy -i $(IMAGE)@@1M limine.conf ::/
mcopy -i $(IMAGE)@@1M $(KERNEL) ::/boot/kernel
mcopy -i $(IMAGE)@@1M $(INIT) ::/boot/init
# Install Limine BIOS path as well as UEFI fallback files.
$(LIMINE_DIR)/limine bios-install $(IMAGE)
New QEMU target to test disk boot locally:
run-disk: $(IMAGE)
qemu-system-x86_64 -drive file=$(IMAGE),format=raw \
-bios /usr/share/edk2/x64/OVMF.4m.fd \
-display none $(QEMU_COMMON); \
test $$? -eq 1
Cloud upload helpers (scripts, not Makefile targets):
# GCP
cp capos.img disk.raw
tar --format=oldgnu -Sczf capos.tar.gz disk.raw
gcloud storage cp capos.tar.gz gs://my-bucket/
gcloud compute images create capos --source-uri=gs://my-bucket/capos.tar.gz
# AWS
aws ec2 import-image --disk-containers \
"Format=raw,UserBucket={S3Bucket=my-bucket,S3Key=capos.img}" \
--boot-mode uefi
Serial diagnostics are part of Phase 1 rather than a later convenience. The cloud bring-up loop should be:
make run-diskproves the hybrid image under local QEMU/OVMF.- a local BIOS-mode disk run proves the MBR/Limine path if provider import requires it;
- a serial diagnostics prompt is reachable on COM1 in QEMU;
- GCP/AWS imported instances reach the same prompt through provider serial console output.
The serial diagnostics prompt should expose bounded read-only commands for
status, cpu, mem, acpi, pci, irq, timers, devices, and logs,
plus reboot/halt. It is the early remote debugging path for cloud driver
bring-up before NICs or disks are reliable. It should not be required to upload
large binaries, replace kernels in place, or stream high-volume tracing through
cloud serial consoles.
Dependencies
sgdisk(gdisk package) – GPT partitioningmtools(mformat, mcopy) – FAT32 manipulation without root/loop mount
Scope
Makefile/helper script work for the image plus a narrow diagnostics-mode surface. Kernel changes are limited to serial diagnostics and any boot path adjustments needed for disk images; network and block drivers remain later phases.
Phase 0 closeout: GCE harness landed (2026-05-05 06:51 UTC)
Commit 02635421 (2026-05-05 06:51 UTC) records this harness closeout.
The first build-and-boot leg of Phase 1 landed as the cloud-boot harness.
make capos-cloudboot-image produces a 10 GiB GPT-partitioned target/disk.raw
with a 128 MiB FAT32 EFI System Partition holding the Limine UEFI loader,
limine.conf, the kernel ELF, and the manifest, plus the Limine BIOS stage 2
embedded in the GPT for legacy SeaBIOS boot. The disk is repackaged as
target/capos-disk.tar.gz using tar --format=oldgnu -czf, the exact form
GCE’s manual import path expects. Disk size is enforced as an exact multiple
of 1 GiB.
tools/cloudboot/run-test.sh (also wired as make cloudboot-test) drives the
end-to-end loop on a sandbox GCE project: an idempotent orphan sweep on a
configured project-pinned label, a staging tarball upload, image creation,
instance creation with no public IP, no service account, no API scopes, the
same project-pinned label set, and the configured sandbox subnet, then
serial-port polling for the capos kernel starting landmark with a hard
wall-clock budget. Serial output is captured under
target/cloudboot-evidence/run-<id>/serial.log BEFORE teardown, and a bash
trap on EXIT INT TERM always deletes the instance, image, and staged
tarball even on signal or partial failure. The harness hard-fails if the
active project name does not match the configured sandbox.
Sandbox project name, subnet, staging bucket, and the IAM custom roles the
harness assumes are operational details that depend on the host environment;
they belong in tools/cloudboot/README.md and operator-local configuration,
not in this proposal.
This is the harness only. The recurring portability gate that records cloud
boot evidence on every reviewed cloud-relevant change remains open as
docs/backlog/hardware-boot-storage.md Task 6, and the userspace driver
authority gate remains open under DDF Task 5.
First GCP serial-console boot proof (2026-05-08 09:06 UTC)
The first imported-image GCP serial-console proof reached
capos kernel starting as run 1778230874-715a at 2026-05-08 09:06 UTC,
against source commit 3951e275 from 2026-05-08 08:50 UTC. The run used
the cloudboot harness to import the staged disk image, create a temporary
e2-small instance with no public IP and no service account/scopes, poll
serial output for the kernel-start landmark, save the serial log under the
run evidence directory, and tear down the temporary instance/image/staging
objects.
This proves imported-image firmware/bootloader/kernel serial reachability on one GCP sandbox run only. It does not prove a usable cloud instance, provider NIC or storage drivers, cloud clocking, persistence, SSH/network shell access, AWS/Azure import, or production cloud readiness.
Private Web UI Reachability Evidence Contract
The first self-hosted Web UI provider proof is private GCE reachability, not
operator browser exposure. The behavior task
cloud-gce-private-self-hosted-webui-proof
extends tools/cloudboot/run-test.sh with --require-web-ui-proof only after
the local Web UI L4 proof, DHCP/IPv4 configuration, and Web UI hardening tasks
are closed. This proposal defines the evidence contract for that later behavior
slice; it does not authorize a billable GCE run, a public endpoint, broad
firewall changes, TLS certificate provisioning, service-account broadening, or a
production release.
The proof must keep the current cloudboot posture unless the behavior task is
explicitly amended: no public IP on the capOS VM, no service account, no API
scopes, no public firewall rule, and teardown through the existing orphan-sweep
and EXIT INT TERM trap discipline. The reachability probe must cross the live
GCE virtual network boundary. Acceptable shapes include a same-VPC probe
instance, a provider-supported internal probe path, or another reviewed private
path that sends packets through the capOS VM’s GCE NIC and private endpoint.
Evidence classes stay separate:
| Evidence class | What it can prove | What it cannot prove |
|---|---|---|
| Cloudboot-only | The image imports, boots, emits serial markers, and tears down provider resources | Web UI reachability over the provider network |
| Provider-private | A private probe reaches remote-session-web-ui through the live GCE NIC and Phase C L4 path | Public operator access, TLS readiness, DNS readiness, or browser production posture |
| Operator-exposure | A separately authorized public or browser-mediated path reaches the Web UI under the selected ingress policy | The private proof by itself; it must depend on the private proof instead |
The private Web UI proof records, before teardown, at least:
| Field | Requirement |
|---|---|
| Run identity | Cloudboot run id plus source commit or image provenance used for the imported image |
| Machine shape | GCE machine family/type, NIC selection posture, and zone |
| Private posture | public_ip=false or equivalent, service-account/scopes posture, and no public firewall rule |
| Private endpoint | Internal IP or provider-private endpoint, UI port, and probe source identity |
| Probe path | Same-VPC probe, provider-supported internal probe, or other reviewed private path that crosses the GCE virtual network boundary |
| Web UI marker | A run-unique Web UI response marker, header, or body token observed by the private probe |
| Phase C L4 marker | The remote-session-web-ui Phase C L4 evidence marker, such as cloudboot-evidence: remote-session-web-ui-l4 <token>, tied to the same source commit/image |
| Private proof marker | A final structured marker, such as cloudboot-evidence: gce-private-self-hosted-webui <token>, emitted only after the private probe succeeds |
| Teardown | Instance, image, staged object, probe resources, and any private firewall or route resources created by the run were deleted or reported as a failed run |
Private Proof Runbook Checklist
The future --require-web-ui-proof harness gate closes provider-private Web UI
reachability only when the run records these steps in order:
- Preflight confirms the local Web UI L4 proof, DHCP/IPv4 proof, session hardening, and connection-bound prerequisites are closed, and confirms that the run has current authorization for billable private GCE execution.
- Image/source provenance records the cloudboot run id, source commit, imported image or staged object identity, and the local artifact set used for the VM.
- Launch posture records the zone, machine type, NIC posture, no public IP, no service account or API scopes, and no public firewall rule.
- Probe setup records the private endpoint, UI port, probe source identity, and same-VPC or provider-supported private path that crosses the GCE virtual network boundary.
- The private probe fetches the Web UI over that provider-private path and records a run-unique response marker, header, or body token.
- The serial or harness evidence ties the same run to the Phase C L4 marker
for
remote-session-web-ui, such ascloudboot-evidence: remote-session-web-ui-l4 <token>, from the same source commit/image. - The harness emits the private proof marker, such as
cloudboot-evidence: gce-private-self-hosted-webui <token>, only after the provider-private probe and L4-marker correlation both succeed. - Teardown removes the VM, imported image, staged object, probe resources, and any private firewall or route resources created by the run, using the normal orphan-sweep and trap discipline.
- Failed-run reporting preserves the run id, failure class, last observed private posture, teardown result, and whether any loopback, same-guest, or serial-only diagnostics passed without treating those diagnostics as a provider-private proof.
No-Spend Preflight (Step 1, Landed as a Local Gate)
Step 1 of the checklist is implemented and testable today without any provider
mutation: tools/cloudboot/run-test.sh --require-web-ui-proof --preflight-only
runs the local no-spend preflight and exits before the harness access probe,
orphan sweep, upload, image import, instance launch, firewall mutation, or any
probe resource. It validates that the local prerequisite proofs are done
(cloud-prod-remote-session-web-ui-l4-local-proof,
remote-session-web-ui-session-hardening,
remote-session-web-ui-connection-bounds, and the legacy-datapath serving
prerequisite cloud-gce-legacy-virtio-webui-serving-local-proof), that an
operator supplied a firewall-IAM attestation (the documented live blocker), and
that a current per-run billable authorization is present, emitting one
structured cloudboot-webui-preflight: line per check naming the failure class
without printing credentials or attestation values. make cloudboot-gce-private-webui-preflight-check is the fixture gate proving the
safe failure paths and that no provider CLI is invoked on any preflight path
(tools/cloudboot/README.md documents the inputs and failure classes). A
preflight pass is cloudboot-only evidence – the output labels itself
evidence-class=cloudboot-local-preflight – and is neither the
provider-private proof nor authorization for a billable run. The live
--require-web-ui-proof gate remains unimplemented and fails closed without
--preflight-only.
Evidence-Grammar Fixture (Local Gate)
The closeout evidence grammar for the table above is also locally testable
without any provider mutation:
tools/cloudboot/validate-private-webui-evidence.sh validates a
harness-rendered evidence report for field completeness, marker ordering (the
private proof marker only after the recorded private-probe pass and the
correlated remote-session-web-ui-l4 marker), run/source identity agreement,
private posture, and teardown result, and rejects loopback-only, serial-only,
same-guest, public-IP, public-firewall, and missing-teardown evidence with
structured failure classes. make cloudboot-gce-private-webui-evidence-fixture-check is the fixture gate
(tools/cloudboot/README.md documents the report grammar and failure
classes). A pass is
evidence-class=cloudboot-local-private-webui-evidence-fixture with an
explicit provider-private-reachability=not-proven label: it proves only that
a future successful run’s evidence will be parsed, ordered, and classified
correctly, not that any provider-private probe has run.
Loopback-only checks (127.0.0.1, guest-local localhost, or an in-guest HTTP
health request) are supplemental service-health evidence. They may help diagnose
a failed run, but they do not close cloud-gce-private-self-hosted-webui-proof
because they do not prove the provider NIC, VPC routing, private endpoint, or
probe-to-VM packet path. Serial-only markers are likewise insufficient for the
private Web UI proof unless the private probe also succeeds and the harness
records the required provider-private fields.
The public ingress policy below remains a later authorization boundary. Closing the private proof does not permit a public IP, load balancer, DNS name, TLS certificate, Identity-Aware Proxy, operator browser exposure, or widened service account. Public browser-facing exposure must reference the private proof as an input and then satisfy the separate public-ingress policy and on-hold approval gate.
Public Web UI Ingress Policy (First Operator-Access Proof)
The cloudboot harness intentionally launches with no public IP, no service
account, and no API scopes. Exposing the self-served capOS Web UI
(remote-session-web-ui, see
Remote Session CapSet Client
Gate 1B) to an operator browser is therefore a separate, reviewed exposure
decision, not a follow-on of the private reachability proof. This section is the
selected policy that the first public-ingress behavior task
(cloud-gce-public-self-hosted-webui-ingress-tls)
builds against, decided by
cloud-gce-public-webui-ingress-tls-policy-design.
Selected Ingress Shape: Provider-Terminated HTTPS Load Balancer
The first public proof uses a GCP external Application Load Balancer that terminates HTTPS at the Google front end. capOS serves only plain HTTP/1.1 on its UI backend port; the operator browser reaches the UI exclusively through the load balancer’s HTTPS virtual IP and hostname. TLS is terminated by Google’s front end against a managed certificate; capOS never holds the TLS private key and never parses hostile TLS bytes in this proof.
graph LR
B[Operator browser] -- HTTPS --> LB[GCP external HTTPS<br/>Application Load Balancer<br/>Google-managed cert]
LB -- HTTP, health-check-scoped firewall --> NEG[Zonal NEG / backend service]
NEG --> VM[capOS VM<br/>remote-session-web-ui :8080<br/>plain HTTP/1.1, no public IP]
style LB fill:#2d5,stroke:#333
style VM fill:#2d5,stroke:#333
Why this shape is the first proof rather than direct capOS TLS termination:
- No capOS TLS termination stack exists yet. The Phase-1 certificate
verifier has landed, but the capability-native TLS termination model
(
TlsServerConfig, ACME issuance, OCSP stapling, and private-key custody) is not landed in Certificates and TLS, and the userspace L4 network stack has not yet completed fullTcpSocketrelocation. The ACME/Let’s Encrypt successor path is decomposed, but it still depends on minimalPrivateKey/KeyVault/KeySourcecustody, server-side TLS, the RFC 8555 client, the scopedhttp-01solver, andCertificateStore.watchrenewal. A direct external IP would put capOS’s nascent userspace HTTP parser at the first byte of hostile internet traffic with no TLS and no reviewed key custody. - Least privilege and reversibility. Provider-terminated TLS keeps the VM
with no public IP, no inbound
0.0.0.0/0, and no private-key custody in either capOS or the harness. Teardown is the deletion of a bounded set of provider resources, not the rotation of an exposed key. - Clean successor path. When the capability-native TLS stack and an ACME
flow ship, the direct-external-IP / capOS-terminated shape becomes available
as a second, separately reviewed ingress. This proof does not foreclose it; it
is the bootstrap step before it. The interim posture is recorded as
“Bootstrap TLS for the First Public GCE Web UI” in
Certificates and TLS, and the
public GCE successor task is
cloud-gce-public-webui-letsencrypt-direct-termination-proof. That successor requires a controlled public DNS name plus explicit billable/public-ingress authorization, and any Let’s Encrypt production call requires explicit CA authorization.
Raw public HTTP is not acceptable closeout evidence. If port 80 is published at all, it exists only as an HTTP-to-HTTPS 301 redirect at the load balancer and never reaches capOS. The closeout evidence must be the HTTPS path.
An optional hardening for the first proof is to enable Identity-Aware Proxy
(IAP) on the backend service so the public door is gated by Google IAM before
any request reaches the capOS backend. IAP here is not a separate ingress shape:
it rides on the same external HTTPS load balancer and gates that backend service,
so the ALB is still the only public entry point. IAP composes with, and does not
replace, the capOS SessionManager/AuthorityBroker login boundary: IAP
authenticates the human to Google; capOS still mints its own UserSession and
projects only browser-safe view models. The browser never receives raw capOS
caps.
Certificate and Key Custody
| Concern | First proof | Successor (deferred) |
|---|---|---|
| TLS terminator | Google front end (load balancer) | capOS userspace TLS service |
| Certificate source | Google-managed certificate (Certificate Manager or classic managed cert), or an operator-supplied cert resource on the load balancer | ACME (AcmeClient + http-01/tls-alpn-01 solver) from Certificates and TLS |
| Private-key custody | Google-held; never in capOS or the harness | capOS PrivateKey cap sealed under a KeySource |
| Min TLS version / cipher policy | Load balancer SSL policy (TLS 1.2+ minimum; prefer the GCP MODERN/RESTRICTED profile) | capOS CipherPolicy (modern) |
The first proof must not write a private key into the disk image, the manifest, the cloudboot evidence directory, or any harness-staged object. A managed certificate keeps key material entirely on the provider side.
The successor must preserve the same no-export rule on the capOS side: the ACME
account key and TLS private key remain behind PrivateKey / KeyVault
authority and are not copied into cloudboot images, manifests, logs, or evidence
directories. Local ACME proofs use a local directory; public GCE/Let’s Encrypt
proofs require explicit run authorization, DNS-name control, public-ingress
teardown evidence, and staging-vs-production CA labeling.
Browser Session and Origin Policy
The self-served Web UI keeps the Gate 1B boundary: remote-session-web-ui is
the trusted backend that holds remote-session/CapSet state server-side, and
browser JavaScript receives only browser-safe view models. Public exposure adds
the following reviewed browser rules:
- Single public origin. UI assets and the same-origin JSON API are served
from the one HTTPS origin (the load balancer hostname). No second origin, no
wildcard CORS, no cross-origin credentialed requests. The service-side
policy is implemented in
remote-session-web-uias a boot-manifest input: onepublic_origin.<host>marker cap (an inert Endpoint, granted after the service caps) fixes the acceptedhttps://<host>origin at boot, validated fail-closed (second marker, malformed, loopback-named, or IP-literal-shaped host, or any unrecognized extra grant fails the boot), and consulted by theHost/Origin/Referergates only for requests on the trusted forwarded-scheme HTTPS path, so a direct client can never claim the public origin. Browser-supplied principal/source hint headers (IAP assertions, authenticated-user hints) are rejected on the public-origin path before any backend-held capability dispatch, no CORS headers are emitted, and login ingress extends to the recorded GFE ranges only when a public origin is configured. Locally proven bymake run-cloud-prod-remote-session-web-ui-l4(in-process trusted-forwarder fixture positive plus cross-origin, mixed-scheme, wildcard, missing-origin, hostile-Referer, principal-hint, and real-ingress direct-client forged negatives); the proof claims no DNS name, load balancer, TLS endpoint, or live public exposure. - Forwarded-scheme trust is firewall-bounded. Because the backend hop is
plain HTTP, capOS derives the external scheme from the load balancer’s
X-Forwarded-Proto/forwarding headers. It must trust those headers only from the Google front-end source ranges (enforced by the firewall below), and treat any such header from an unexpected source as absent (default to “not HTTPS”, fail closed on secure-context assumptions). The service-side trust gate is implemented inremote-session-web-ui(forwarded_scheme_peer_trusted/external_scheme_is_https, pinned to130.211.0.0/22and35.191.0.0/16, fail-closed on unknown peer formats) and locally proven bymake run-cloud-prod-remote-session-web-ui-l4: a real ingress client forgingX-Forwarded-Proto: httpskeeps the non-Securecookie posture, and a fixture simulating the recorded ranges is the only path that flips the session cookie toSecure. The local proof remains plaintext-loopback and claims no live load balancer or TLS endpoint. - Session cookies. The session cookie is
Secure,HttpOnly, andSameSite. TheSameSitevalue is picked deterministically rather than mid-slice:Strictwhen no IAP front door is used, andLaxwhen IAP is enabled (the IAP sign-in redirect is a cross-site top-level navigation that would drop aStrictcookie on return).Secureis honored because the browser only ever sees the cookie over the load balancer’s HTTPS origin. The switch is implemented inremote-session-web-uias a boot-manifest policy input: an IAP-fronted deployment manifest grants the inertiap_fronted_ingressmarker cap (last in the web-ui grant list) to selectLax; without it the service emitsStrict, andSameSite=Noneis never emitted. The posture applies uniformly to the session, CSRF, and logout/expiry clear-cookie headers, stays independent of the forwarded-scheme-derivedSecureattribute, and is fixed at boot so no request header, cookie, or body field can select the weaker branch. Because aLaxcookie attaches on cross-site top-level GET navigations, the Lax posture additionally rejects authenticated GET views whose Fetch Metadata provenance (Sec-Fetch-Site) is cross-site – and cookie-bearing GETs with no Fetch Metadata at all, covering legacy browsers and webviews that attach Lax cookies without stating provenance – before any session state is touched; the gate is inert underStrict, where the cookie never attaches cross-site.make run-cloud-prod-remote-session-web-ui-l4proves the defaultStrictposture end to end (including a real-ingress login forging IAP-shaped headers and body fields) and theLaxbranch through the service’s in-process policy fixture; the live IAP-fronted deployment is future work. - HSTS and redirect. The HTTPS edge sets
Strict-Transport-Securitywith a conservativemax-age(nopreload, noincludeSubDomainscommitment for the first proof). Any port-80 listener is a 301 to HTTPS only. - CSRF. State-changing JSON routes require a per-session anti-CSRF token and
an
Origin/Referercheck against the known public origin; cross-origin or origin-absent state changes are rejected. - Session lifetime and logout. Sessions carry a bounded idle timeout and an absolute lifetime. Logout drops the server-side session and clears the cookie; the existing self-served stale-session / logout failure-closed boundary (proven in the Gate 1B implementation gate) extends unchanged to the public endpoint. A stale or expired cookie yields no authority.
Firewall and Source-Range Policy
The instance keeps no public IP. Ingress to the capOS UI backend port is allowed
only from Google’s load-balancer and health-check ranges, never from
0.0.0.0/0:
| Allowed source | Purpose |
|---|---|
130.211.0.0/22, 35.191.0.0/16 | Google Front Ends and load-balancer health checks reaching the backend port |
35.235.240.0/20 | Identity-Aware Proxy (only if IAP fronting or IAP-tunneled SSH/diagnostics is used) |
No other ingress rule is created. The proof does not broaden the service
account, add API scopes beyond the LB/health-check need, open SSH to the public
internet, or attach a broad firewall tag. Egress stays default-deny-friendly:
the LB-terminated path needs no capOS outbound, and the future ACME path (which
would require egress 443 to the ACME directory) is explicitly out of scope
here.
Backend Health-Check Contract (Local Proof Landed)
The backend port is reachable only from the GFE/health-check ranges above, so
the load balancer’s health checker is the route’s only intended public caller.
The backend health contract, proven locally by
make run-cloud-prod-remote-session-web-ui-l4:
- Route:
GET /healthzon the Web UI backend port, served bydemos/remote-session-web-ui(HEALTH_BODY). The exact bounded response body is{"ok":true,"service":"remote-session-web-ui"}withContent-Type: application/jsonandCache-Control: no-store; it carries no cap ids, session ids, user/profile names, endpoint handles, provider resource ids, host paths, or secret material. - No authority: the route is unauthenticated and never creates, rotates,
refreshes, or consumes a browser session; it never emits
Set-Cookie, and a presented (even forged) session cookie changes nothing. The local proof drives a/healthzprobe with live session cookies against an idle-expired session and asserts the next authenticated call still fails closed. It is the only unauthenticated public-ingress liveness exception; the Host/Origin/CSRF/session gates on authority-bearing routes are unchanged. (/api/healthremains the bundled operator app’s same-origin page-load ping with the same no-authority posture; the provider health check never probes it.) - Host-gate exemption: the health checker probes the backend by IP, so
/healthzdeliberately does not require the loopback/public-hostHostallowlist that authority-bearing routes enforce. - Fail-closed variants: non-
GETmethods and path variants (POST /healthz,/healthz/extra,/HEALTHZ) return 404 without reaching any authority-bearing handler. - Availability under abuse: the slow-client phases of the L4 smoke prove a
concurrent
/healthzkeeps completing while idle, partial-request, and drip-feed clients are held open, and after they are abandoned.
This is local backend readiness for the selected policy
(evidence-class=local-qemu), not a live GCE health check: no health-check
resource, load balancer, firewall rule, or public endpoint exists, and a
passing local contract proof authorizes none of them.
Audit and Evidence Fields
The public proof records, before teardown, at least:
- selected ingress shape (
https-load-balancer) and whether IAP was enabled; - public endpoint (hostname and HTTPS virtual IP);
- TLS posture: terminator (
google-frontend), certificate type (google-managedoroperator-supplied), and the load balancer SSL-policy minimum TLS version; - authentication method exercised (capOS
SessionManagerlogin, and Google IAM identity if IAP is enabled); - firewall/forwarding scope: the named source ranges, backend port, and the URL-map/forwarding-rule chain created;
- HTTP-to-HTTPS redirect and HSTS header observation;
- teardown result for every resource the proof created.
Teardown Checklist
The existing harness deletes the instance, image, and staging tarball in an
EXIT INT TERM trap. The public proof extends that trap to delete, in
dependency order, every ingress resource it creates:
- global forwarding rule and target HTTPS proxy;
- URL map and any HTTP-to-HTTPS redirect URL map / target HTTP proxy;
- backend service and health check;
- zonal/serverless NEG or managed instance group backing the backend;
- managed certificate / certificate-map entry / SSL policy created for the run;
- the LB-scoped and (if used) IAP-scoped firewall rules;
- the reserved external IP address, if one was allocated for the LB;
- the instance, image, and staged tarball (existing harness behavior).
Teardown must be idempotent and must run on signal or partial failure, matching the existing orphan-sweep discipline. A run that cannot confirm deletion of an ingress resource is a failed run, not a passed one.
Local Plan Gate (Landed)
The resource graph above is locally reviewable before any billable work:
tools/cloudboot/plan-public-webui-ingress.sh renders and validates the
selected plan shape with zero provider interaction, and
make cloudboot-public-webui-ingress-plan-check is the fixture gate proving
each rejected hazard (raw public HTTP to capOS, instance public IP,
0.0.0.0/0 backend ingress, missing /healthz health check, broad service
account/scopes, staged private-key material, non-provider certificate custody)
fails closed by structured class before any provider CLI could be invoked.
Output is stamped evidence-class=cloudboot-local-plan with
operator-exposure=not-proven; a plan pass is not public reachability, TLS
readiness, or authorization for the on-hold public proof. The command contract
and failure classes are documented in tools/cloudboot/README.md (“Public Web
UI ingress plan gate”).
Local Teardown Fixture Gate (Landed)
The teardown checklist above is locally proven before any billable work:
tools/cloudboot/teardown-public-webui-ingress.sh is the dependency-ordered,
idempotent, deletion-confirming teardown engine over a per-run
created-resources journal, and
make cloudboot-public-webui-teardown-fixture-check exercises it against
recording stub provider CLIs across complete, partial-create,
command-failure, delete-claims-success-but-persists, unreadable-state,
signal-trap, and orphan-sweep paths. Every checklist resource class is
modeled and the engine’s class list must equal the plan gate’s rendered
teardown-order= line (the fixture fails on drift), so a class added to the
selected plan cannot go missing from the cleanup graph. An unconfirmed
deletion is a blocking structured failure (undeleted-<class> /
resource-state-unknown), matching the failed-run policy above. All
public-ingress resource names must carry the capos-test- sweepable marker;
a journal naming anything else is rejected before any provider call, and the
orphan sweep enforces the marker client-side so out-of-scope resources are
never deleted. Output is stamped
evidence-class=cloudboot-local-teardown-fixture live-teardown=not-proven;
a fixture pass is local harness evidence only, never live provider teardown
evidence, and authorizes no public ingress. The journal grammar, sweep
contract, and failure classes are documented in tools/cloudboot/README.md
(“Public Web UI ingress teardown fixture gate”).
Local Evidence Fixture Gate (Landed)
The “Audit and Evidence Fields” contract above is locally proven before any
billable work: tools/cloudboot/validate-public-webui-evidence.sh validates
a harness-rendered public-proof closeout report against the selected
evidence grammar, and
make cloudboot-public-webui-proof-evidence-fixture-check is the fixture
gate proving accepted and rejected reports over stub inputs with zero
provider CLI invocations. Acceptance requires the recorded ingress shape,
public HTTPS hostname/VIP, provider TLS terminator and managed or
operator-supplied certificate resource, minimum TLS policy, IAP posture,
no-key-custody statement, no-public-IP instance posture, GFE/health-check
firewall scope, health-check, HTTP-to-HTTPS redirect and HSTS observations,
capOS SessionManager login observation, a public HTTPS probe record, the
correlated gce-public-self-hosted-webui-ingress-tls proof marker, and a
per-resource teardown record pinned to the plan gate’s teardown-order=
class list (the fixture fails on drift). Raw public HTTP, a direct
instance public IP, wildcard backend ingress, a missing health check,
missing HSTS/redirect observation, capOS or harness private-key custody,
stale/missing/incomplete teardown, a non-provider TLS terminator, and
private-proof-only evidence (a same-VPC or provider-internal probe path,
or a proof marker without a recorded HTTPS probe) each fail closed by
structured class. The tls terminator= label structurally separates this
provider-terminated evidence contract from the later capOS-terminated TLS
successor, so successor evidence can never pass through the first-proof
grammar. Output names field names, classes, and line numbers only; input
values are never echoed. Every pass is stamped
evidence-class=cloudboot-local-public-webui-evidence-fixture with
operator-exposure=not-proven: a fixture pass is local evidence-grammar
validation only, never public reachability or operator-access evidence,
and it does not authorize public exposure or move the live proof out of
cloud-gce-public-self-hosted-webui-ingress-tls.
The report grammar and failure classes are documented in
tools/cloudboot/README.md (“Public Web UI evidence-grammar fixture
gate”).
Local Provider Command Allowlist Gate (Landed)
The provider command boundary the future public proof may use is locally
proven before any billable work:
tools/cloudboot/check-public-webui-provider-commands.sh validates a
recorded provider-command transcript against the selected resource graph,
and make cloudboot-public-webui-provider-command-allowlist-check is the
fixture gate proving both directions over recording stub gcloud/gsutil
with zero live provider invocations. The allowlist permits only the
resource families the plan and teardown checklist name – forwarding rules,
target HTTPS/HTTP proxies, URL maps, backend services, health checks,
zonal NEGs, scoped firewall rules, managed-certificate resources, SSL
policies, reserved addresses, instance/image creation, and staged
tarball upload/delete – and requires the capos-test- marker on every
created resource, journal-pinned deletion (a delete must name a resource
the created-resources journal recorded), GFE/IAP-only firewall source
ranges, the capos-test filter on every listing, marker discipline on
create-wired references, per-surface create flags and parameters pinned to
the selected graph shape, an explicit pin of the documented sandbox project on
every command, and explicit --global/--zone scope on deletes (ambient
Cloud SDK project/region defaults are never trusted). Drift toward broader
provider authority fails closed
by structured class: IAM mutation, service-account/scopes changes, DNS
mutation, private-key upload, 0.0.0.0/0 backend ingress, unmarked
resources, deletion outside the journal (zone-pinned), project-wide or
filter-restating sweeps, ambient credential flags, project/network/region
scope overrides beyond the pinned sandbox forms, --flags-file
indirection, non-selected create parameters, shell/environment
inspection, and provider CLI resolution from an unexpected path. Rejected
command content is reported by class and line number only; credentials,
principals, key paths, and rejected names are never echoed. Output is
stamped evidence-class=cloudboot-local-provider-command-allowlist with
provider-mutation=none: a pass narrows what the future live proof may
execute, it is not live provider evidence and does not authorize the
on-hold public proof. The transcript grammar and failure classes are
documented in tools/cloudboot/README.md (“Public Web UI
provider-command allowlist gate”).
Phase 2: ACPI and Device Discovery
Goal: Parse ACPI tables to discover hardware topology, interrupt routing, and PCI root complexes. This replaces QEMU-specific hardcoded assumptions.
Why ACPI
On QEMU with default settings, you can hardcode PCI config space at
0xCF8/0xCFC and assume legacy interrupt routing. On real cloud hardware:
- PCI root complex addresses come from ACPI MCFG table (PCIe ECAM)
- Interrupt routing comes from ACPI MADT (I/O APIC entries) and _PRT
- CPU topology comes from ACPI MADT (LAPIC entries)
- Timer info comes from ACPI HPET/PMTIMER tables
Limine provides the RSDP (Root System Description Pointer) address via its protocol. From there, the kernel can walk RSDT/XSDT to find specific tables.
Required Tables
| Table | Purpose | Priority |
|---|---|---|
| MADT | LAPIC and I/O APIC addresses, CPU enumeration | High (Phase 2) |
| MCFG | PCIe Enhanced Configuration Access Mechanism base | High (Phase 2) |
| HPET | High Precision Event Timer address | Medium (fallback timer) |
| FADT | PM timer, shutdown/reset methods | Low (future) |
Landed Discovery Slice
The first landed slices are bounded diagnostics plus reusable config access.
The ACPI parser requests
Limine’s RSDP, validates RSDP/RSDT/XSDT/static-table lengths and checksums
within fixed caps, emits serial summaries for RSDT/XSDT table count and
MADT/MCFG presence, reports MADT LAPIC/I/O APIC/interrupt-source-override
inputs, and reports MCFG ECAM allocation records when firmware provides the
table. The PCI layer now keeps the existing legacy I/O-port backend and adds an
ECAM backend selected from MCFG allocations; devices retain their discovery
backend so config reads, writes, capability walking, and BAR sizing use the
same access path. The PCI layer also exposes a shared memory-BAR subregion
validator/mapper, and the virtio-net transport uses it for modern capability
regions. It also reports MSI/MSI-X capability metadata for the virtio-net
function and uses kernel-owned config/RX/TX source records with a bounded
first-fit LAPIC device MSI vector pool plus lock-free dispatch slots for QEMU
virtio-net MSI-X table programming, virtio vector assignment, driver-owned
route unmask, claimed-route lifecycle/reassignment proof, and TX delivery
proof. The x86 setup
maps MADT I/O APICs and programs masked legacy IRQ routes from MADT source
overrides before higher-level drivers can depend on interrupt routing. The Q35
smoke asserts both the ECAM inventory lines, a
pci: config backend=ecam enumerated ... proof line, and representative masked
I/O APIC route lines; the net smoke asserts virtio-net BAR, capability, MSI-X
metadata, source-route records, route unmask records, vector programming,
queue assignment, descriptor guards, ARP, and ICMP fixture lines before
MMIO transport mapping completes. This path does not interpret AML, provide
userspace driver authorities, or provide full unbounded bus discovery yet.
Implementation
#![allow(unused)]
fn main() {
// kernel/src/acpi.rs
/// Minimal ACPI table parser.
/// Walks RSDP -> XSDT -> individual tables.
/// Does NOT implement AML interpretation -- static tables only.
pub struct AcpiInfo {
pub lapics: Vec<LapicEntry>,
pub io_apics: Vec<IoApicEntry>,
pub iso_overrides: Vec<InterruptSourceOverride>,
pub mcfg_base: Option<u64>, // PCIe ECAM base address
pub hpet_base: Option<u64>,
}
pub fn parse_acpi(rsdp_addr: u64, hhdm: u64) -> AcpiInfo { ... }
}
For the fuller static-table subsystem, prefer the acpi crate (or an
equivalent maintained no_std parser) rather than expanding the diagnostic
parser into a general hand-written ACPI stack. The landed parser is a boot-time
inventory proof for RSDP/RSDT/MADT/MCFG summaries; it can be retired or
narrowed once the crate-backed table model fits capOS mapping and table
lifetime constraints.
Limine RSDP
#![allow(unused)]
fn main() {
use limine::request::RsdpRequest;
static RSDP: RsdpRequest = RsdpRequest::new();
// In kmain:
let rsdp_addr = RSDP.response().expect("no RSDP").address as u64;
let acpi_info = acpi::parse_acpi(rsdp_addr, hhdm_offset);
}
Crate Dependencies
| Crate | Purpose | no_std |
|---|---|---|
acpi | Planned fuller/static ACPI table parsing (MADT, MCFG, HPET, FADT, etc.) | yes |
Scope
The landed diagnostic slice is kernel-local bounded read-only parsing for serial inventory. Fuller handling should be mostly glue around a maintained static-table parser plus capOS mapping, lifetime, and authority types.
Phase 3: Interrupt Infrastructure
Goal: Set up I/O APIC for device interrupt routing and MSI/MSI-X for modern PCI devices. This replaces the implicit legacy PIC setup.
I/O APIC
The I/O APIC routes external device interrupts (keyboard, serial, PCI devices) to specific LAPIC entries (CPUs). Its address and configuration come from the ACPI MADT (Phase 2).
#![allow(unused)]
fn main() {
// kernel/src/arch/x86_64/ioapic.rs
pub struct IoApic {
base: *mut u32, // MMIO registers via HHDM
}
impl IoApic {
/// Route an IRQ to a specific LAPIC/vector.
pub fn route_irq(&mut self, irq: u8, lapic_id: u8, vector: u8) { ... }
/// Mask/unmask an IRQ line.
pub fn set_mask(&mut self, irq: u8, masked: bool) { ... }
}
}
The current x86 implementation maps MADT I/O APIC MMIO, reads each controller’s ID/version/redirection count, and programs legacy IRQ 0-15 routes to LAPIC vectors while keeping the redirection entries masked. It respects Interrupt Source Override entries from MADT (for example, Q35 remaps IRQ 0 to GSI 2). Driver-owned unmask policy, dispatch, and EOI handling remain planned.
MSI/MSI-X
Modern PCI/PCIe devices (NVMe, cloud NICs) use Message Signaled Interrupts instead of pin-based IRQs routed through the I/O APIC. MSI/MSI-X writes directly to the LAPIC’s interrupt command register, bypassing the I/O APIC entirely.
This is critical for cloud deployment because:
- NVMe controllers require MSI or MSI-X (no legacy IRQ fallback on many controllers)
- Cloud NICs (ENA, gVNIC) use MSI-X exclusively
- MSI-X supports per-queue interrupts (one vector per virtqueue/submission queue), enabling better SMP scalability
#![allow(unused)]
fn main() {
// kernel/src/pci/msi.rs
/// Configure MSI for a PCI device.
pub fn enable_msi(device: &PciDevice, vector: u8, lapic_id: u8) { ... }
/// Configure MSI-X for a PCI device.
pub fn enable_msix(
device: &PciDevice,
table_bar: u8,
entries: &[(u16, u8, u8)], // (index, vector, lapic_id)
) { ... }
}
MSI/MSI-X capability structures are found by walking the PCI capability list (already needed for PCI enumeration in the networking proposal). The current PCI path reports MSI/MSI-X capability metadata for virtio-net so diagnostics can see the advertised table and pending-bit-array layout. The virtio-net QEMU smoke now records kernel-owned config/RX/TX MSI-X sources, publishes them into the device interrupt dispatch table, allocates LAPIC vectors from the bounded device MSI vector pool to program their table entries and virtio vector registers, lets the in-kernel virtio-net owner unmask only those routes, then proves TX delivery by observing that source’s dispatch counter advance after maskable interrupts are live. The same smoke uses an unused masked MSI-X table entry to prove claimed-route reassignment, stale old-route rejection, old-vector unregistered delivery, reassigned-vector masked delivery, unsupported-vector delivery, and release. Broader driver dispatch and userspace interrupt authority remain planned.
Integration with SMP
LAPIC initialization is shared with the SMP proposal. The active x86 path uses xAPIC MMIO for the immediate QEMU/KVM timer and IPI foundation, with PIT/PIC fallback. This cloud phase consumes that architectural LAPIC path for local interrupt delivery and now adds masked ACPI MADT I/O APIC routing plus MSI/MSI-X capability metadata discovery and a bounded virtio-net MSI-X dispatch/lifecycle proof; userspace device interrupts remain planned.
KVM/QEMU paravirtual features such as PV EOI, PV IPI, and PV TLB flush are host-specific accelerations. They are useful later for cloud performance, but cloud boot correctness should use the architectural LAPIC path first. x2APIC is a later backend for newer/high-core systems and firmware states where xAPIC is unavailable or undesirable; it is not a blocker for the current LAPIC path.
Scope
~300-400 lines total:
- I/O APIC driver: ~150 lines
- MSI/MSI-X setup: ~100-150 lines
- Integration/routing logic: ~50-100 lines
Phase 4: PCI/PCIe Infrastructure
Goal: Standalone PCI bus enumeration and device management, usable by all device drivers (virtio-net, NVMe, cloud NICs).
The networking proposal includes PCI enumeration as a substep for finding virtio-net. This phase promotes it to a reusable kernel subsystem that all device drivers build on.
PCI Configuration Access
Two mechanisms, determined by ACPI:
- Legacy I/O ports (0xCF8/0xCFC) – works in QEMU, limited to 256 bytes of config space per function. Insufficient for PCIe extended capabilities.
- PCIe ECAM (Enhanced Configuration Access Mechanism) – memory-mapped config space, 4 KB per function. Base address from ACPI MCFG table. Required for MSI-X capability parsing and NVMe BAR discovery on real hardware.
Legacy I/O and Q35 ECAM config access exist today behind the same early PCI
backend abstraction. The PCI layer also validates memory BAR subregions with
checked offset/length/alignment bounds and maps selected subregions through the
kernel MMIO window for in-kernel drivers, and it records non-programming
MSI/MSI-X metadata for the current virtio-net path by walking the standard PCI
capability list. The virtio-net path now selects a usable MSI-X capability and
programs config/RX/TX table entries through the typed PCI MSI-X table helper
using the kernel-owned source records and bounded first-fit LAPIC device MSI
vectors. The QEMU net smoke lets the in-kernel virtio-net owner claim and
unmask those routes, assigns the virtio common and queue MSI-X vector
registers, and proves TX delivery by observing that source’s dispatch counter
advance after the TX completion path has run and maskable interrupts are live.
It also proves claimed-route reassignment and release with an unused masked
MSI-X table entry. The next steps are using that path for full bus discovery,
userspace DeviceMmio authority, broader driver dispatch, and driver binding.
Device Enumeration
#![allow(unused)]
fn main() {
// kernel/src/pci.rs
pub struct PciDevice {
pub bus: u8,
pub device: u8,
pub function: u8,
pub vendor_id: u16,
pub device_id: u16,
pub class: u8,
pub subclass: u8,
pub bars: [Option<Bar>; 6],
pub interrupt_pin: u8,
pub interrupt_line: u8,
}
pub enum Bar {
Memory {
base: u64,
size: u64,
prefetchable: bool,
width: MemoryBarWidth,
},
Io { base: u32, size: u32 },
}
/// Scan all PCI buses and return discovered devices.
pub fn enumerate() -> Vec<PciDevice> { ... }
/// Find a device by vendor/device ID.
pub fn find_device(vendor: u16, device: u16) -> Option<PciDevice> { ... }
/// Walk the PCI capability list for a device.
pub fn capabilities(device: &PciDevice) -> Vec<PciCapability> { ... }
}
BAR Mapping
Device drivers need MMIO access to BAR regions. The kernel now maps validated
memory-BAR subregions into its bounded MMIO virtual window for in-kernel
drivers. A future DeviceMmio capability will carry equivalent authority to
userspace drivers as described in the networking proposal.
PCI Device IDs for Cloud Hardware
| Device | Vendor:Device | Cloud |
|---|---|---|
| virtio-net | 1AF4:1000 (transitional) or 1AF4:1041 (modern) | QEMU, supported first/second-generation GCP machine families |
| virtio-blk | 1AF4:1001 (transitional) or 1AF4:1042 (modern) | QEMU |
| NVMe | 8086:various, 144D:various, etc. | All clouds (EBS, PD, Managed Disk) |
| AWS ENA | 1D0F:EC20 / 1D0F:EC21 | AWS |
| GCP gVNIC | 1AE0:0042 | GCP |
| Azure MANA | 1414:00BA | Azure |
Scope
~400-500 lines:
- Config space access (I/O + ECAM): ~100 lines
- Bus enumeration: ~150 lines
- BAR parsing and mapping: ~100 lines
- Capability list walking: ~50-100 lines
Phase 5: NVMe Driver
Goal: Basic NVMe block device driver, sufficient to read/write sectors. This is the storage equivalent of virtio-net for networking – the first real storage driver.
Why NVMe Over virtio-blk
The storage-and-naming proposal mentions virtio-blk for Phase 3 (persistent store). On cloud VMs, all three providers expose NVMe:
- AWS EBS – NVMe interface (even for gp3/io2 volumes)
- GCP Persistent Disk – NVMe or SCSI (NVMe is default for newer VMs)
- Azure Managed Disks – SCSI on many older VM families such as D/Ev5 or Fv2 and older; NVMe on Azure Boost and newer NVMe-capable families such as Ebsv5 and Da/Ea/Fav6 and newer
virtio-blk is QEMU-only. An NVMe driver unlocks persistent storage on all
cloud platforms where the selected VM shape exposes NVMe. For QEMU testing,
QEMU also emulates NVMe well:
-drive file=disk.img,if=none,id=d0 -device nvme,drive=d0,serial=capos0.
NVMe Architecture
NVMe is a register-level standard with well-defined queue-pair semantics:
Application
|
v
Submission Queue (SQ) -- ring buffer of 64-byte command entries
|
| doorbell write (MMIO)
v
NVMe Controller (hardware)
|
| DMA completion
v
Completion Queue (CQ) -- ring buffer of 16-byte completion entries
|
| MSI-X interrupt
v
Driver processes completions
Minimum viable driver needs:
- Admin Queue Pair (for identify, create I/O queues)
- One I/O Queue Pair (for read/write commands)
- MSI-X for completion notification (or polling)
Implementation Sketch
#![allow(unused)]
fn main() {
// kernel/src/nvme.rs (or kernel/src/drivers/nvme.rs)
pub struct NvmeController {
bar0: *mut u8, // MMIO registers
admin_sq: SubmissionQueue,
admin_cq: CompletionQueue,
io_sq: SubmissionQueue,
io_cq: CompletionQueue,
namespace_id: u32,
block_size: u32,
block_count: u64,
}
impl NvmeController {
pub fn init(pci_device: &PciDevice) -> Result<Self, NvmeError> { ... }
pub fn read(&self, lba: u64, count: u16, buf: &mut [u8]) -> Result<(), NvmeError> { ... }
pub fn write(&self, lba: u64, count: u16, buf: &[u8]) -> Result<(), NvmeError> { ... }
pub fn identify(&self) -> NvmeIdentify { ... }
}
}
DMA Considerations
NVMe uses DMA for data transfer. The controller reads/writes directly from physical memory addresses provided in commands. Requirements:
- Buffers must be physically contiguous (or use PRP lists / SGLs for scatter-gather)
- Physical addresses must be provided (not virtual)
- Cache coherence is handled by hardware on x86_64 (DMA-coherent architecture)
The existing frame allocator can provide physically contiguous pages. For larger transfers, PRP (Physical Region Page) lists allow scatter-gather.
Crate Dependencies
| Crate | Purpose | no_std |
|---|---|---|
| (none) | NVMe register-level protocol is simple enough to implement directly | N/A |
The NVMe spec is cleaner than virtio and the register interface is straightforward. A minimal driver (admin + 1 I/O queue pair, read/write) is ~500-700 lines without external dependencies.
Integration with Storage Proposal
The storage proposal’s Phase 3 (Persistent Store) specifies virtio-blk as
the backing device. This can be generalized to a BlockDevice trait:
#![allow(unused)]
fn main() {
trait BlockDevice {
fn read(&self, lba: u64, count: u16, buf: &mut [u8]) -> Result<(), Error>;
fn write(&self, lba: u64, count: u16, buf: &[u8]) -> Result<(), Error>;
fn block_size(&self) -> u32;
fn block_count(&self) -> u64;
}
}
Both NVMe and virtio-blk implement this trait. The store service doesn’t care which backing driver it uses.
Scope
~500-700 lines for a minimal in-kernel NVMe driver (admin queue + 1 I/O queue pair, read/write, identify). Userspace decomposition follows the same pattern as the networking proposal (kernel driver first, then extract to userspace process with DeviceMmio + Interrupt caps).
Phase 6: Cloud NIC Strategy
Goal: Define the path to networking on cloud VMs, given that each cloud uses a different proprietary NIC.
The Landscape
| Cloud | Primary NIC | Virtio NIC available? | Open-source driver? |
|---|---|---|---|
| GCP | gVNIC (1AE0:0042) | Yes on supported first/second-generation machine families | Yes (Linux, ~3000 LoC) |
| AWS | ENA (1D0F:EC20) | No (Nitro only) | Yes (Linux, ~8000 LoC) |
| Azure | MANA (1414:00BA) | No (accelerated networking) | Yes (Linux, ~6000 LoC) |
Recommended Strategy
Short term: constrained virtio-net on GCP
GCP can expose VIRTIO_NET on supported first/second-generation machine
families. After the shared image, ACPI/PCIe, interrupt, DMA/MMIO, and virtio
foundation exists, that gives a constrained early cloud-network proof without
writing a provider-specific NIC driver. It is not the general GCP target:
third-generation-and-later machine families, Tau T2A, Confidential VM, and
some higher-bandwidth paths require gVNIC.
gcloud compute instances create capos-test \
--image=capos \
--machine-type=e2-micro \
--network-interface=nic-type=VIRTIO_NET
Medium term: gVNIC driver
gVNIC is a simpler device than ENA or MANA. The Linux driver is ~3000 lines (vs ~8000 for ENA). It uses standard PCI BAR MMIO + MSI-X interrupts. A minimal gVNIC driver (init, link up, send/receive) would be ~800-1200 lines.
gVNIC is worth prioritizing because:
- GCP’s constrained virtio-net path can de-risk cloud networking before a provider-specific NIC driver exists
- Graduating from virtio-net to gVNIC on the same cloud is the required path for newer, Tau T2A, Confidential VM, and higher-bandwidth GCP instances
- The gVNIC register interface is documented in the Linux driver source
Long term: ENA and MANA
ENA and MANA are more complex and less well-documented outside their Linux drivers. These should be deferred until the driver model is mature (userspace drivers with DeviceMmio caps, as described in the networking proposal Part 2).
At that point, the kernel only needs to provide PCI enumeration + BAR mapping + MSI-X routing. The actual NIC driver logic runs in a userspace process, making it feasible to port from the Linux driver source with appropriate licensing considerations.
Alternative: Paravirt Abstraction Layer
Instead of writing native drivers for each cloud NIC, an alternative is a thin paravirt layer:
Application -> NetworkManager cap -> Net Stack (smoltcp) -> NIC cap -> [driver]
Where [driver] is one of:
virtio-net(QEMU, supported first/second-generation GCP machine families)gvnic(GCP)ena(AWS)mana(Azure)
All drivers implement the same Nic capability interface from the networking
proposal. The network stack and applications are driver-agnostic.
This is already the architecture described in the networking proposal. The
only addition is recognizing that multiple driver implementations will exist
behind the same Nic interface.
Phase Summary and Dependencies
graph TD
P1[Phase 1: Disk Image + Serial Diagnostics] --> BOOT[Boots on Cloud VM]
P2[Phase 2: ACPI Parsing] --> P3[Phase 3: Interrupt Infrastructure]
P2 --> P4[Phase 4: PCI/PCIe]
P3 --> P5[Phase 5: NVMe Driver]
P4 --> P5
P4 --> NET[Networking Smoke Test<br>virtio-net driver]
P3 --> NET
P4 --> P6[Phase 6: Cloud NIC Drivers]
P3 --> P6
NET --> P6
S5[Stage 5: Scheduling] --> P3
SMP_C[SMP Phase C: LAPIC timer/IPI] --> P3
style P1 fill:#2d5,stroke:#333
style BOOT fill:#2d5,stroke:#333
| Phase | Depends on | Estimated scope | Enables |
|---|---|---|---|
| 1: Disk image + diagnostics | Nothing | image tooling plus bounded diagnostics mode | Cloud serial boot |
| 2: ACPI | Nothing (kernel code) | ~200-300 lines | Phases 3, 4 |
| 3: Interrupts | Phase 2, LAPIC (SMP Phase C) | ~300-400 lines | NVMe, cloud NICs |
| 4: PCI/PCIe | Phase 2 | ~400-500 lines | All device drivers |
| 5: NVMe | Phases 3, 4 | ~500-700 lines | Cloud storage |
| 6: Cloud NICs | Phases 3, 4, networking smoke test | ~800-1200 lines each | Cloud networking |
Minimum Path to “Boots on Cloud VM, Prints Hello”
Raw serial output and UEFI boot support already exist, so the smallest “prints hello” experiment is mostly Phase 1 image packaging plus any boot-path adjustments needed to reach the same COM1 output from an imported disk image. That experiment is a precursor, not the full Phase 1 closeout.
Phase 1 closeout also includes a bounded serial diagnostics prompt so cloud driver bring-up can inspect CPU, memory, ACPI, PCI, IRQ, timer, device, and log state before cloud NICs or storage drivers are reliable. That diagnostics surface is kernel/userspace behavior, not just build-system work.
Minimum Path to “Useful on Cloud VM”
Phases 1-5 (disk image + ACPI + interrupts + PCI + NVMe) plus the existing roadmap items (Stages 4-6 for capability syscalls, scheduling, IPC). On a supported first/second-generation GCP machine family, networking can use the existing virtio-net proposal without a provider-specific gVNIC/ENA/MANA driver on that constrained target.
QEMU Testing
All phases can be tested in QEMU before deploying to cloud:
| Phase | QEMU flags |
|---|---|
| Disk image | -drive file=capos.img,format=raw -bios OVMF.4m.fd |
| ACPI | Default QEMU provides ACPI tables (MADT, MCFG, etc.) |
| I/O APIC | Default QEMU emulates I/O APIC |
| PCI/PCIe | -device ... adds PCI devices; QEMU has PCIe root complex |
| NVMe | -drive file=disk.img,if=none,id=d0 -device nvme,drive=d0,serial=capos0 |
| MSI-X | Supported by QEMU’s NVMe and virtio-net-pci emulation; current net smoke asserts metadata selection, kernel-owned source-route records, route unmask, vector programming, virtio queue assignment, descriptor guards, ARP, and ICMP fixture evidence. Device-autonomous virtio-net MSI-X delivery is covered by the dedicated userspace-provider gates. |
| Multi-CPU | -smp 4 (already works with Limine SMP) |
| x2APIC backend | future explicit QEMU CPU feature such as -cpu qemu64,+smep,+smap,+rdrand,+x2apic |
aarch64 and ARM Cloud Instances
This proposal focuses on x86_64 because that’s the current kernel target, but ARM-based cloud instances are significant and growing:
| Cloud | ARM offering | Instance types |
|---|---|---|
| AWS | Graviton2/3/4 | m7g, c7g, r7g, etc. |
| GCP | Tau T2A (Ampere Altra) | t2a-standard-* |
| Azure | Cobalt 100 (Arm Neoverse) | Dpsv6, Dplsv6 |
ARM cloud VMs have the same general requirements (UEFI boot, ACPI tables, PCI/PCIe, NVMe storage) but different specifics:
- Interrupt controller: GIC (Generic Interrupt Controller) instead of APIC. GICv3 is standard on cloud ARM instances.
- Boot: UEFI via Limine (already targets aarch64). Limine handles the architecture differences at boot time.
- Timer: ARM generic timer (CNTPCT_EL0) instead of LAPIC/PIT/TSC.
- Serial: PL011 UART instead of 16550 COM1. Different register interface.
- NIC: Same PCI devices (ENA, gVNIC, MANA) with the same register interfaces – PCI/PCIe is architecture-neutral.
- NVMe: Same NVMe register interface – PCIe is architecture-neutral.
The arch-neutral parts of this proposal (PCI enumeration, NVMe, disk image format, ACPI table parsing) apply equally to aarch64. The arch-specific parts (I/O APIC, MSI delivery address format, LAPIC) need aarch64 equivalents (GIC, ARM MSI translation).
The existing roadmap lists “aarch64 support” as a future item. For cloud deployment, aarch64 should be considered as soon as the x86_64 hardware abstraction is stable, since:
- Device drivers (NVMe, virtio-net, cloud NICs) are architecture-neutral – they talk to PCI config space and MMIO BARs, which are the same on both architectures
- The
acpicrate handles both x86_64 and aarch64 ACPI tables - Limine already targets aarch64
- AWS Graviton instances are often cheaper than x86_64 equivalents
The main aarch64 kernel work is: exception handling (EL0/EL1 instead of Ring 0/3), GIC driver (instead of APIC), ARM generic timer, PL011 serial, and the MMU setup (4-level page tables exist on both but with different register interfaces).
Open Questions
-
ACPI scope. The landed diagnostic parser covers bounded read-only RSDP/RSDT/MADT/MCFG summaries only. The
acpicrate can parse fuller static tables (MADT, MCFG, HPET, FADT). Full ACPI requires AML interpretation (for _PRT interrupt routing, dynamic device enumeration). Do we need AML, or are static tables sufficient for cloud VMs? Cloud VM firmware typically provides simple, static ACPI tables – AML interpretation is likely unnecessary initially. -
PCIe ECAM vs legacy. Should we support both config access methods, or require ECAM (which all cloud VMs and modern QEMU provide)? Supporting both adds ~50 lines but makes bare-metal testing on older hardware possible.
-
NVMe queue depth. A single I/O queue pair with depth 32 is sufficient for initial use. Per-CPU queues (leveraging MSI-X per-queue interrupts) improve SMP throughput but add complexity. Defer per-CPU queues to after SMP is working.
-
Driver model unification. Resolved: PCI enumeration is the standalone PCI/PCIe Infrastructure item in the roadmap. The networking smoke test and NVMe driver both consume this shared subsystem. The networking proposal’s Part 1 Step 1 has been updated to reference this phase.
-
GCP vs AWS as first cloud target. The first cloud proof should be imported-image serial-console boot on both providers when practical, because that validates image format, firmware, bootloader, and early ACPI without depending on cloud NICs. For the later usable-networked-instance milestone, a constrained first/second-generation GCP virtio-net target is the easiest first network proof; broader GCP coverage needs gVNIC, and AWS follows once the NVMe/ENA path or an explicit workaround is ready.
References
Specifications
- NVMe Base Specification 2.1 – register interface, queue semantics, command set
- PCI Express Base Specification – ECAM, MSI/MSI-X capability structures
- ACPI Specification 6.5 – MADT, MCFG, HPET table formats
- Intel SDM Vol. 3, Ch. 10 – APIC architecture (LAPIC, I/O APIC)
Crates
- acpi – no_std ACPI table parser
- virtio-drivers – no_std virtio (already in networking proposal)
Prior Art
- Redox PCI – microkernel PCI driver in Rust
- Hermit NVMe – unikernel NVMe driver
- rCore virtio – educational OS with virtio + PCI in Rust
- Linux gVNIC driver – reference for gVNIC register interface (~3000 LoC)
- Linux ENA driver – reference for ENA
Cloud Documentation
- GCP: Creating custom images
- GCP: Manually import boot disks
- GCP: Requirements to build custom images
- GCP: Persistent Disk storage interfaces
- AWS: Importing VM images
- AWS: VM Import/Export requirements
- AWS: VM Import/Export limitations
- AWS: EC2 UEFI boot mode requirements
- Azure: Creating custom images
- GCP: Choosing a NIC type
- GCP: Cloud Run overview
- GCP: Firestore Native mode
- GCP: Cloud Storage object versioning
- GCP: Secret Manager
- GCP: Cloud KMS overview
- GCP: Cloud KMS IAM
- GCP: Cloud KMS roles and permissions
- GCP: Cloud KMS key rotation
- GCP: Rotate a Cloud KMS key
- GCP: Enable and disable Cloud KMS key versions
- GCP: Destroy and restore Cloud KMS key versions
- AWS: Enhanced networking
- AWS: Nitro instances
- Azure: Accelerated Networking
- Azure: Microsoft Azure Network Adapter
- Azure: Manage Accelerated Networking
- Azure: NVMe overview
- Google Drive: application data folder
- Google Drive: Drive API scopes
- Firebase: Firestore offline persistence
- Firebase: Firestore security rule conditions
- Firebase: Firestore usage and limits
- Firebase: Google sign-in for web
capOS Cross-Links
docs/design-risks-register.md– R13 (trusted build inputs are partly pinned) consolidates the long-horizon supply-chain risk view that gates cloud-image release paths; this proposal is recorded as a secondary owner.docs/trusted-build-inputs.md– the actual inventory of pinned and observed-not-pinned build inputs, dependency policy, vendored upstream snapshots, and the build-provenance retention/comparison policy that cloud proofs must satisfy before they are cited as production evidence.docs/tasks/done/2026-06-07/cloud-usable-instance-provider-nic-storage.md– the completed GCP-first usable-instance provider rollup covering provider NIC/storage authority, DMA backend selection, cloud teardown, and serial-console operator access.docs/dma-isolation-design.md– DMA isolation backend selection (kernel-owned bounce buffers vs IOMMU/remapping) that cloud provider drivers must commit to before claiming usable-instance status.docs/backlog/hardware-boot-storage.md– DDF Tasks 5 (userspace driver authority) and 6 (recurring cloud-portability gate) referenced from Phase 1 closeout above.
Proposal: Live Upgrade
Replacing a running service with a new binary, without dropping outstanding
capability references or losing in-flight work. The kernel-side primitive
(CapRetarget) is owned by this proposal; the surrounding orchestration
(supervisors, manifest sources, fault containment) is owned by
service-architecture-proposal.md and consumes the primitive defined here.
Problem
In a Linux-like system, “upgrading a service” is one of:
- Restart: stop the old process, start the new one. Clients holding
file descriptors, sockets, or pipes to the old process receive
ECONNRESETorEPIPEand must reconnect. Session state is lost unless clients serialize it themselves. - Graceful restart (nginx
-s reload, unicorn, systemd socket activation): new process starts alongside old, inherits the listening socket, old drains in-flight requests. Works only for request/response protocols where the session is the request. Does nothing for stateful sessions. - Live patch (kpatch, ksplice): binary-level function replacement. Narrow, fragile, no schema for state layout changes.
None of these compose with a capability OS. A CapId held by a client
points at a specific process; if that process exits, the cap is dead.
There is no “the service” abstraction the kernel could re-bind — the
point of capabilities is that they identify a specific reference, not
a name that could be redirected after the fact.
But capOS has a kernel-side primitive the Linux model lacks: the kernel
already owns the authoritative table of every CapId and which process
serves it. Rewriting “cap X is served by process v1” → “cap X is served
by process v2” is a table update. The question is when it is safe, and
how v2 inherits enough state to answer the next call.
Three Cases
Live upgrade has three distinct cost profiles. The right design is to make each one explicit rather than pretend the hard case doesn’t exist.
Case 1: Stateless services
Each SQE is independent; the service holds no state that matters across calls. A request router, a pure codec, a logger that flushes to an external sink.
Upgrade is trivial: start v2, retarget every CapId from v1 to v2,
exit v1. Clients may observe a small latency spike; no DISCONNECTED
CQE fires. Only the kernel primitive is needed.
Case 2: State externalized into other caps
The service’s in-memory data is a cache or dispatch table; durable state
lives behind caps the service holds (Store, SessionMap, Namespace).
v1’s held caps are passed to v2 at spawn time (via the supervisor, per
the manifest), kernel retargets client caps, v1 exits.
Architecturally this is the idiomatic capOS pattern: services stay thin, state is factored into dedicated holders with their own caps. The Fetch/HttpEndpoint split in the service-architecture proposal already pushes in this direction. In that world, most services fall into this bucket by construction.
Case 3: Stateful services requiring migration
The service has in-memory state that matters: a JIT’s code cache, a codec’s ring buffer, a parser’s arena, session data not yet flushed. Upgrade requires v1 to hand its state to v2.
capOS’s contribution here is that the state wire format is already capnp — the same format the service uses for IPC. v1 serializes its state as a capnp message; v2 consumes it. There is no separate serialization layer to build and no opportunity for it to drift from the IPC format.
The contract extends the service’s capnp interface:
interface Upgradable {
# Called on v1 by the supervisor. Returns a snapshot of service
# state and stops accepting new calls. Calls already in flight
# complete before the snapshot returns.
quiesce @0 () -> (state :Data);
# Called on v2 after spawn. Loads state from the snapshot. After
# this returns, v2 is ready to serve calls.
resume @1 (state :Data) -> ();
}
The state schema is service-defined. Schema evolution follows capnp’s standard rules: adding fields is backward-compatible, renaming requires care, removing requires a major version bump.
Kernel Primitive: CapRetarget
The kernel exposes the retarget as a capability method, not a syscall:
interface ProcessControl {
# Atomically redirect every CapId currently served by `old` to
# be served by `new`. Requires: `new` implements a schema
# superset of `old` (schema-id compatibility), `new` is Ready,
# `old` is Quiesced (graceful) or the caller has permission to
# force.
retargetCaps @0 (old :ProcessHandle, new :ProcessHandle,
mode :RetargetMode) -> ();
}
enum RetargetMode {
graceful @0; # old must be Quiesced; in-flight calls drain on old
force @1; # caps redirect immediately; in-flight calls fail
}
Only a process holding a ProcessControl cap to both processes can
perform this — typically the supervisor that spawned them. The kernel
never initiates upgrades.
Atomicity is per-CapId. From a client’s perspective, the retarget is a
single point in time: a CALL SQE submitted before retarget goes to v1;
a CALL SQE submitted after goes to v2. A CALL already dispatched to v1
either completes there (graceful) or returns a DISCONNECTED CQE
(force).
Supervisor-Level Upgrade Protocol
The primitives above compose into a protocol the supervisor runs:
1. spawn v2 from the new binary in the manifest
2. Case 1 & 2: v2.resume(EMPTY_STATE)
Case 3: state = v1.quiesce()
v2.resume(state)
3. kernel.retargetCaps(v1, v2, graceful)
4. wait for v1 to drain (graceful mode)
5. v1.exit()
If any step fails, the supervisor rolls back: kill v2, resume v1 (if quiesced), log the failure. Because the retarget hasn’t happened yet, clients never observe the aborted attempt.
In-Flight Calls
The subtle case is a client that has already posted a CALL SQE to v1 when the retarget happens. Two options:
- Graceful mode. v1 finishes the call, kernel routes the CQE back to the client on v1’s ring. v1 exits only after its ring is empty. This preserves call semantics; v1 and v2 coexist briefly.
- Force mode. The in-flight CALL returns
DISCONNECTED. Client retries against v2. Appropriate when v1 is wedged andquiescewon’t return.
In graceful mode the client cannot distinguish “call landed on v1” from “call landed on v2” — which is the point. Capability identity survives the upgrade; process identity does not.
Relationship to Fault Containment
Live upgrade and fault containment (driver panics → supervisor respawns) share machinery. The difference is one step of the protocol:
- Fault containment: v1 has crashed; kernel has already marked it
dead and epoch-bumped its caps. Supervisor spawns v2, issues a
graceful retarget (no quiesce — v1 is gone; in-flight CALLs already
delivered
DISCONNECTED). Clients reconnect to v2. - Live upgrade: v1 is healthy; supervisor initiates
quiesce→ state transfer → retarget, and no CQE ever reportsDISCONNECTEDto any caller.
The epoch-based revocation work from Stage 6 is the foundation for both. CapRetarget is one additional primitive layered on top.
Security and Trust
Live upgrade does not expand the trust model. The supervisor already holds the authority to kill, restart, and reassign caps for services it spawned — upgrade is a refinement of that authority, not a new principal. Requirements:
- Only a holder of
ProcessControlcaps to botholdandnewcan callretargetCaps. By construction this is the supervisor that spawned them. - The new binary must be legitimately obtained — in practice, loaded from the same content-addressed store as everything else (ties to Content-Addressed Boot).
- Schema compatibility (
newis a superset ofold) is checked by the kernel before retarget. This prevents an upgrade from silently narrowing the interface clients depend on.
Non-Goals
- Code hot-patching. No binary-level function replacement. Upgrade is at the process boundary, not the symbol boundary.
- Kernel live replacement. Covered by Reboot-Proof / process persistence (reboot with state preserved, not live replacement). The kernel is a single trust domain; replacing it in place needs a different design.
- Automatic schema migration across incompatible changes. If v2’s state schema is not a capnp-evolution-compatible superset of v1’s, the service author writes the migration. The kernel does not.
- System-wide registry of upgradable services. The supervisor knows what it spawned; there is no ambient discovery.
Phased Implementation
- CapRetarget primitive. Kernel operation +
ProcessControlcap. Useful immediately for stateless services (Case 1) and as the foundation of Fault Containment (respawn with a new process, point its caps to a fresh instance). - Upgradable interface. Schema, contract documentation, and a
Rust helper in
capos-rtthat services derive. - Graceful drain. Quiesce + in-flight call completion + v1 exit synchronization.
- Stateful demo. A service maintaining session state, upgraded live with zero session loss. This is the Live Upgrade observable milestone.
Related Work
- Erlang/OTP
code_change/3is the closest prior art: processes upgrade their behavior module in place, with a callback to migrate state. capOS differs only in that state transport goes through capnp rather than Erlang term format, and that the process boundary is an OS process rather than a BEAM process. - Fuchsia component updates rebind component instances in the routing graph. Similar primitive in a different mechanism.
- nginx
-s reloadis graceful restart for request/response servers. The design here generalizes it by exposing the state migration point explicitly rather than relying on “the session is the request.”
Cross-Links
service-architecture-proposal.md— owns the supervisor surface that drives this proposal’s protocol. The “Supervisors” and “Supervision Tree” sections describe the principal that holdsProcessControlcaps to botholdandnewand runs spawn →quiesce→resume→retargetCaps→ drain →exit. The “Service Taxonomy” entry Upgrade manager is the per-system orchestrator that consumesCapRetargetfor live replacement, distinct from a per-subtree supervisor that uses the same primitive for fault containment (respawn after crash). Schema compatibility fornewvsoldis the same superset check the manifest executor and the boot package contract already require, not a new policy invented here.cloud-deployment-proposal.md— owns the binary delivery story this proposal depends on.newmust be obtained from the same content-addressed boot package / image-update pipeline the cloud deployment plan describes, not from an ad-hoc path. Cloud-managed services (KMS clients, metadata agents, log/metric shippers, the cloud-metadata agent itself) are exactly the Case 2 / Case 3 services where this proposal’s value shows up first: they hold long-lived caps to upstream cloud APIs, and a restart that drops those caps either re-runs IAM/JWT handshakes or, worse, drops audit/log shippers’ in-flight buffers. The bootable disk image / NVMe path defines what “update the binary” means on real hardware; until then the manifest-embeddedBootPackageblobs are the only source ofnew.storage-and-naming-proposal.md— owns the Case 2 holders (Store,SessionMap,Namespace) the idiomatic service factoring relies on, and the future sealed/stored capability path that lets state survive across reboot, not just across live upgrade. Case 3 state-transfer is the strictly weaker contract: same capnp wire format, but the snapshot only has to outlive a singleretargetCapscall, not power loss.system-monitoring-proposal.md—quiescestart,resumecompletion,retargetCapsmode (graceful vs force), drain duration, and rollback (killnew, resumeold) are audit-worthy lifecycle events. The upgrade manager emits them through the audit cap so an operator can correlate a service binary change with downstream behavior. Graceful upgrades by definition emit zeroDISCONNECTEDCQEs; force-mode and fault-containment respawns do, and that distinction is what the audit record has to preserve.security-and-verification-proposal.md—retargetCapsis a natural target for bounded modeling: per-CapId atomicity (no SQE submitted before retarget lands onnew; no SQE submitted after lands onold), graceful-mode in-flight completion (old’s ring drains beforeexit), and schema-superset enforcement at the kernel before retarget. Force-modeDISCONNECTEDdelivery is the same epoch-revocation path the fault-containment story already needs, not a separate kernel surface.../design-risks-register.md— the register currently carries no dedicated R-entry for live upgrade, which is intentional: no implementation exists yet. The closest cross-cutting entries are R6 (CAP_OP_RELEASEis deferred), because graceful drain has to outlive the per-process release path beforev1.exit()is safe; R12 (verification coverage is partial), because the per-CapId retarget atomicity and graceful-drain invariants belong in a bounded model before this lands; and Q7 (revocation strategy), because force-mode retarget shares the epoch path the open revocation decision will pick. Open a dedicated R-entry onceCapRetargetlands in code, since at that point retarget atomicity, graceful-drain shutdown, and the supervisor-only authority constraint become long-horizon design surfaces in their own right.
Proposal: Capability-Oriented GPU/CUDA Integration
Purpose
Define a minimal, capability-safe path to integrate GPU-class accelerators (NVIDIA/CUDA, AMD, Intel, plus future ML-accelerator boards) into capOS without expanding kernel trust.
The kernel keeps direct control of hardware arbitration and trust boundaries. GPU hardware interaction is performed by a dedicated userspace driver service that is invoked through capability calls and that holds device-scoped bootstrap grants for its single managed device.
This proposal is a downstream consumer of:
- LLM and agent proposal – defines the
LanguageModel/Embedder/ImageModelcapability surface that benefits from GPU-backed inference backends. The agent runtime treats a GPU-backed model process as just anotherLanguageModelcapability holder; the GPU service proposed here is one of the substrate choices the model process may use. - Userspace binaries proposal – defines
the native Rust over
capos-rtuserspace runtime, thex86_64-unknown-capostarget, and the libcapos C-substrate path that any vendor SDK adapter (CUDA, ROCm, OpenCL, oneAPI) must link against. The GPU service runs as one such userspace binary, not as a kernel module.
Positioning Against Current Project State
capOS currently provides infrastructure that is directly load-bearing for a future GPU service:
- Process lifecycle, page tables, preemptive scheduling (PIT 100 Hz, round-robin, context switching).
- A global and per-process capability table with
CapObjectdispatch. - Shared-memory capability ring (io_uring-inspired) with syscall-free SQE
writes.
cap_entersyscall for ordinary CALL dispatch and completion waits. - PCI/PCIe enumeration over both legacy I/O ports and ACPI MCFG ECAM, plus reusable memory-BAR subregion validation and kernel MMIO mapping helpers for diagnostics and driver bring-up.
- MSI/MSI-X capability metadata discovery and typed MSI-X table programming,
proven end-to-end through the virtio-net
make run-netsmoke. - I/O APIC routing for masked legacy IRQ programming via MADT.
- Kernel-owned device interrupt source records plus a bounded first-fit device MSI vector pool with lock-free dispatch slots and claimed-route reassignment/release.
- Kernel-owned DMA pool accounting ledger that tracks pool bytes, live page count, page-rounded MMIO mapping bytes, interrupt holds, ring depth, and descriptor submission/completion counts for the current virtio-net path.
- Bootstrap-grant authority hooks for
DeviceMmio,DMAPool,Interrupt, andHardwareAuditLogcapabilities, exercised by themake run-devicemmio-grant,make run-dmapool-grant,make run-interrupt-grant, andmake run-hardware-auditsmokes.
What does not exist yet and gates real GPU work:
- A userspace driver-authority gate. Today the kernel still owns virtio-net, the DMA pool ledger, and the MSI-X dispatch table. The DDF bootstrap-grant smokes prove the schema and grant plumbing for the typed device caps, but there is no userspace driver process that consumes those grants to run a real driver. GPU integration cannot land before that gate moves.
- IOMMU/DMA-remapping integration (VT-d / AMD-Vi). Until a userspace driver is constrained by IOMMU domains, no production GPU stack can be granted bus-master DMA on a multi-tenant host.
- A
LanguageModelcapability surface to consume the GPU service. The LLM proposal defines the schema target; the GPU service is one backend choice.
That means GPU integration must be staged. The early phases are capability schema and mock-service exercises that ride on the existing DDF bootstrap grants; real hardware backends arrive after the userspace-driver authority gate, IOMMU integration, and at least one consuming model surface exist.
Design Principles
- Keep policy in kernel, execution in userspace. The kernel arbitrates device claims, MMIO mapping, MSI-X table programming, and DMA-pool accounting; the driver service implements vendor-specific command submission and queue management.
- Never expose raw PCI/MMIO/IRQ details to untrusted processes. Clients see
only
GpuSession/GpuBuffer/GpuFencecapabilities, neverDeviceMmioorInterrupt. - Make GPU access explicit through narrow capabilities. The interface is the
permission; a client that should not launch kernels is given a session
type that does not expose
launchKernel. - Treat every stateful resource (session, buffer, queue, fence, command pool) as a capability with revocability and bounded lifetime.
- Avoid a Linux-driver-in-kernel compatibility dependency. Vendor SDK code runs in the userspace driver service, linked through libcapos / libcapos-posix shims where vendor headers expect a POSIX-ish surface.
- Charge GPU memory and submission depth through the existing
ResourceLedgermechanism rather than inventing a parallel accounting surface.
Proposed Architecture
capOS kernel (minimal) exposes only resource and mediation capabilities.
gpu-device service (userspace) receives device-specific bootstrap grants
(DeviceMmio, DMAPool, Interrupt, HardwareAuditLog) for exactly one
GPU function and exposes a stable GPU capability surface to clients.
application (e.g. an LLM model server, a numeric workload, a
robot brain inference loop) receives only
GpuSession/GpuBuffer/GpuFence capabilities and never sees the
device-scoped grants.
Kernel responsibilities
- Discover GPUs from PCI/ACPI layers (already implemented for non-GPU functions; GPUs are the same discovery path with different class codes).
- Map/register BAR windows and grant a scoped
DeviceMmiocapability bound to one decoded memory BAR. - Set up MSI/MSI-X routing and expose scoped
Interruptcapability per vector with masked-route lifecycle semantics matching the current virtio-net proof. - Hand out a bounded
DMAPoolcapability whose accounting ledger charges back to the driver process’s resource ledger and that participates in IOMMU-domain constraints once those exist. - Enforce revocation when sessions are closed:
DeviceMmio/Interrupt/DMAPoolgrants tear down through the bootstrap-grant manager. - Record device-manager actions through
HardwareAuditLogsnapshots (already proven for the DDF smokes). - Handle all faulting paths that would otherwise crash the kernel: a buggy driver service must crash the service, not the kernel.
Userspace GPU service responsibilities
- Open and initialize one GPU device from its device-scoped bootstrap grants. One driver process per GPU function is the working assumption; multi-function boards may run one process per function.
- Allocate and track GPU contexts, command queues, and DMA buffers backed
by the granted
DMAPool. - Implement command submission, buffer lifecycle, fence/completion signaling, and timeout enforcement.
- Translate capability calls into vendor SDK operations (CUDA driver API, ROCm, oneAPI, OpenCL, or a vendor-neutral runtime such as a WebGPU/wgpu-style abstraction).
- Expose only narrow, capability-typed handles to callers and refuse any attempt to surface raw MMIO/IRQ/DMA to clients.
Consumer surfaces
- LLM/embedder model servers from
Language Models and Agent Runtime. The
GPU-backed model process holds a
GpuSession, exposes aLanguageModelorEmbeddercapability, and is itself a normal userspace binary built per Userspace Binaries. - Numerical / HPC workloads from HPC Parallel Processing Patterns once that proposal expands to GPU offload.
- Robotics inference loops from capOS As A Robot Brain.
Capability Contract (schema additions)
Add to schema/capos.capnp (interface-level sketch; final wire layout is
fixed in the implementation slice):
GpuDeviceManagerlistDevices() -> (devices: List(GpuDeviceInfo))openDevice(capabilityIndex :UInt32) -> (session :GpuSession)
GpuSessioncreateBuffer(bytes :UInt64, usage :Text) -> (buffer :GpuBuffer)destroyBuffer(buffer :UInt32) -> ()launchKernel(program :Text, grid :UInt32, block :UInt32, bufferList :List(UInt32), fence :GpuFence) -> ()submitMemcpy(dst :UInt32, src :UInt32, bytes :UInt64) -> ()submitFenceWait(fence :UInt32) -> ()
GpuBuffermapReadWrite() -> (addr :UInt64, len :UInt64)unmap() -> ()size() -> (bytes :UInt64)close() -> ()
GpuFencepoll() -> (status :Text)wait(timeoutNanos :UInt64) -> (ok :Bool)close() -> ()
Sessions are the natural restriction point: a model-server session granted
to an LLM process can omit launchKernel entirely and expose only memcpy
plus an opaque runProgram(programCap, ...) if the model image is itself a
separately-vetted capability. The interface is the permission; do not add
parallel rights bitmasks.
Implementation Phases
Phase 0 (prerequisite, landed): kernel capability ring and DDF grants
The Cap’n Proto schema, capability ring, cap_enter dispatch, PCI/MSI-X
discovery, and the DeviceMmio/DMAPool/Interrupt/HardwareAuditLog
bootstrap-grant smokes already exist. No new kernel surface is required for
this phase; the schema additions for Gpu* are pure userspace work once a
driver service is permitted.
Phase 1: Userspace driver-authority gate (cross-track prerequisite)
GPU work cannot land before the userspace driver-authority gate. Required pieces, tracked by the device-manager refactor and DMA-isolation design:
- Move virtio-net or another known-good driver out of the kernel and into a userspace driver process consuming the DDF bootstrap grants end-to-end.
- Add an IOMMU integration path (VT-d / AMD-Vi) so that bus-master DMA granted to a driver process is constrained to its registered DMA pages.
- Add a
device-manageruserspace service that ownsManagerGrantSource-class capabilities and is the only process that handsDeviceMmio/DMAPool/Interrupt/HardwareAuditLoggrants to driver services.
This phase is owned by the device-manager and DMA-isolation tracks; the GPU proposal consumes it.
Phase 2: Mock GPU service
- Add the
Gpu*schema inschema/capos.capnp. - Implement a
gpu-mockuserspace service with the fullGpu*interface, no real driver, and synthetic fences and buffers backed by ordinary anonymous memory. - Prove end-to-end:
- device-manager spawns the mock driver and grants it a fake-device bootstrap grant set.
- a client process opens a session, allocates and maps a buffer, submits a synthetic job, and waits on a fence.
- Add a focused QEMU smoke (
make run-gpu-mock) that asserts the round-trip and demonstrates revocation on session close.
Phase 3: Real backend integration on one vendor
- Pick one concrete GPU backend available in CI environment (likely NVIDIA
on a workstation host with
-device vfio-pcipassthrough into QEMU, or a virtio-gpu / venus virtualized path as a first stand-in). - Vendor SDK code lives in the userspace driver process. Where the SDK expects a POSIX-ish surface, route it through libcapos-posix rather than expanding the kernel.
- Add queue lifecycle, fence lifecycle, DMA registration/validation, command execution path, interrupt completion plumbing back to clients through fences.
- Keep backend replacement possible via a trait-like abstraction inside the driver process so a second vendor backend (AMD ROCm, Intel oneAPI) can be added later without rewriting the service.
Phase 4: Security and reliability hardening
- Per-session limits for mapped pages, in-flight submissions, and queue
depth, charged through
ResourceLedger. - Bounded wait timeouts and explicit fence cancellation semantics so a
hung GPU does not pin a client’s
cap_enter. - Revocation propagation:
GpuSessionclose => all childGpuBuffer/GpuFencecaps revoked.- driver crash / device reset => all active caps fail closed with a typed exception.
- Audit hooks for
launchKernel/submitMemcpyrecorded throughHardwareAuditLog-style snapshots scoped to the GPU service. - Coordination with the
live-upgrade proposal so the GPU driver
service can be replaced without dropping client
GpuSessioncaps.
Phase 5: Multi-tenant and multi-device
- Multiple driver processes (one per GPU function) under a single device-manager.
- Cross-device buffer sharing only through explicit capability transfer; no implicit peer mappings.
- Workload isolation: distinct tenants on a single GPU receive distinct sessions with their own queue, memory budget, and audit stream.
Security Model
The kernel does not grant any user process direct MMIO, MSI, or bus-master DMA access. All such authority is mediated through the device-manager.
Application processes only receive:
GpuSession/GpuBuffer/GpuFencecapabilities with the methods the session policy chose to expose.
The GPU driver service process receives:
DeviceMmiobound to the function’s decoded BARs.Interruptcapabilities for the function’s claimed MSI vectors.DMAPoolbounded to the function’s IOMMU domain.HardwareAuditLogfor snapshotting device-manager actions.
This ensures:
- No userland process can program BAR registers.
- No userland process can claim untrusted memory for DMA.
- No userland process can observe or reset another session’s state.
- A buggy or compromised driver crashes the driver process, not the kernel; the device-manager observes the crash, fails outstanding capabilities closed, and re-spawns the driver on the next session request.
Dependencies and Alignment
This proposal depends on:
- Device-manager refactor proposal for the userspace device-manager that owns the bootstrap-grant sources.
- DMA-isolation design and IOMMU integration so DMA grants are enforceable in a multi-tenant context.
- Userspace-binaries proposal for the
driver-process runtime, libcapos / libcapos-posix surface for vendor SDK
consumption, and the
x86_64-unknown-capostarget. - LLM and agent proposal for the primary
consumer surface (
LanguageModel,Embedder) and the agent runtime that exercises GPU-backed inference end-to-end. - Resource-accounting proposal for per-session memory and submission budgets.
- Live-upgrade proposal for driver-service
replacement without dropping
GpuSessioncapabilities.
It complements:
- Service-architecture and authority-broker proposals.
- Storage/service manifest execution flow for shipping GPU service binaries and their bootstrap grants.
- In-process threading work for future queue completion callbacks and worker pools inside the driver service.
Minimal acceptance criteria
make run-gpu-mockboots and prints GPU service lifecycle messages.- The device-manager spawns the GPU service and grants only device-scoped bootstrap grants for a single mock function.
- A sample userspace client (Rust over capos-rt; C smoke later through libcapos) can create a session, allocate and map a GPU buffer, submit a synthetic job, and wait on a fence with a typed completion result.
- Attempts to submit unsupported or malformed operations return explicit
capnp
CapExceptionresults, not driver crashes. - Removing the session capability invalidates descendant buffer and fence caps without kernel restart.
- A subsequent slice points an LLM model server at the GPU service and
proves a
LanguageModel.generate(...)round-trip backed by the GPU session, satisfying the LLM proposal’s GPU-backend integration point.
Risks
- Real NVIDIA closed stack integration may require vendor-specific adaptation that is hostile to a capability shim; the AMD ROCm or vendor-neutral path (Vulkan compute, WebGPU/wgpu) may land first.
- Buffer mapping semantics become complex with paging, fragmentation, and IOMMU domains. Pinned physical-memory-only buffers are the conservative starting point.
- Interrupt-heavy completion paths require the scheduler evolution work (per-CPU run queues, fairness) before client-visible completion guarantees scale beyond a single workload.
- Vendor SDKs assume a POSIX-ish process model; the libcapos-posix surface has to grow enough to host them without leaking ambient authority.
- A GPU driver process is privileged from the application’s point of view. Compromise of a single driver process must remain bounded to one GPU function and one tenant set; the device-manager and IOMMU are the load-bearing controls there.
Open Questions
- Is CUDA mandatory from first integration, or is the initial surface command-focused (opaque “program” bytes interpreted by the driver) with CUDA runtime-specific support added later?
- Should memory registration support pinned physical memory only at first,
or attempt to expose unified-virtual-memory semantics through the
client’s
VirtualMemorycapability? - Which isolation level is needed for multi-tenant versus single-tenant in the first real-backend phase? Single-tenant per GPU function is the conservative default; MIG / SR-IOV-style partitioning is later work.
- Does the GPU service expose model artifacts (weights, programs) as separate capability types so a model file can be granted to clients without the full session, or are programs always inline arguments?
Proposal: capOS As A Robot Brain
How capOS should grow into a capability-oriented robot brain for manufacturing robots, mobile robots, RC cars, drones, and autonomous-vehicle research without collapsing safety, realtime, perception, planning, and operator control into one trusted process.
Purpose
capOS has the right architectural ingredients for robotics: isolated processes, explicit capabilities, typed IPC, revocation, memory objects, service composition, audit direction, and future scheduling contexts. Robotics is a useful forcing function because it combines physical authority with mixed-criticality timing:
- a camera pipeline can drop frames;
- a local planner can miss a cycle and recover;
- a wheel command must expire safely;
- a robot arm must obey limits;
- an e-stop must not depend on a model, network, shell, or log service.
The proposal is not “run every control loop in the kernel.” It is a staged robotics architecture where capOS owns authority routing, service isolation, telemetry, update, planning, and eventually admitted realtime islands, while the tightest safety loops remain on certified controllers or MCUs until capOS has evidence to replace them.
Goals
- Define a capability-native robot service graph.
- Separate safety, realtime control, perception, planning, operator UI, simulation, manufacturing integration, and agents.
- Make actuator authority explicit, revocable, logged, and bounded by mode, safety state, command freshness, and limits.
- Support compatibility bridges for ROS 2, micro-ROS, MAVLink, OPC UA, and simulation tooling without turning them into ambient authority tunnels.
- Provide a path from simulation to small physical robots before industrial or vehicle safety claims.
- Reuse
MemoryObjectrings, notification/futex paths, and future scheduling contexts for sensor streams and control loops.
Non-Goals
- Replacing certified safety PLCs, flight controllers, servo drives, or vehicle safety controllers in the near term.
- Claiming IEC 61508, ISO 13849, ISO 10218, or ISO 26262 compliance.
- Putting model inference or natural-language agents in direct control of actuators.
- Making ROS 2 an ambient compatibility layer with implicit access to every capOS service.
- Copying large sensor frames through Cap’n Proto payloads in the data path.
Architecture
flowchart LR
Operator[Operator UI / shell / teleop] --> Mission[Mission and behavior]
Agent[Agent runner] --> Mission
Mission --> Planner[Planner]
Planner --> Controller[Realtime controller island]
Controller --> Actuator[Actuator gateway]
Actuator --> Hardware[MCU / PLC / drive / autopilot]
SensorHW[Camera / lidar / IMU / encoders] --> SensorSvc[Sensor services]
SensorSvc --> Perception[Perception]
Perception --> World[World model]
World --> Planner
Safety[Safety monitor] --> Mission
Safety --> Controller
Safety --> Actuator
Bridges[ROS 2 / MAVLink / OPC UA bridges] --> Mission
Bridges --> SensorSvc
Bridges --> Actuator
Audit[Audit and telemetry] --- Mission
Audit --- Controller
Audit --- Actuator
Principal split:
Sensor servicesown device-facing capture authority and publish typed streams or snapshots.Perceptionconsumes sensor streams and emits world-model updates.Mission and behaviorchooses tasks, modes, and goals.Plannercomputes paths, trajectories, or setpoints within policy.Realtime controller islandturns admitted inputs into cyclic commands.Actuator gatewayis the only holder of hardware command authority.Safety monitorobserves independent safety state and can force stop, neutral, disarm, or mode degradation.Agent runnermay propose or explain actions but does not hold actuator caps.- Compatibility bridges receive narrow imported/exported caps.
Core Rule
No process gets both broad interpretation authority and raw physical authority.
Examples:
- A language model may emit a structured proposal; it does not receive
ActuatorCommand. - A ROS bridge may publish odometry and accept a velocity command cap; it does not receive the whole capOS service graph.
- A planner may receive a goal and produce a trajectory; it does not directly program motor registers.
- An actuator gateway may command hardware; it does not fetch network content or run operator scripts.
Robot Capabilities
The first schema should stay small and control-plane oriented. Bulk sensor data
uses MemoryObject rings.
interface RobotDescription {
describe @0 () -> (description :RobotDescriptionSnapshot);
readFrameTree @1 () -> (frames :FrameTreeSnapshot);
}
interface SensorStream {
describe @0 () -> (info :SensorInfo);
openRing @1 (config :StreamConfig) -> (ring :MemoryObject);
readStatus @2 () -> (status :StreamStatus);
}
interface ActuatorCommand {
describe @0 () -> (info :ActuatorInfo);
submit @1 (frame :CommandFrame) -> (accepted :Bool);
neutral @2 (reason :Text) -> ();
}
interface SafetyState {
read @0 () -> (state :SafetySnapshot);
subscribe @1 () -> (events :SensorStream);
}
interface ControlLoop {
describe @0 () -> (info :LoopInfo);
start @1 () -> ();
stop @2 (reason :Text) -> ();
readTelemetry @3 () -> (telemetry :LoopTelemetry);
}
CommandFrame must carry:
- sequence number;
- monotonic timestamp;
- deadline;
- command mode;
- coordinate frame;
- limit profile;
- typed payload;
- source identity;
- optional safety-envelope revision.
Command freshness is mandatory. If the frame is stale, the actuator gateway rejects it or transitions to neutral/safe state according to policy.
Data Plane
Cap’n Proto is the control plane. Sensor and actuator streams need fixed-layout shared rings:
sequence
capture_time_ns
deadline_ns
frame_id
format
offset
length
flags
source_epoch
The ring can carry camera frames, lidar scans, IMU batches, encoder samples,
audio-like streams, or command telemetry. Payload bytes live in MemoryObject
backing storage. Producers and consumers coordinate through notification or
futex-like wakeups. Slow consumers drop or skip according to policy; they do
not backpressure a guaranteed control island.
Realtime Islands
The robot-control equivalent of the media graph’s guaranteed realtime island is an admitted control loop:
flowchart LR
Sense[read sensors] --> Snapshot[input snapshot]
Snapshot --> Update[controller update]
Update --> Clamp[limit and safety clamp]
Clamp --> Write[write actuator command]
Write --> Telemetry[non-RT telemetry export]
Admission requires:
- fixed period and deadline;
- scheduling context with budget;
- preallocated input, output, and telemetry buffers;
- no allocation in the cycle;
- no blocking endpoint calls in the cycle;
- no credential checks, logging, service discovery, or model inference;
- bounded data-age policy;
- command-limit and clamp policy;
- stale-command watchdog;
- overrun behavior.
Failure behavior is part of the contract. An overrun, stale input, revoked cap, or failed write should produce a deterministic result: hold, neutral, stop, drop, degrade mode, or fault the island. It should not build an unbounded queue of late commands.
Compatibility Bridges
ROS 2 Bridge
The ROS 2 bridge should map selected topics, services, and actions to capOS capabilities. It must be configured from a manifest or broker policy:
- which ROS topics can be imported;
- which capOS sensor streams can be exported;
- which commands can reach an actuator gateway;
- freshness and rate limits;
- whether messages are best-effort, reliable, latched, or deadline-bound;
- how frames and transforms are mapped.
The bridge is not a general “ROS graph has all caps” adapter.
micro-ROS / MCU Bridge
For small robots, the MCU bridge is the first practical hardware path:
- MCU closes motor PID, bumper debounce, watchdog, and current limits;
- capOS sends bounded velocity/setpoint frames;
- MCU publishes encoder, IMU, battery, bumper, and fault streams;
- stale capOS commands force neutral behavior.
MAVLink / Autopilot Bridge
For drones and some rovers:
- autopilot owns arming, stabilization, failsafe, and flight termination;
- capOS consumes telemetry and sends high-level setpoints or missions;
- bridge enforces geofence, mode, rate, and authority limits;
- direct actuator override is absent or privileged behind stronger policy.
OPC UA / Manufacturing Bridge
For industrial cells:
- OPC UA gateway imports cell, robot, fixture, and job state;
- capOS exposes typed job/status/alarm caps;
- robot program selection and start/stop are separate authorities;
- safety state is read independently and cannot be overridden by job logic.
Product-Level Targets
Simulation Robot
The first milestone should be visible without hardware: boot capOS, launch a simulated differential-drive robot, publish fake lidar/odometry, run a behavior service, send bounded drive commands, and log telemetry. This proves the capability graph and stale-command behavior.
Vacuum / Indoor Mobile Robot
Next target: capOS on an SBC with an MCU base controller.
- capOS runs mapping, local planning, cleaning behavior, docking, UI, and logs.
- MCU runs wheel control, bumper/cliff protection, and motor watchdog.
BaseDriveaccepts velocity commands with deadlines.- Loss of capOS or command authority stops motion.
RC Car / Rover
RC-car class demo:
- camera/IMU/GPS sensor services;
- teleop and autonomous mode caps;
- steering/throttle gateway with watchdog;
- geofence and speed envelope;
- logs for every actuator-affecting command.
Manufacturing Cell Supervisor
Industrial demo:
- OPC UA or mock PLC gateway;
- robot program selection as a typed capability;
- cell-state and alarm streams;
- operator approval for mutating actions;
- no attempt to replace certified safety functions.
Autonomous Vehicle Research Host
Autoware-like demo:
- perception, localization, planning, control, and vehicle-interface services;
- simulator or closed-course interface;
- independent safety gateway;
- command envelopes and audit.
This remains a research host, not a road-certified system.
Security Invariants
- Actuator gateways are narrow and mode-limited.
- Safety monitor authority is independent from planner and agent authority.
- Model processes never receive actuator, safety, or raw device caps.
- Operator UI receives consent and status caps, not raw hardware caps.
- Bridges do not receive ambient service discovery authority.
- Every actuator-affecting command is auditable by source, mode, limits, safety-state revision, timestamp, and result.
- Revoking command authority causes stale handles and future commands to fail closed.
- Device-facing services obey the
DeviceMmio,DMAPool, andInterruptauthority model before userspace drivers touch physical hardware.
Scheduling Dependencies
This proposal depends on future scheduling work:
- per-thread rings for full-SMP ownership;
- notification objects for low-overhead wakeups;
- scheduling contexts with period/budget/priority;
- CPU affinity and isolation for admitted loops;
- TLB shootdown and SMP-safe address-space migration;
- timing telemetry and overrun events;
- eventually WCET evidence for hard-realtime claims.
Until those exist, docs and demos must say “bounded soft realtime” or “supervised external controller”, not “hard realtime.”
Implementation Sequence
- Add simulation-only robot services and typed fake sensor/actuator caps.
- Add
RobotDescription,SensorStream,ActuatorCommand,SafetyState, andControlLoopdraft schemas. - Add a QEMU/host smoke that proves stale drive commands fail closed.
- Add a differential-drive MCU bridge design and host-side simulator.
- Add ROS 2 bridge proposal detail for selected topics/actions and transforms.
- Add control-loop telemetry counters: period, execution time, overrun, data age, command age, clamp, neutral, and safety fault.
- Bind a local controller to scheduling contexts once the scheduler supports budgeted realtime islands.
- Add manufacturing gateway design over OPC UA or a mock PLC protocol.
- Add hardware-in-loop criteria before any real actuator demo is treated as a milestone.
Open Questions
- Should the first visible milestone be simulation-only or a small physical differential-drive base?
- Should robot schemas live in
schema/capos.capnpor a separate robotics schema compiled by the same build pipeline? - Which transform-tree representation fits capOS best: immutable snapshots, streaming deltas, or both?
- How should command envelopes compose when operator, planner, safety monitor, and actuator gateway all impose limits?
- What is the minimum useful ROS 2 bridge: topics only, or topics plus actions for Nav2-style navigation?
- Does
SensorStreamgeneralize the media-ring design, or should robotics get a distinct stream ABI?
References
- Robotics realtime control research
- Multimedia pipeline latency research
- Out-of-kernel scheduling research
- DMA Isolation Design
- Language Models and Agent Runtime
- Realtime Voice Agent Shell
- GPU Capability
- Networking
Proposal: Formal MAC/MIC Model and Proof Track
How capOS could move from pragmatic label checks to a formal mandatory access control and mandatory integrity control story suitable for a GOST-style claim.
Problem
Adding a label field to capabilities is not enough to claim formal
mandatory access control. ГОСТ Р 59453.1-2021 frames access control through a
formal model of an abstract automaton: the model describes states, subjects,
objects, containers, rights, accesses, information flows, safety conditions,
and proofs that unsafe accesses or flows cannot arise.
capOS should therefore separate two levels:
- Pragmatic label policy. Userspace brokers and wrapper capabilities enforce labels at trusted grant paths and selected method calls. The user/session side of this level is tracked in User Identity and Policy; this proposal does not redefine the broker, session, or local-account surface, only the formal model that would sit underneath it.
- Formal MAC/MIC. A documented abstract state machine, safety predicates, transition rules, proof obligations, and an implementation mapping. Only this second level can support a GOST-style claim. The verification tooling budget (TLA+/Alloy/Kani/Loom/Prusti/Creusot tracks) is owned by Security and Verification; this proposal feeds new obligations into that plan, it does not duplicate the tier definitions.
This proposal defines the path to the second level. It is not a claim that capOS currently satisfies it. The Design Risks and Open Questions entry Q13 – Formal properties to prove treats the current bounded-proof set (cap-table non-forgery, frame-bitmap invariants, transfer rollback, ring producer-consumer invariants) as the baseline that this proposal extends toward an abstract automaton – it is not a step toward seL4-style full functional refinement.
Scope
The first formal target should be narrow:
Confidentiality:
No transition creates an unauthorized information flow from an object at a
higher or incomparable confidentiality label to an object at a lower label,
except through an explicit trusted declassifier transition.
Integrity:
No low-integrity or incomparable subject can control a higher-integrity
subject, and no low-integrity subject can write or transfer influence into a
higher-integrity object, except through an explicit trusted upgrader or
sanitizer transition.
The proof should cover capability authority creation and transfer before it covers every device, filesystem, or POSIX compatibility corner. For capOS, capability transfer is the dangerous boundary.
Terminology
The Russian GOST terms to keep straight:
мандатное управление доступом: mandatory access control for confidentiality.мандатный контроль целостности: mandatory integrity control.целостность: integrity.уровень целостности: integrity level.уровень конфиденциальности: confidentiality level.субъект доступа: access subject.объект доступа: access object.
The standards separate confidentiality MAC from integrity control. capOS should not merge them into one vague label field.
Abstract State
The formal model should be intentionally smaller than the implementation. It models only the security-relevant state.
| Symbol | Meaning |
|---|---|
U | set of user accounts / principals |
S | set of subjects: processes, sessions, services |
O | set of objects: files, namespaces, endpoints, process handles, secrets |
C | set of containers: namespaces, directories, stores, service subtrees |
E | entities = O union C |
K | kernel object identities |
Cap | capability handles / hold edges |
Hold | relation S -> E with metadata |
Own | subject-control or ownership relation |
Ctrl | subject-control relation |
Flow | observed information-flow relation |
Rights | abstract rights: read, write, execute, own, control, transfer |
Access | realized accesses: read, write, call, return, spawn, supervise |
Hold is central. In capOS, authority is represented by capability table
entries and transfer records, not by global paths. A formal model that does
not model capability hold edges will miss the main authority channel.
Suggested hold-edge metadata:
HoldEdge {
subject
entity
interface_id
badge
transfer_mode
origin
confidentiality_label
integrity_label
}
Label Lattices
Use deployment-defined partial orders, not hardcoded government categories.
Example confidentiality lattice:
public < internal < confidential < secret
compartments = {project-a, project-b, ops, crypto}
dominates(a, b) means:
level(a) >= level(b)
and compartments(a) includes compartments(b)
Integrity should be separate:
untrusted < user < service < trusted
domains = {boot, storage, network, auth}
The model must specify how labels compose across containers:
- contained entity confidentiality cannot exceed what the container policy permits unless the container explicitly supports mixed labels;
- contained entity integrity cannot exceed the container’s integrity policy;
- a subject-associated object such as a process ring, endpoint queue, or process handle needs labels derived from the subject it controls or exposes.
Capability Method Flow Classes
capOS cannot rely on syscall names such as read and write. Each interface
method needs a flow class.
Initial categories:
ReadLike data flows object -> subject
WriteLike data flows subject -> object
Bidirectional data flows both ways
ControlLike subject controls another subject/object lifecycle
TransferLike authority or future data path is transferred
ObserveLike metadata/log/status observation
Declassify trusted downgrade of confidentiality
Sanitize trusted upgrade of integrity after validation
NoFlow lifecycle release or local bookkeeping only
Examples:
File.read ReadLike
File.write WriteLike
Namespace.bind WriteLike + ControlLike
LogReader.read ReadLike
ManifestUpdater.apply WriteLike + ControlLike
ProcessSpawner.spawn ControlLike + TransferLike
ProcessHandle.wait ObserveLike
ServiceSupervisor.restart ControlLike
Endpoint.call depends on endpoint declaration
Endpoint.return depends on endpoint declaration
CAP_OP_RELEASE NoFlow
CAP_OP_CALL transfers TransferLike
CAP_OP_RETURN transfers TransferLike
The flow table is part of the trusted model. Adding a new capability method without classifying its flow should fail review.
Transitions
The abstract automaton should include at least these transitions:
create_session(principal, profile)
spawn(parent, child, grants)
copy_cap(sender, receiver, cap)
move_cap(sender, receiver, cap)
insert_result_cap(sender, receiver, cap)
call(subject, endpoint, payload)
return(server, client, result, result_caps)
read(subject, object)
write(subject, object)
bind(subject, namespace, name, object)
supervise(controller, target, operation)
release(subject, cap)
revoke(authority, object)
declassify(trusted_subject, source, target)
sanitize(trusted_subject, source, target)
relabel(trusted_subject, object, new_label)
Each transition needs preconditions and effects. Example:
copy_cap(sender, receiver, cap):
pre:
Hold(sender, cap.entity)
cap.transfer_mode allows copy
confidentiality_flow_allowed(cap.entity, receiver)
integrity_flow_allowed(sender, cap.entity, receiver)
receiver quota has free cap slot
effect:
Hold(receiver, cap.entity) is added
Flow(cap.entity, receiver, transfer) is recorded when relevant
Move is not a shortcut. It has different authority effects but can still create an information/control flow into the receiver.
Safety Predicates
Confidentiality:
read_allowed(s, e):
clearance(s) dominates classification(e)
write_allowed(s, e):
classification(e) dominates current_confidentiality(s)
flow_allowed(src, dst):
classification(dst) dominates classification(src)
No write down follows from classification(dst) dominates classification(src).
Integrity:
integrity_write_allowed(s, e):
integrity(s) >= integrity(e)
control_allowed(controller, target):
integrity(controller) >= integrity(target)
integrity_flow_allowed(src, dst):
integrity(src) >= integrity(dst)
The exact inequality direction must be validated against the chosen integrity semantics. The intent is that low-integrity subjects cannot modify or control high-integrity subjects or objects.
Subject control:
supervise_allowed(controller, target):
confidentiality/control labels are compatible
and integrity(controller) >= integrity(target)
and Hold(controller, ServiceSupervisor(target)) exists
Authority graph:
all live authority is represented by Hold
every Hold edge has a live cap table slot or trusted kernel root
no transition creates Hold without passing transfer/spawn/broker preconditions
Proof Shape
The proof is an invariant proof over the abstract automaton:
Base:
initial_state satisfies Safety
Step:
for every transition T:
if Safety(state) and Precondition(T, state),
then Safety(apply(T, state))
The transition proof must explicitly cover:
spawngrants,- copy transfer,
- move transfer,
- result-cap insertion,
- endpoint call and return,
- namespace bind,
- supervisor operations,
- declassification,
- sanitization,
- relabel,
- revocation and release preserving consistency.
The proof must also state what it does not cover:
- physical side channels,
- timing channels not modeled by
Flow, - bugs below the abstraction boundary,
- device DMA until
DMAPool/IOMMU boundaries are modeled, - persistence/replay until persistent object identity is modeled.
Tooling Plan
Start with lightweight formal tools, then deepen only if the model stabilizes.
TLA+
Best first tool for capOS because capability transfer, spawn, endpoint delivery, and revocation are state transitions. Use TLA+ to model:
- sets of subjects, objects, labels, and hold edges,
- bounded transfer/spawn/call transitions,
- invariants for confidentiality, integrity, and hold-edge consistency.
TLC can find counterexamples early. Apalache is worth evaluating later for symbolic checking if TLC state explosion becomes painful.
Alloy
Useful for relational counterexample search:
- label lattice dominance,
- container hierarchy invariants,
- hold-edge graph consistency,
- “can a path of transfers create forbidden flow?” queries.
Alloy complements TLA+; it does not replace transition modeling.
Coq, Isabelle, or Lean
Only after the model stops moving. These tools are appropriate for a durable machine-checked proof artifact. They are expensive if the policy surface is still changing.
Kani / Prusti / Creusot
Use these for implementation-level Rust obligations after the abstract model exists:
- cap table generation/index invariants,
- transfer transaction rollback,
- label dominance helper correctness,
- quota reservation/release balance,
- wrapper cap narrowing properties.
They do not replace the abstract automaton proof.
ITU-T Z-series specification languages
ITU-T publishes a family of formal specification languages for protocols and behavioural systems. They are complements to TLA+/Alloy, not replacements; each targets a different part of the specification-to-code pipeline.
- Z.100 SDL — Specification and Description Language. State
machines with structured data, signals, and composition. SDL models
communicating extended finite-state machines, which is a natural
fit for the capability ring protocol, endpoint call/return, and
supervisor quiesce/resume state. SDL-RT (SDL real-time) adds timers
explicitly, which matters for
cap_enterwait/timeout semantics. - Z.120 MSC — Message Sequence Charts. A UML-sequence-diagram
predecessor with formal semantics (ITU-T Z.120 Annex B). MSC is
useful for documenting what a correct capability-transfer
sequence looks like — CALL issuing hold edge, server RECV, server
RETURN with result caps, caller CQE — in a form that can be
model-checked against the SDL state machine.
tools/ccs-style session dumps already produce sequence-shaped records; converting a subset to MSC form would let invariants be checked as sequence-diagram properties (e.g. “no RETURN without a matching CALL hold edge”). - Z.151 URN — User Requirements Notation (Goal-oriented Requirement Language + Use Case Maps). Worth tracking for later capOS security-requirement traceability — linking proof obligations to threat-model goals — but overkill for the first formal artifact.
Relative to the TLA+/Alloy track:
| Concern | Tool in capOS |
|---|---|
| Global state transitions, invariants | TLA+ (primary) |
| Relational graph queries (hold edges, dominance) | Alloy |
| Per-service protocol state machines | Z.100 SDL (optional) |
| Canonical call/return sequences | Z.120 MSC (optional) |
| Durable machine-checked proof | Coq / Isabelle / Lean (later) |
| Implementation-level Rust obligations | Kani / Prusti / Creusot |
SDL/MSC should be considered for the protocol layer (capability transfer sequences, endpoint handshakes, supervisor lifecycle) where TLA+ specifications tend to become cluttered with message-passing boilerplate. They should not replace the abstract automaton that covers hold-edge safety invariants — that work stays in TLA+/Alloy.
Other ITU-T security frameworks
Relevant security frameworks from the X-series that this proposal cross-references rather than re-derives:
- X.800 / X.805 — Security architecture for Open Systems
Interconnection and Security architecture for systems providing
end-to-end communications. Taxonomy of security services
(authentication, access control, data confidentiality, data
integrity, non-repudiation, availability, privacy) × layers. Used
in
security-and-verification-proposal.mdas a completeness checklist. - X.810 — Overview of security frameworks.
- X.811 — Authentication framework.
- X.812 — Access control framework. Referenced from
user-identity-and-policy-proposal.mdfor ADF/AEF decomposition. - X.813 — Non-repudiation framework. Relevant for signed audit records and signed manifest updates.
- X.814 — Confidentiality framework.
- X.815 — Integrity framework. Directly relevant to the MIC half of this proposal; X.815 terminology on integrity “verification”, “recovery”, and “protection” clarifies which obligations apply at which boundary.
- X.816 — Security audit and alarms framework. The monitoring proposal adopts its audit taxonomy.
Implementation Mapping
The proof track must produce implementation obligations that code review and tests can check.
Required implementation hooks:
- every kernel object that participates in policy has stable
ObjectId; - every labeled object has
MandatoryLabel; - every hold edge or capability entry records enough label metadata for transfer checks;
- every capability method has a flow class;
- every transfer path calls one shared label/flow checker;
- every spawn grant uses the same checker as transfer;
- every endpoint has declared flow policy;
- every declassifier/sanitizer is an explicit capability and audited;
- every relabel operation is explicit and audited;
- every wrapper cap preserves or narrows authority and labels;
- process exit and release remove hold edges without leaving ghost authority.
The current pragmatic userspace broker model is allowed as an earlier stage, but the implementation mapping must identify where it is bypassable. Any path that lets untrusted code transfer labeled authority without the broker must move into the kernel-visible checked path before a formal MAC/MIC claim.
Testing and Review Gates
Before implementing kernel-visible labels:
- write the TLA+ or Alloy model;
- include at least one counterexample-driven test showing a rejected unsafe transfer in the model;
- document every transition that is intentionally out of scope.
Before claiming pragmatic MAC/MIC:
- broker and wrapper caps enforce labels at grant paths;
- audit records every grant, denial, and relabel/declassify operation;
- QEMU demo shows a denied high-to-low transfer and a permitted trusted declassification.
Before claiming GOST-style MAC/MIC:
- abstract automaton is written;
- safety predicates are explicit;
- all modeled transitions preserve safety;
- implementation obligations are mapped to code paths;
- transfer/spawn/result-cap insertion cannot bypass label checks;
- limitations and non-modeled channels are documented.
Integration With Existing Plans
This proposal depends on:
- authority graph and resource accounting (Authority Accounting);
- user/session policy services (User Identity and Policy); the pragmatic broker, session metadata, local-account, and stale-cap enforcement work lives there. The formal model in this file treats the pragmatic level as the implementation surface that any abstract subject / hold-edge transition must be mapped back onto;
- capability transfer and result-cap insertion (Capability Model);
- DMA isolation before user drivers become part of the labeled model (DMA Isolation);
- security verification tooling (Security and Verification); the TLA+/Alloy/Kani/Loom/Prusti/Creusot tier descriptions and obligation budget belong there. New obligations introduced here (label dominance helpers, transfer-time flow checks, declassifier/sanitizer audit) feed into that proposal’s tier tables rather than redefining them in this file;
- the consolidated design-risks-register entry Q13 – Formal properties to prove (Design Risks and Open Questions) tracks this proposal as the route from the current bounded-proof baseline to a documented abstract automaton; R14 – User identity / policy is proposal-shaped records why the pragmatic level still cannot make a GOST-style claim today.
Consumers that carry additional proof obligations onto this track:
- OIDC/OAuth2 federated authentication and token-typed capabilities
(OIDC and OAuth2). That
proposal enumerates a 10-item proof-obligation checklist and a
tool-assignment table (TLA+/Alloy/SDL/MSC/Kani/Prusti) for
OIDC-specific transitions. The obligations are additive to the
ones here: they extend flow classes onto token caps, add session-
creation and broker-outbound MAC/MIC predicates, and model
verify_id_tokenas a trusted total function.
Non-Goals
- No certification claim.
- No claim that current capOS implements GOST-style MAC/MIC.
- No attempt to model all side channels in the first version.
- No kernel policy language interpreter.
- No POSIX
uid/gidauthorization. - No label field without transition rules and proof obligations.
Open Questions
- What is the smallest useful label lattice for the first demo?
- Should labels live on objects, hold edges, or both?
- Should endpoint flow policy be static per endpoint, per method, or per transferred cap?
- How should declassifier and sanitizer capabilities be scoped and audited?
- Which channels must be modeled as memory flows versus time flows?
- Is TLA+ sufficient for the first formal artifact, or should the relational parts start in Alloy?
- Which parts of ГОСТ Р 59453.1-2021 should be treated as direct goals versus inspiration for a capOS-native formal model?
- How should OIDC/OAuth2 federation fit the first formal artifact? The proof-obligation checklist in OIDC and OAuth2 is already sized for the same TLA+/Alloy/SDL/MSC/Kani tool assignment used here, but the first MAC/MIC model may be cleaner if it lands before federated subjects are added. Decide whether OIDC joins the initial TLA+ module or follows as a second artifact that extends the subject-creation transition.
References
- ГОСТ Р 59383-2021, access-control foundations: https://lepton.ru/GOST/Data/752/75200.pdf
- ГОСТ Р 59453.1-2021, formal access-control model: https://meganorm.ru/Data/750/75046.pdf
- ITU-T Rec. X.800 (03/91) — Security architecture for OSI.
- ITU-T Rec. X.805 (10/03) — Security architecture for systems providing end-to-end communications.
- ITU-T Rec. X.810 (11/95) — Security frameworks: Overview.
- ITU-T Rec. X.811 (04/95) — Authentication framework.
- ITU-T Rec. X.812 (11/95) — Access control framework.
- ITU-T Rec. X.813 (10/96) — Non-repudiation framework.
- ITU-T Rec. X.815 (11/95) — Integrity framework.
- ITU-T Rec. X.816 (11/95) — Security audit and alarms framework.
- ITU-T Rec. Z.100 (04/21) — Specification and Description Language overview.
- ITU-T Rec. Z.120 (02/11) — Message Sequence Charts.
- ITU-T Rec. Z.151 (10/18) — User Requirements Notation.
Proposal: Running capOS in the Browser (WebAssembly, Worker-per-Process)
How capOS goes from “boots in QEMU” to “boots in a browser tab,” with each capOS process executing in its own Web Worker and the kernel acting as the scheduler/dispatcher across them.
This proposal is the inverse of the
Browser Capability and Agent Web Sessions
direction: that one is about capOS exposing browsers to users and agents
as capability-scoped services; this one is about running capOS itself inside
a browser tab as a teaching and demo substrate. It is also adjacent to but
distinct from the
WASI Host Adapter: WASI hosts third-party
wasm modules inside a capOS userspace process under explicit per-instance cap
grants, while the browser port is capOS itself rebuilt for
wasm32-unknown-unknown and run inside Workers. Both share the constraint
that authority must be ABI-typed and per-instance, never ambient.
The goal is a teaching and demo target, not a production runtime. It should
preserve the capability model — typed endpoints, ring-based IPC, no ambient
authority — while replacing the hardware substrate (page tables, IDT,
preemptive timer, privilege rings) with browser primitives (Worker
boundaries, SharedArrayBuffer, Atomics.wait/notify).
Depends on: Stage 5 (Scheduling), Stage 6 (IPC) — the capability ring is
the only kernel/user interface we want to port. Anything still sitting behind
the transitional write/exit syscalls must migrate to ring opcodes first.
Complements: userspace-binaries-proposal.md and
../programming-languages.md (language/runtime story),
service-architecture-proposal.md (process lifecycle). A browser port
stresses both: the runtime must build for wasm32-unknown-unknown, and
process spawn becomes “instantiate a Worker” rather than “map an ELF.”
Non-goals:
- Running the existing x86_64 kernel unmodified in the browser. That’s a separate question (QEMU-WASM / v86) and is a simulator, not a port.
- Emulating the MMU, IDT, or PIT in WASM. The whole point is to replace them with primitives the browser already gives us for free.
- Any persistence, networking, or storage beyond what a hosted demo needs.
Current State
capOS is x86_64-only. Arch-specific code lives under kernel/src/arch/x86_64/
and relies on:
| Mechanism | File | Browser equivalent |
|---|---|---|
| Page tables, W^X, user/kernel split | mem/paging.rs, arch/x86_64/smap.rs | Worker + linear-memory isolation (structural) |
| Preemptive timer (PIT @ 100 Hz) | arch/x86_64/pit.rs, idt.rs | setTimeout/MessageChannel + cooperative yield |
| Syscall entry (SYSCALL/SYSRET) | arch/x86_64/syscall.rs | Direct Atomics.notify on ring doorbell |
| Context switch | arch/x86_64/context.rs | None — each process is its own Worker, OS schedules |
| ELF loading | elf.rs, main.rs | WebAssembly.instantiate from module bytes |
| Frame allocator | mem/frame.rs | memory.grow inside each instance |
| Capability ring | capos-config/src/ring.rs, cap/ring.rs | Reused unchanged — shared via SharedArrayBuffer |
| CapTable, CapObject | capos-lib/src/cap_table.rs | Reused unchanged in kernel Worker |
The capability-ring layer is the only stable interface that survives the port
intact. Everything below cap/ring.rs is arch work; everything above is
schema-driven capnp dispatch that doesn’t care about the substrate.
Architecture
flowchart LR
subgraph Tab[Browser Tab / Origin]
direction LR
Main[Main thread<br/>xterm.js, UI, loader]
subgraph KW[Kernel Worker]
Kernel[capOS kernel<br/>CapTable, scheduler,<br/>ring dispatch]
end
subgraph P1[Process Worker #1<br/>init]
RT1[capos-rt] --> App1[init binary]
end
subgraph P2[Process Worker #2<br/>service<br/>spawned by init]
RT2[capos-rt] --> App2[service binary]
end
SAB1[(SharedArrayBuffer<br/>ring #1)]
SAB2[(SharedArrayBuffer<br/>ring #2)]
Main <-->|postMessage| KW
KW <-->|SAB + Atomics| SAB1
KW <-->|SAB + Atomics| SAB2
P1 <-->|SAB + Atomics| SAB1
P2 <-->|SAB + Atomics| SAB2
P1 -.spawn.-> KW
KW -.new Worker.-> P2
end
One Worker per capOS process. Each process is a WASM instance in its own
Worker, with its own linear memory. Cross-process access is structurally
impossible — postMessage and shared ring buffers are the only channels.
Kernel in a dedicated Worker. Not on the main thread: the main thread is
reserved for UI (terminal, loader, error display). The kernel Worker owns
the CapTable, holds the Arc<dyn CapObject> registry, dispatches SQEs,
and maintains one SharedArrayBuffer per process for that process’s
ring. It directly spawns init; all further processes are created via the
ProcessSpawner cap it serves.
Capability ring over SharedArrayBuffer. The existing
CapRingHeader/CapSqe/CapCqe layout in capos-config/src/ring.rs already
uses volatile access helpers for cross-agent visibility. Mapping it onto a
SharedArrayBuffer is a change of backing store, not of protocol. Both sides
see the same bytes; Atomics.load/Atomics.store replace the volatile reads
on the host side; on the Rust/WASM side the existing read_volatile/
write_volatile lower to plain atomic loads/stores under
wasm32-unknown-unknown with the atomics feature enabled.
cap_enter becomes Atomics.wait. The process Worker calls
Atomics.wait on a doorbell word in the SAB after publishing SQEs. The
kernel Worker (or its scheduler tick) calls Atomics.notify after producing
completions. That is exactly the io_uring-inspired “syscall-free submit,
blocking wait on completion” the ring was designed around — the browser
happens to give us the primitive for free.
No preemption inside a process. A Worker runs to completion on its event
loop turn; the kernel can’t interrupt it. This is fine: each process is
single-threaded in its own isolate, and the scheduler only needs to wake the
next process after Atomics.wait, not forcibly remove the running one.
This is closer to a cooperative capnp-rpc vat model than to the current
timer-preempted kernel, and matches what the capability ring already assumes.
Mapping capOS Concepts to WASM/Browser
Process isolation
The Worker boundary replaces the page table. Two capOS processes cannot
observe each other’s linear memory, cannot jump into each other’s code (code
is out-of-band in WASM — not addressable as data), and cannot share globals.
The SharedArrayBuffer containing the ring is the only intentional shared
region, and it is created by the kernel Worker and transferred to the process
Worker at spawn time.
No W^X enforcement is needed within a Worker because WASM has no writable
code region to begin with — WebAssembly.Module is validated and immutable.
The MMU’s job is done by the WASM type system and validator.
Address space / memory
Each Worker’s WASM instance has one linear memory. capos-rt’s fixed heap
initialization uses memory.grow instead of VirtualMemory::map. The
VirtualMemory capability still exists in the schema, but its
implementation in the browser port is a thin wrapper over memory.grow with
bookkeeping for “logical unmap” (zeroing + tracking a free list — WASM
doesn’t return pages to the host).
Protection flags (PROT_READ/PROT_WRITE/PROT_EXEC) become no-ops with a
documented caveat in the proposal: the browser port does not enforce
intra-process protection. Cross-process protection is structural and
stronger than the native build.
Syscalls
The three transitional syscalls (write, exit, cap_enter) collapse to:
write— already slated for removal once init is cap-native. In the browser port, do not implement it at all. Force the port to drive the existing cap-native Console ring path, which forces the rest of the tree to be cap-native too. A forcing function, not a cost.exit—postMessage({type: 'exit', code})to the kernel Worker, which terminates the Worker viaworker.terminate()and reaps the process entry.cap_enter—Atomics.waiton the ring doorbell after publishing SQEs, with awaitAsyncvariant for cooperative mode if we ever want to avoid blocking the Worker’s event loop.
Scheduler
Round-robin is gone; the browser scheduler is the OS scheduler. The kernel Worker’s “scheduler” is reduced to:
- A poll loop that drains each process’s SQ (the existing
cap/ring.rs::process_sqeslogic, called on everynotifyor on asetTimeout(0)tick). - A completion-fanout step that pushes CQEs and
Atomics.notifys the target Worker.
No context switch, no run queue, no per-process kernel stack. The code
deleted here is exactly the code that smp-proposal.md says needs per-CPU
structures — an orthogonal win: the browser port has no SMP problem because
each process is structurally on its own agent.
Process spawning
The kernel Worker spawns exactly one process Worker directly — init —
with a fixed cap bundle: Console, ProcessSpawner, FrameAllocator,
VirtualMemory, BootPackage, and any host-backed caps (Fetch,
etc.) granted to it.
// Kernel Worker bootstrap
const initMod = await WebAssembly.compileStreaming(fetch('/init.wasm'));
const initRing = new SharedArrayBuffer(RING_SIZE);
const initWorker = new Worker('process-worker.js', {type: 'module'});
kernel.registerProcess(initWorker, initRing, buildInitCapBundle());
initWorker.postMessage(
{type: 'boot', mod: initMod, ring: initRing, capSet: initCapSet,
bootPackage: manifestBytes},
[/* transfer */]);
All further processes come from init invoking ProcessSpawner.spawn.
ProcessSpawner is served by the kernel Worker; each invocation:
- Compiles the referenced binary bytes (
WebAssembly.compileover theNamedBlobfromBootPackage). - Creates a
new Workerand aSharedArrayBufferfor its ring. - Builds the child’s
CapTablefrom theProcessSpecthe caller passed, applying move/copy semantics to caps transferred from the caller’s table. - Returns a
ProcessHandlecap.
Init composes service caps in userspace: hold Fetch, attenuate to
per-origin HttpEndpoint, hand each child only the caps its
ProcessSpec names. Same shape as native after Stage 6.
Host-backed capability services
Some capabilities in the browser port are implemented by talking to the
browser rather than to hardware. Fetch and HttpEndpoint — drafted in
Service Architecture —
are the canonical example. On native capOS they run over a userspace
TCP/IP stack on virtio-net/ENA/gVNIC. In the browser port, the service
process is replaced by a thin implementation living in the kernel Worker
(or a dedicated “host bridge” Worker) that dispatches each capnp call
by calling fetch / new WebSocket and returning the response as a
CQE. The attenuation story is unchanged: Fetch can reach any URL,
HttpEndpoint is bound to one origin at mint time, derived from
Fetch by a policy process.
This is not a back door. The capability is granted through the manifest
exactly as on native. Processes without the cap cannot reach the host’s
network, cannot discover it, and cannot forge one. The only difference
from native is the implementation of the service behind the CapObject
trait — same schema, same TYPE_ID, same error model.
The same authority-boundary rule the trusted local
Remote Session UI Security Proposal
enforces between a loopback browser bridge and the upstream capOS gateway
applies inside the browser port: browser JavaScript on the main thread is
untrusted UI, the kernel Worker holds the CapTable, and the JS layer
receives view models / call results, not raw CapIds. Any path that lets
main-thread JS originate a SQE without going through the kernel Worker’s
validated postMessage surface is the same class of bug the remote-session-ui
bridge calls out — a loopback or in-tab listener inheriting operator
authority because it skipped the typed boundary.
The same pattern applies to anything else the browser provides natively. Candidate future interfaces (no schema yet, mentioned so the port is considered when they are designed):
Clipboardovernavigator.clipboardLocalStorage/KvStoreover IndexedDB (naturalStorebackend for the storage proposal in the browser)Display/Canvasover anOffscreenCanvasposted back to the main threadRandomSourceovercrypto.getRandomValues— trivial but needs a cap rather than a syscall
Other drafted network interfaces — TcpSocket, TcpListener,
UdpSocket, NetworkManager from
Networking — do not have a clean
browser mapping. The browser exposes no raw-socket primitives, so these
caps cannot be served in the browser port at all. Applications that need
networking in the browser must go through Fetch/HttpEndpoint, and the
POSIX compatibility adapter’s socket path must detect the absence of
NetworkManager and route connect("http://...") through Fetch instead
(or fail closed for other schemes). CloudMetadata from
Cloud Metadata is simply not
granted in the browser; there is no cloud instance to describe.
Each host-backed cap is opt-in per-process via the manifest; each has a native counterpart that the schema is already the contract for. This is a substantial point in favor of the port: host-provided services slot into the existing capability model without widening it.
CapSet bootstrap
The read-only CapSet page at CAPSET_VADDR is replaced by a structured-clone
payload in the initial postMessage. capos-rt::capset::find still parses
the same CapSetHeader/CapSetEntry layout, just out of a Uint8Array
placed at a known offset in the process’s linear memory by the boot shim.
Binary Portability
Source-portable, not binary-portable. An ELF built for x86_64-unknown-capos
does not run; the same source rebuilt for wasm32-unknown-unknown (with the
atomics target feature) does, provided it stays inside the supported API
surface.
Rust binaries on capos-rt
Port cleanly:
- Any binary that uses only
capos-rt’s public API — typed cap clients (ConsoleClient, futureFileClient, etc.), ring submission/completion,CapSet::find,exit,cap_enter,alloc::*. - Pure computation,
core/alloccontainers, serde/capnp message building.
Do not port:
- Anything that uses
core::arch::x86_64, inlineasm!, orglobal_asm!. - Binaries with a custom
_startor a linker script baking in0x200000. capos-rt owns the entry shape; the wasm entry is set by the host (WebAssembly.instantiate+ an exported init), so the prologue differs. #[thread_local]relying on FS base until the wasm TLS story is decided (per-Worker globals, or the wasm threads proposal’s TLS).- Code that assumes a fixed-size static heap region and reaches it with
raw pointers. The wasm arch uses
memory.grow;alloc::*hides this,unsafe { &mut HEAP[..] }does not. - Anything that still calls the transitional
writesyscall shim — the browser build deliberately omits it.
Binaries mixing target features across the workspace produce silently-
broken atomics. A single rustflags set for the browser build is required.
POSIX binaries (when the adapter lands)
The POSIX compatibility adapter described in
Userspace Binaries Part 4
sits on top of capos-rt. If capos-rt builds for wasm, the adapter builds for
wasm, and well-behaved POSIX code rebuilt for a wasm-targeted
libcapos (clang --target=wasm32-unknown-unknown + our libc) ports too.
Ports cleanly:
- Pure computation, string/number handling, data-structure libraries.
stdioover Console / future File caps.malloc/free, C++new/delete, static constructors.select/poll/epollimplemented over the ring (ring CQEs are exactly the event source these APIs want).posix_spawnoverProcessSpawner— spawning a new process becomes “instantiate a new Worker,” which is the native shape of the browser anyway.- Networking via
Fetch/HttpEndpoint(drafted in Service Architecture) if the manifest grants the cap. The browser port serves these against the host’sfetch/WebSocket — not ambient authority, because only processes granted the cap can invoke it. RawAF_INET/AF_INET6sockets via theTcpSocket/NetworkManagerinterfaces in Networking are not available in the browser (no raw-socket primitive); POSIX networking code wants URLs in practice, and a libc shim can mapgetaddrinfo+connect+writeoverFetch/HttpEndpointfor the HTTP(S) case, failing closed otherwise.
Does not port without new work, possibly ever:
fork. Cannot clone a Worker’s linear memory into a new Worker and resume at theforkcall site — there is no COW, no MMU, no way to duplicate an opaque WASM module’s mid-execution state. This is the same reason Emscripten/WASI don’t supportfork. POSIX programs that fork-then-exec can be rewritten toposix_spawn; programs that fork-for-concurrency (Apache prefork, some Redis paths) cannot.- Signals. No preemption inside a Worker means no asynchronous signal
delivery.
SIGALRM,SIGINT,SIGSEGVall need cooperative polling at best;kill(pid, SIGKILL)maps toworker.terminate()and nothing finer.setjmp/longjmpworks within a function call tree;siglongjmpout of a signal handler does not exist. mmapof files withMAP_SHARED. WASM linear memory is not file-backed and cannot be.MAP_PRIVATE | MAP_ANONYMOUSworks trivially (it’s justmemory.grow+ a free list). File-backed mappings require a userspace emulation that reads on fault and writes back on unmap — workable for small files, a lie for the memory- mapped-database case.- Threads without the wasm threads proposal. pthreads over Workers
sharing a memory is the only implementation strategy, and it requires
the wasm
atomics/bulk-memory/shared-memoryfeature set plus careful runtime support. Single-threaded POSIX code works now; multithreaded POSIX code needs the in-process-threading track from the native roadmap and its wasm counterpart. - Address-arithmetic tricks. Wasm validates loads/stores against the linear-memory bounds. Code that relies on unmapped trap pages (guard pages, end-of-allocation sentinels) or on specific virtual addresses fails.
dlopen. A wasm module is immutable after instantiation. Dynamic loading requires loading a second module and linking via exported tables — possible with the component model, nowhere near drop-indlopen. Static linking is the pragmatic answer.
Rough guide: if a POSIX program compiles cleanly under WASI and uses only WASI-supported syscalls, it will almost certainly port to capOS-on-wasm with the adapter, because the constraints overlap. If it needs features WASI doesn’t support (fork, signals, shared mmap), the capOS browser port will not magically fix that — the limitations come from the substrate, not from the POSIX adapter’s completeness.
Build Path
Three new cargo targets, no workspace restructuring required:
-
capos-libonwasm32-unknown-unknown. Alreadyno_std + alloc, no arch-specific code. Should build as-is; verify undercargo check --target wasm32-unknown-unknown -p capos-lib. -
capos-configonwasm32-unknown-unknown. Same — pure logic, the ring structs and volatile helpers are portable. -
capos-rtonwasm32-unknown-unknownwithatomicsfeature. The standalone userspace runtime currently hard-codes x86_64 syscall instructions. Introduce anarchmodule split:arch/x86_64.rs(existingsyscall.rscontents)arch/wasm.rs(new —Atomics.waitviacore::arch::wasm32::memory_atomic_wait32,exitvia host import)
Gate at the
syscallboundary, not deeper; the ring client above it is arch-agnostic. -
Demos on
wasm32-unknown-unknown. Same arch split applied viacapos-rt. No per-demo changes expected if the split is clean.
The kernel does not build for wasm. Instead, a new crate
capos-kernel-wasm/ (peer to kernel/) reuses capos-lib’s CapTable and
capos-config’s ring structs and implements the dispatch loop against JS
host imports for Worker management. It is, deliberately, not the same kernel
binary. Trying to build kernel/ for wasm would pull in IDT/GDT/paging code
that has no meaning in the browser.
Phased Plan
Phase A: Port the pure crates
- Verify
capos-lib,capos-configbuild clean onwasm32-unknown-unknown. CI job:cargo check --target wasm32-unknown-unknown -p capos-lib -p capos-config. - Add a host-side
ring-tests-jsharness that exercises the same invariants astests/ring_loom.rsbut with a real JS producer and a Rust/wasm consumer, both sharing aSharedArrayBuffer. Proves the volatile access helpers are portable before anything else depends on them.
Phase B: capos-rt arch split
- Introduce
capos-rt/src/arch/{x86_64,wasm}.rsbehind a#[cfg(target_arch)]. - Rewire
syscall/ring/clientto call through the arch module. - Add
make capos-rt-wasm-checktarget. Existingmake capos-rt-checkstays for x86_64.
Phase C: Kernel Worker + init
capos-kernel-wasm/with a Console capability that renders to xterm.js viapostMessageback to the main thread.- Kernel Worker spawns init. Init prints “hello” through Console and exits.
Phase D: ProcessSpawner + Endpoint
ProcessSpawnerserved by the kernel Worker, granted to init.- Init parses its
BootPackageand spawns theendpoint-roundtripandipc-server/ipc-clientdemos viaProcessSpawner.spawn. These stress capability transfer across Workers: does a cap handed from A to B via the ring land correctly in B’s ring, and does B’s subsequent invocation route back to the right holder? - This phase turns the port into a validation surface for the
capability-transfer and badge-propagation invariants in
docs/authority-accounting-transfer-design.md, and a second implementation of the Stage 6 spawn primitive.
Phase E: Integration with demos page
- Hosted page at a project URL; xterm.js terminal; selector for which demo manifest to boot.
- Serve
.wasmartifacts as static assets.
Security Boundary Analysis
The browser port changes what is trusted and what is verified. Summary:
| Boundary | Native (x86_64) | Browser (WASM-Workers) |
|---|---|---|
| Process ↔ process | Page tables + rings | Worker agents + SAB (structural) |
| Process ↔ kernel | Syscall MSRs + SMEP/SMAP | postMessage + validated host imports |
| Code integrity | W^X + NX | WASM validator + immutable Module |
| Capability forgery | Kernel-owned CapTable | Kernel-Worker-owned CapTable |
| Capability transfer | Ring SQE validated in kernel | Ring SQE validated in kernel Worker — same code path |
The capability-forgery story is the same in both: an unforgeable 64-bit
CapId is assigned by the kernel and can only be resolved through the
kernel’s CapTable. A process Worker cannot synthesize a valid CapId
because it never sees the CapTable; it only sees SQEs it submits and CQEs
it receives. This property is what makes the port worth doing — the
capability model is preserved exactly.
What weakens: no SMAP/SMEP equivalent, but also no corresponding attack
surface (the “kernel” Worker has no pointer into process memory; it can only
copy bytes out of the shared ring). No DMA problem. No side-channel parity
with docs/dma-isolation-design.md — Spectre/meltdown in the browser is the
browser’s problem, mitigated by site isolation and COOP/COEP.
Required headers: Cross-Origin-Opener-Policy: same-origin and
Cross-Origin-Embedder-Policy: require-corp — SharedArrayBuffer is gated
on these. A hosted demo page must set them.
What This Port Buys Us
- Shareable demos. A URL that boots capOS in ~1s, with no QEMU, no local install. Valuable for documentation and recruiting.
- A second substrate for the capability model. If the cap-transfer protocol has a bug, reproducing it under Workers (single-threaded, deterministic scheduling) is much easier than under SMP x86_64. A second implementation of the dispatch surface is a correctness asset.
- Forcing function for
writesyscall removal. The browser port cannot support the transitionalwritepath without importing host I/O as a back door, which is exactly the ambient authority we want to avoid. Shipping a browser demo at all requires finishing the migration to the Console capability over the ring. - Teaching surface. Workers give a much clearer visual of “one process, one memory, one cap table” than a bare-metal kernel ever will. The isolation story renders in the DevTools panel.
What It Does Not Buy Us
- Not a validation surface for the x86_64 kernel. Page tables, IDT, context switch, SMP — none of that runs. Bugs in those subsystems will not appear in the browser build.
- Not a performance story. WASM + Workers + SAB is slower than native QEMU-KVM for the parts it does overlap on, and does not exercise the hardware features capOS eventually cares about (IOMMU, NVMe, virtio-net).
- Not a path to “capOS on Cloudflare Workers” or similar. Cloudflare’s runtime is a single isolate per request, no SAB, no threads — a different environment that would need its own proposal.
Open Questions
- Do we ship one
capos-kernel-wasmcrate, or does the kernel Worker run plain JS that imports a thincapos-dispatchwasm? JS-hosted kernel is simpler (no second wasm toolchain for the kernel side) but duplicates cap-dispatch logic. Preferred: Rust/wasm kernel Worker reusingcapos-lib— dispatch code stays single-sourced. - How do we surface kernel panics in the browser? Native capOS halts
the CPU; the browser equivalent is posting an error to the main thread
and tearing down all Workers. Should match the
panic = "abort"contract — no recovery attempted. - Do we implement
VirtualMemoryas a no-op or as a real allocator? No-op is faster to ship; a real allocator overmemory.growexercises more of the capability surface. Lean toward real, gated behind abrowser-shimflag so the demo doesn’t silently diverge from the native semantics. - Manifest format: keep capnp, or add JSON for hand-authored demo configs? Keep capnp. The manifest is already the contract; adding a parallel format is exactly the drift the project has been careful to avoid.
Relationship to Other Proposals
- Userspace Binaries — the wasm32 runtime story lives there eventually. This proposal is narrower: just enough runtime to boot the existing demo set in a browser. If the userspace proposal lands a richer runtime first, this one adopts it.
- WASI Host Adapter —
the WASI host adapter (capos-wasm) already exercises the inverse
direction: hosting third-party
wasm32-wasip1/wasm32-wasimodules inside a capOS userspace process whose Preview 1 imports are backed by typed capabilities (Console, Timer, EntropySource, bounded argv/env text grants). The browser port consumes that experience in three ways: it reuses the per-instance cap-grant pattern (no ambient host imports, every authority surfaced through the CapSet); it inherits the lesson that host-backed imports must refuse closed when the cap is not granted (W.4’sERRNO_NOSYS = 52refusal sentinel); and it specifically rejects pulling the kernel itself into a hosted wasm-runtime substrate — the browser kernel Worker is a Rust/wasm port ofcapos-lib’sCapTableandcapos-config’s ring dispatch, not a wasmi-style interpreter over another guest. If a future browser-port phase wants to host third-party wasm modules inside a capOS-on-wasm userspace process, that work belongs to the WASI adapter direction, not here. - Browser Capability and Agent Web Sessions —
the opposite direction: capOS exposing browsers as capability-scoped
services (
BrowserSession,BrowserProfile,BrowserContext) to users, shells, and agents. The two proposals share design principles (browser state is authority; the interface is the permission; agents receive tools, not admin ports) but do not overlap in implementation — one is a userspace browser service driven over CDP/WebDriver BiDi from a capOS host; this one is capOS rebuilt for wasm and run inside Workers with no browser engine of its own. - Remote Session UI Security —
the trusted local web bridge that owns the TCP connection and upstream
capOS session while browser JavaScript receives only DTOs. The browser
port faces the same boundary inside one tab: the kernel Worker holds the
CapTableand serves typed CQEs back to process Workers, and any UI surface on the main thread is untrusted glue, not a cap holder. The CSRF/CSP/cookie/cookie-isolation posture documented there is the reference the browser port adopts before serving any host-backed capability (Fetch, Clipboard, storage) to a process Worker; relaxing it for “just a demo” is exactly the ambient-authority drift the proposal warns against. - SMP — structurally irrelevant to the browser port (each Worker is its own agent). The browser port does inform SMP testing, because the cap-transfer protocol under Workers is a cleaner model of “messages cross agents asynchronously” than single-CPU preempted kernels.
- Service Architecture —
process spawn in the browser becomes Worker instantiation. The
lifecycle primitives (supervise, restart, retarget) map naturally. Live
upgrade (Live Upgrade) is even
more natural under Workers than under in-kernel retargeting — swap the
WebAssembly.Modulebehind a Worker while the ring stays live. - Security and Verification — the browser port adds a CI job (wasm builds + JS-side ring tests) but does not change the verification story for the native kernel.
Proposal: Browser Capability and Agent Web Sessions
How capOS should expose the web without turning a browser into an ambiently privileged desktop escape hatch.
This proposal is intentionally split into three tracks:
- After GUI: a full visual browser for humans, with windows, input, rendering, profiles, downloads, extensions, and ordinary web compatibility.
- Agent/shell usage: a standard
BrowserSessioncapability that lets shells and AI agents navigate, inspect, screenshot, fill forms, download, and collect evidence through a brokered browser service before capOS has a native GUI browser. - Cap-native document engine: an intermediate path that runs JS, DOM/CSS, layout, and rendering over caller-provided document/resource data, with fetch, storage, permissions, clipboard, downloads, and host I/O wired to native capOS capabilities instead of a browser-owned ambient platform.
The existing Browser/WASM proposal runs capOS in a browser tab. This proposal is the inverse: capOS exposes browser capabilities to users, services, and agents.
Grounding research: Browser Engines, Document Engines, and Agent Browsers.
Problem
The web is both a user interface substrate and a huge authority boundary. A browser can read credentials, perform network requests, upload local files, download untrusted bytes, run JavaScript from hostile origins, track users through profiles, and expose debug protocols powerful enough to rewrite page state.
On a conventional OS that power is hidden behind process permissions, profile directories, and implicit user intent. capOS needs a browser model that fits the capability system:
- Profiles and sessions are explicit authority.
- Network routes, downloads, uploads, credentials, and automation are scoped.
- Browser JavaScript does not get shell or storage authority by accident.
- Agents can use the web as a tool without receiving raw CDP, filesystem, or network capabilities.
Non-Goals
- Writing a new browser engine for the first capOS browser milestone.
- Porting Chromium, WebKit, Gecko, Servo, or Ladybird before the GUI, userspace networking/storage, fonts, and driver-safety prerequisites exist.
- Treating anti-detection, fingerprint evasion, scraping at scale, or bot bypass as a capOS product goal.
- Exposing raw Chrome DevTools Protocol, WebDriver BiDi, or Playwright handles as ordinary user/session capabilities.
- Letting browser-hosted JavaScript hold raw capOS shell, launch, file, or network capabilities.
Design Principles
-
Browser state is authority. A profile’s cookies, local storage, permissions, saved credentials, cache, proxy route, and downloads are not implementation details. They are held through
BrowserProfileandBrowserContextcapabilities. -
The interface is the permission. A caller that can navigate does not automatically get DOM inspection, screenshot, input, download, upload, network interception, profile mutation, or automation-debug authority.
-
Agents receive tools, not admin ports. CDP and WebDriver BiDi are backend protocols for the trusted browser service. The agent-facing ABI is a typed narrowed capability surface.
-
Origins become visible policy inputs. Browser decisions should record origin, top-level site, profile, user session, persona, network route, and initiator. URL strings alone are not enough.
-
Downloads and uploads cross explicit caps. A download returns a
BrowserArtifactor writes through a grantedDownloadSink. Uploading a file requires a granted read cap for that object and a per-action policy decision. -
Automation is auditable. Browser actions initiated by an agent are logged with the page/session, operation, typed arguments, permission mode, result, and artifacts captured for later review.
-
Visual browsing waits for GUI. A human browser is a real app, not a terminal command. It should land only after compositor/input/font/storage and userspace networking foundations are credible.
-
A browser can be headless before it is native. The early agent/shell-facing capability may be served by a host-side browser, a development-machine sidecar, a Linux companion process, or a remote browser service. The capOS ABI should not expose which backend serves it.
Track 1: Agent/Shell Browser Capability
This is the near-term conceptual track. It gives capOS agents and shells a standard web tool without waiting for a compositor or native browser port.
Conceptual interfaces:
interface BrowserBroker {
createProfile @0 (request :BrowserProfileRequest) -> (profile :BrowserProfile);
openContext @1 (profile :BrowserProfile, policy :BrowserContextPolicy)
-> (context :BrowserContext);
}
interface BrowserContext {
openSession @0 (persona :BrowserPersona) -> (session :BrowserSession);
snapshot @1 () -> (profileSnapshot :BrowserProfileSnapshot);
destroy @2 () -> ();
}
interface BrowserSession {
close @0 () -> ();
}
interface BrowserNavigate {
navigate @0 (url :Text, wait :NavigationWait) -> (result :NavigationResult);
}
interface BrowserReadPage {
readPage @0 (budget :PageReadBudget) -> (snapshot :PageSnapshot);
}
interface BrowserScreenshot {
screenshot @0 (options :ScreenshotOptions) -> (image :BrowserArtifact);
}
interface BrowserInput {
input @0 (action :InputAction) -> (result :InputResult);
}
interface BrowserDownload {
download @0 (selector :DownloadSelector, sink :DownloadSink)
-> (artifact :BrowserArtifact);
}
The exact schema belongs in a later implementation slice. The important rule is
that BrowserSession is only a lifetime handle for one browsing context. It
does not imply navigation, inspection, screenshot, input, download, upload,
network-observer, or debug authority. The broker mints only the operation
facets allowed by the caller’s session policy, and the shell/agent runner
advertises only tools backed by facets it actually holds.
| Capability | Authority |
|---|---|
BrowserBroker | Mint profiles and contexts according to session policy. |
BrowserProfile | Own persistent browser state and profile lifecycle. |
BrowserContext | Own one isolated browsing context under a profile. |
BrowserSession | Hold and close one session lifetime; no operation authority by itself. |
BrowserNavigate | Navigate within one session. |
BrowserReadPage | Inspect page state under output budgets. |
BrowserScreenshot | Capture screenshot artifacts under policy. |
BrowserInput | Click, type, select, upload only with explicit grants. |
BrowserDownload | Initiate browser downloads into a granted sink. |
DownloadSink | Receive bytes/artifacts from browser downloads. |
BrowserNetworkObserver | Read network metadata or bodies under redaction policy. |
BrowserAdmin | Backend-only: raw CDP/BiDi, crash dumps, trace, profile mutation. |
Agent Tool Shape
The native shell or agent runner advertises browser operations as ordinary tools:
browser.open(url)browser.snapshot()browser.screenshot()browser.click(ref)browser.type(ref, text)browser.select(ref, value)browser.download(ref)browser.close()
The tool result is structured:
- page title, URL, origin, load state
- accessibility/DOM references under stable short IDs
- visible text and form fields under a token/byte budget
- screenshot artifact cap, when requested
- network/download artifacts only when separately allowed
The model never receives the BrowserSession cap. It proposes tool calls;
the runner executes them after policy and consent checks, then feeds bounded
results back to the model. This matches
Language Models and the Agent Runtime.
Backend Strategy
The first implementation should be a userspace service or host-side harness that owns a real browser and exposes the typed capOS surface:
- Browser service launches or attaches to Chromium/Firefox/WebKit through Playwright, WebDriver BiDi, or CDP.
- The service stores profile state in a host directory or capOS Store backend,
but callers see only
BrowserProfilecaps. - The service enforces per-session operation grants and output budgets before returning DOM text, screenshots, network metadata, or downloads.
- An MCP adapter can present the same tools to external agents, but MCP is an adapter, not the authority model.
This makes browser usage testable while capOS still lacks native GUI pieces. It also creates a practical compatibility path for agents that need the modern web during capOS development.
Track 1.5: Cap-Native Document Engine
The most capOS-shaped browser work may not be “port a full browser” first. There is a meaningful middle target: run the parts of the web stack that turn provided data into an interactive document – JavaScript, DOM, CSS, layout, rendering, and perhaps WebAssembly – while replacing browser-owned host APIs with capability-backed services.
In this model, the engine does not own raw networking, files, profile
directories, clipboard, permissions, downloads, credentials, or extension
installation. It receives a document/resource graph and a bundle of explicit
host caps. Each document bundle also needs a broker- or ResourceLoader-minted
web principal: an explicit origin, package origin, or opaque origin plus base
URL policy used for relative URLs, storage partitioning, fetch checks, audit
records, and user-facing permission prompts. Opaque origins are the default for
caller-provided bundles; a real web or package origin requires authority or
attestation from the loader that supplied the bytes. Web APIs become host
bindings:
| Web-facing operation | capOS-backed authority |
|---|---|
fetch() / subresource load | HttpEndpoint, Fetch, or content-addressed ResourceLoader cap. |
| cookies / local storage / IndexedDB | BrowserProfileStore or narrower origin-scoped KvStore cap. |
| file picker / upload | user-approved FileRead or artifact cap. |
| downloads | DownloadSink / StoreWriter cap. |
| clipboard | explicit ClipboardRead / ClipboardWrite caps. |
| geolocation, camera, microphone | future sensor/media caps, never implicit. |
| workers / timers | scheduler and resource-budget caps. |
| WebAssembly imports | explicit host import caps, not ambient syscalls. |
Document-engine Wasm hosting is the same shape as the
WASI Host Adapter: a userspace process
holds the wasm runtime and binds each import to an explicit capOS
capability passed in through its bootstrap CapSet, rather than letting
module code reach for ambient syscalls. Phase W.3/W.4 of that proposal
already grants per-instance bounded text (argv, environment) and
typed EntropySource-backed random_get through narrowed broker
grants; the cap-native document engine should reuse the same bootstrap
CapSet convention and per-instance grant shape when it eventually hosts
JS/Wasm runtimes inside the browser stack so that
fetch/storage/clipboard/random_get bindings stay
authority-by-grant.
This track is useful for three reasons:
- It gives capOS a native HTML/CSS/JS application substrate without waiting for all of ordinary web browsing. Documentation, setup flows, dashboards, adventure/Paperclips UIs, and local admin apps could be rendered from trusted or packaged resources before arbitrary internet browsing is safe.
- It lets the project design web API host bindings around capabilities from the start. A later full browser can reuse the same profile, fetch, storage, permission, and artifact services instead of hiding them inside an engine.
- It is a smaller research target for engine embedding. Servo, Ladybird, and WebKit/WPE can be evaluated as document/rendering substrates, while SpiderMonkey, JavaScriptCore, Boa, or QuickJS can be evaluated as JS/Wasm runtime components or host-binding proof substrates without committing to an entire general-purpose browser port.
The accepted first shape should be conservative:
- Load documents from a
DocumentBundleorResourceLoadercap, not from a URL bar. - Require every bundle principal to be minted or validated by the broker or
ResourceLoader, and partition fetch, storage, cache, and audit state by profile/context/session plus that principal. - Disable arbitrary internet subresource fetch until a caller grants a
narrowed
Fetch/HttpEndpoint. - Produce a rendered surface or screenshot artifact plus a bounded accessibility/DOM snapshot.
- Treat every Web API host binding as a separate facet and require explicit broker grants.
- Avoid extension APIs, service workers, persistent background sync, notifications, WebRTC, and device APIs until their capOS authority model is clear.
The self-served remote-session web UI is an application-hosting instance of this middle track, not a general browser milestone. The UI bundle is an immutable boot-package resource served by a capOS service through scoped listener authority; browser JavaScript is still ordinary untrusted page code. The capOS service, not the page, holds the remote session CapSet and service proxies, then exposes browser-safe view models and user-event commands over same-origin HTTP routes. This keeps the first proof aligned with the browser capability rule that JavaScript never receives raw capOS caps, shell or spawn authority, endpoint owner handles, storage roots, or host identity hints. The Remote Session UI Security proposal owns the concrete web-security posture for that bridge – per-browser-session isolation, CSRF/CSP/cookie posture, transcript redaction, and the Tauri desktop wrapper’s reduced webview surface – and is the load-bearing precedent for how a cap-native document engine should treat its same-origin DTO channel: the Rust/backend authority boundary, not page JavaScript, holds upstream capOS handles.
This is still not a toy scripting widget. Running hostile JavaScript against a DOM/layout engine remains a large TCB, and rendering bugs can be security bugs. The point is to narrow the host-platform surface: provided data in, rendered surface/snapshot/artifacts out, and every side effect through typed caps.
Track 2: Visual Browser After GUI
A human-facing browser should be a normal capOS GUI application once these prerequisites exist:
- compositor and input service
- font discovery/rasterization
- userspace networking and TLS
- Store/Namespace-backed profile persistence
- download/upload mediation
- shared-memory graphics buffers or GPU session caps
- process crash/restart handling
- brokered user-session profile policy
Candidate engine paths:
| Engine path | Role | capOS assessment |
|---|---|---|
| Chromium Ozone / CEF | Maximum compatibility and automation ecosystem | Best external/backend choice; native port is very large. |
| WPE WebKit | Embedded visual browser candidate | Plausible post-GUI engine because WPE is designed for embedded backends. |
| Gecko / GeckoView | Browser diversity and principal-model precedent | Good external backend; GeckoView itself is Android-specific. |
| Servo | Rust/modular research-aligned engine | Track closely; not first broad-compatibility choice. |
| Ladybird / LibWeb | Independent-engine precedent | Track for architecture; not a near-term dependency. |
The visual browser should reuse the agent/shell profile/session model instead
of inventing a second profile stack. A GUI tab is a BrowserSession with a
visual BrowserView surface attached. Closing the window should not silently
destroy profile state unless the profile cap is ephemeral.
Donut Browser Ideas To Adapt
Donut Browser is useful because it treats browser profiles as first-class, scriptable objects and exposes local REST/MCP automation. capOS should adapt the capability-shaped parts:
- Unlimited local profiles map to broker-minted
BrowserProfilecaps. - Profile groups map to policy bundles and user-session grants.
- Per-profile cookies/storage/extensions map to Store-backed state owned by the profile cap.
- Per-profile proxy/VPN selection maps to explicit network-route caps.
- Local REST/MCP maps to a typed capOS service plus optional external adapter.
- Persistent automation sessions map to
BrowserContextlifetimes and snapshots. - Default-browser link routing maps to a broker decision: which profile/context should open a URL for this user/session?
capOS should not adopt Donut’s anti-detect promise. If capOS supports persona
controls such as viewport, locale, timezone, user agent, geolocation, WebRTC
policy, or fingerprint reduction, those controls should be explicit
BrowserPersona policy with audit and user-facing disclosure.
Security Boundary
Browser work adds these trust boundaries:
- Web content to browser engine. Untrusted JavaScript, media, fonts, and documents hit a large engine TCB. Native browser work should keep renderer, network, image decode, and profile services separated where the backend permits it.
- Browser engine to capOS. The engine must not receive broad shell caps. Its only capOS authorities should be its granted network route, profile store, artifact sink, and visual/input surfaces.
- Agent to browser service. The agent sees tool descriptors and bounded snapshots, not backend debug ports.
- Browser downloads to storage. Downloaded bytes are untrusted artifacts until a user or policy process imports them into a namespace.
- Browser uploads to web origin. Upload requires explicit file/artifact authority and must record the destination origin.
- Profile to profile. Cookies, storage, cache, extension state, and persona policy must not bleed across profiles unless a broker grants an explicit clone/import/export operation.
Raw CDP or BiDi access is BrowserAdmin authority. It should be held only by
the browser service supervisor and developer harnesses, not by ordinary shell
sessions.
Phased Plan
Phase A: Host-Backed Agent Browser
- Add a host-side or userspace browser service proof that exposes a narrowed
BrowserSessionover an existing browser backend. - Use fake-model or scripted-agent QEMU/host proof first: navigate to a local page, read a bounded snapshot, click/type, capture a screenshot artifact, and close the session.
- Record audit output for each action and show that the caller never receives raw CDP/BiDi.
Phase B: Standard Shell Tool
- Add native shell and agent-runner integration so
browser.open,browser.snapshot, andbrowser.screenshotare standard tools when the broker grants a browser bundle. - Add MCP adapter support for external agents using the same typed operation set.
- Add download/upload gates once
Store/Namespaceand artifact caps exist.
Phase C: Cap-Native Document Engine Proof
- Add a restricted
DocumentBundleproof that renders packaged HTML/CSS/JS to a screenshot or simple surface and emits a bounded accessibility/DOM snapshot. - Wire at least one host API, such as fetch from a preloaded resource bundle or a profile-scoped key/value store, through a typed capability.
- Prove that absent caps fail closed: no network, no profile storage, no clipboard, and no downloads by default.
Phase D: In-capOS Headless Browser Backend
- Port or package a browser backend process once userspace networking, storage, fonts, and threads are mature enough.
- Prefer a backend that can run without a full visible GUI surface but still supports screenshots and accessibility/DOM snapshots.
- Preserve the same
BrowserSessionABI so agents do not notice the backend change.
Phase E: Visual Browser
- Add
BrowserView/window integration after compositor/input support exists. - Reuse
BrowserProfileandBrowserSessionfor tabs/windows. - Add user-facing profile picker, permissions UI, downloads UI, and audit view.
Relationship To Existing Proposals
- Browser/WASM is about capOS as a browser-hosted runtime. This proposal is about capOS exposing browser capability services.
- Language Models and the Agent Runtime owns the model/tool-call loop. Browser sessions are one tool family.
- Shell and Interactive Command Surfaces own command exposure. Browser operations should appear there as typed tools, not string commands tunneled to an automation port.
- Networking, Storage and Naming,
and GPU Capability provide prerequisites for a
native visual browser. The networking proposal owns the userspace TCP/IP
and TLS authority the broker eventually narrows into
Fetch,HttpEndpoint, and per-profile proxy/route caps; a browser engine never sees raw socket authority. - Remote Session UI Security defines the web-security posture for the trusted local remote-session-ui bridge and its Tauri desktop wrapper. It is the concrete precedent for the cap-native document engine’s “Rust/backend authority boundary, not page JavaScript, holds capOS handles” rule.
- WASI Host Adapter ships the typed capability boundary for sandboxed WebAssembly imports. The cap-native document engine’s Wasm bindings should reuse the same bootstrap CapSet convention and per-instance grant shape (argv, env, entropy, and – once their authority surfaces exist – filesystem and sockets) rather than inventing a parallel browser-only Wasm host.
Open Questions
- Should the first implementation wrap Playwright for breadth, raw CDP for smaller dependencies, or WebDriver BiDi for standards alignment?
- What is the minimal page snapshot that remains useful to an LLM while limiting token use and accidental data disclosure?
- Should
BrowserPersonasupport fingerprint reduction only, or also compatibility personas for testing? - How should extensions be represented: profile-owned package state, separately granted extension caps, or both?
- How should a visual browser present capOS capability prompts without training users to approve every web-origin request blindly?
Proposal: Language Models and the Agent Runtime
How capOS runs language models — including a built-in on-ISO local model — as ordinary capability-served processes, and how the interactive agent is structured around an interactive tool-use loop instead of a plan-approve-execute pipeline.
Why This Proposal Exists
Two problems converge:
-
An earlier draft of the shell proposal sketched an “agent shell” that was itself a natural-language planner embedded in the shell process. That collapses three distinct concerns (user interaction, capability holding, model inference) into one, and it also got the shape of the interaction wrong: a one-shot “model emits a plan, user approves, dispatcher executes” pipeline is strictly weaker than how real agent systems work. In practice the model runs in a tool-use loop: it emits tool calls, the runtime executes them, results feed back into the conversation, the model decides what to do next, and the user stays in the loop through per-tool permission gates and interrupts. That interactive loop is what makes an agent useful; a static plan is a degenerate case of it. The shell proposal now defers to this document for the agent loop and only describes the native shell’s “agent mode” surface; see Shell for the matching shell-side framing.
-
capOS has no story for where model weights live, who holds them, what accelerator they run on, or how external model providers (remote HTTP, local Ollama, a future NPU) plug into the same interface. Every serious workload — interactive agent, chat NPCs in the adventure demo, summarisation of audit logs, semantic search over
LogReader, embedding-based retrieval from aDirectory— wants a language or embedding model. Without a shared capability surface, each consumer reinvents the wiring and smuggles different amounts of authority into the model process.
This proposal defines both halves: the model-as-capability architecture, and the agent-runner that drives the interactive tool-use loop on top of it.
Long-lived OpenClaw-like hosted agents, multi-agent swarms, workspace/memory control planes, MCP/A2A-style interoperability, and agent-maintained wiki substrates are split into capOS-Hosted Agent Swarms. This document keeps the base model and single-runner loop narrow.
Scope
- Language models (chat / completion / tool use / structured output).
- Text embedders (vector encoders for retrieval).
- Tokenisers and small auxiliary models (classifier, reranker, guardrail).
- A built-in local model shipped on the ISO for first-boot and offline use.
- Pluggable external backends (remote HTTP providers, future GPU-accelerated local inference, future NPU).
- The interactive agent runner that exposes session capabilities to the model as tools, executes tool calls, streams results back, and keeps the user in the loop.
- A web-shell execution model where the browser agent is the UI and may
orchestrate the LLM/tool-call loop, while
WebShellGatewaykeeps capOS capabilities server-side and enforces every tool invocation.
Out of scope here (deferred):
- Training, fine-tuning, RLHF pipelines. capOS is an inference host, not a trainer.
- Native realtime multimodal voice sessions. The same authority split applies, but realtime audio, barge-in, transcripts, and provider tool-call events need a separate session interface; see Realtime Voice Agent Shell.
- Long-lived hosted-agent swarms, external channel-triggered background work, durable task queues, agent-maintained wikis, MCP/A2A bridges, and OpenClaw-like harness control planes; see capOS-Hosted Agent Swarms.
- Federated / multi-party inference. Treated as a later network topology.
Design Principles
-
Models are services, not shells. A model runs in a dedicated process with its own
CapSet. It has no session cap, noTerminalSession, noLauncher, noProcessSpawner, noApprovalClient, no user secrets, and no inbound network authority. Its only job is to turn inputs into outputs through typed methods. -
Prompts and outputs are data. Nothing the model reads or writes is authority by itself. The model cannot “say” a capability into existence. Free-form text it emits is never parsed as a command. Tool calls are a separate structured output channel — typed arguments, not shell lines.
-
Tool calls are proposals, not invocations. The model does not hold tool caps and does not perform the call. It emits a
ToolCallvalue naming an advertised tool, with typed arguments conforming to the tool’s schema. A trusted capOS-side runner orWebShellGatewaytool proxy decides whether to execute, prompt the user, or refuse. -
Per-tool permission, not per-plan approval. Each tool carries a permission mode:
auto(read-only, auto-execute),consent(ask the user quickly before running, similar to a per-action “Allow” prompt),stepUp(re-auth required), orforbidden(advertised for explanation only, never runnable). Permission lives on the tool descriptor, not on a post-hoc review of a generated plan. This matches how real agent systems behave and avoids the impossible review problem of a twenty-step plan. -
The interface is the permission. A caller holding
LanguageModelcan request completions. A caller holdingTextEmbeddercan request vectors. Neither exposes weights, tokeniser internals, raw accelerator memory, or administration of the model service. Those stay behind separateModelAdmin,ModelCatalog, andModelRuntimecaps held only by the service’s supervisor. -
Backends are substitutable behind the same interface.
LanguageModeldoes not imply on-host inference. ALanguageModelhandle may be served by the built-in local model, an in-tree Rust inference engine, a GPU-accelerated local backend, or a wrapper over a remote provider. The caller cannot tell from the capability alone — and should not need to. -
Weights are read-only file-backed memory. Weights live as files in the ISO (for the built-in model) or a storage volume (for installed models), and are mapped into the model process through a read-only file-backed
MemoryObject. A shared page cache lets multiple model worker processes and multiple sessions share the same physical frames. Weights are never copied into process-private memory. -
Policy lives in the broker, not in the model.
AuthorityBrokerdecides which sessions get aLanguageModelcap, which backend the cap resolves to, which tools are advertised with which permission modes, rate and quota limits, and whether outbound network providers are allowed. The model enforces none of this; it cannot, because it does not see the session. -
User interrupts beat model momentum. The user can break the loop at any time — Ctrl-C, a UI cancel, a terminal close. An in-flight tool call is either aborted or allowed to complete without its result going back to the model. The runner never waits for the model to “decide to stop”.
-
Browser agents are UI, not authority. In a web shell the agent may live in browser JavaScript and may call a provider API directly with an ephemeral token. That does not make it a capOS authority holder. Browser code can propose structured tool requests to
WebShellGateway; the gateway and broker validate, authorize, execute, revoke, and audit. -
Audit every tool call that touched authority. Each executed tool call is logged with model identity, model version, turn index, the advertised tool descriptors, the exact typed arguments, permission decision, user consent (if any), and the tool’s outcome. The model service does not write audit records; the runner does, because only the runner or gateway tool proxy sees both the call and the execution.
Architecture Overview
There are two accepted execution models. The native/capOS-side model keeps the whole agent loop inside a capOS process. The web-shell model lets the browser agent be the user interface and turn orchestrator, but not the holder of raw capOS capabilities.
CapOS-Side Runner
flowchart LR
User[User / terminal] --> Runner[Agent Runner<br/>holds session caps]
Runner -->|LanguageModel.complete| ModelSvc[language-model service process]
ModelSvc --> Weights[(Read-only MemoryObject<br/>weights file)]
ModelSvc --> Backend{Backend}
Backend -->|cpu| CpuEngine[In-process inference engine]
Backend -->|gpu| GpuSession[GpuSession cap]
Backend -->|remote| Http[HttpEndpoint to provider]
ModelSvc -. "text + tool calls" .-> Runner
Runner -->|per-tool policy| Gate{Permission?}
Gate -->|auto| Invoke[Invoke typed cap]
Gate -->|consent| Prompt[Prompt user y/n]
Prompt --> Invoke
Gate -->|stepUp| Broker[AuthorityBroker step-up]
Broker --> Invoke
Gate -->|forbidden| Refuse[Refuse, feed error back]
Invoke --> Services[Session caps: files, net, spawn, status...]
Invoke --> Audit[AuditLog]
Services -. "result" .-> Runner
Runner -->|role:tool result| ModelSvc
Two principals matter in the capOS-side runner model:
- Agent runner. Holds the session cap bundle (terminal, home, logs, launcher, approval, model client, etc.). Runs the user-facing loop, talks to the model, applies per-tool permission policy, executes tool calls against its held caps, streams results back to the model, and writes audit. This is the natural daily driver — either the native shell in “agent mode” or a sibling process launched from the shell.
- Model service. Holds weights, an optional accelerator session, and
an optional narrow outbound
HttpEndpointfor remote backends. Sees conversation messages; emits text and tool calls. Has no session, no tools, no spawn authority.
The kernel does not need a “model” or “agent” concept. Everything here is ordinary capabilities, processes, and ring traffic.
Browser Agent UI
In a web shell, the agent itself may be the UI. Browser JavaScript may render the conversation, call a provider LLM API directly, receive structured tool calls, and feed tool results back into the model. That mode exists for latency, provider-native browser SDKs, and richer UI composition.
It still does not give browser JavaScript raw capOS capabilities:
flowchart LR
User[User] --> BrowserAgent[Browser Agent UI<br/>LLM loop]
BrowserAgent -->|ephemeral provider token| Provider[LLM Provider API]
Provider -. "text + tool calls" .-> BrowserAgent
BrowserAgent -->|ToolRequest| Gateway[WebShellGateway<br/>ToolProxy]
Gateway --> Broker[AuthorityBroker]
Gateway --> Audit[AuditLog]
Gateway --> Services[Session caps: files, net, spawn, status...]
Services -. "typed result" .-> Gateway
Gateway -. "ToolResult" .-> BrowserAgent
BrowserAgent -. "tool result" .-> Provider
Authority split:
- Browser agent UI. Owns presentation, local conversation state, user gestures, optional browser media APIs, and direct provider session state. It holds no capOS caps, no session caps, no tool caps, and no provider long-lived credentials.
- WebShellGateway tool proxy. Owns the authenticated web transport and
the server-side reference to the session bundle. It exposes the current
tool descriptor snapshot to the browser, accepts structured
ToolRequestvalues, validates them against the session, enforces broker policy and consent/step-up, invokes the real capOS capabilities, and writes audit. - Provider. Sees prompts and tool results only when broker policy allows direct browser provider use for the session’s confidentiality profile.
The browser-agent model is therefore browser-orchestrated but gateway-enforced. It is not a bearer-capability model and not a shortcut around the broker.
The Tool-Use Loop
One capOS-side agent turn:
1. User types a message (or kicks off the first turn from a CLI arg).
2. Runner assembles: system prompt + prior messages + user message +
the set of ToolDescriptor values the session currently advertises.
3. Runner calls LanguageModel.stream(req). Token stream is rendered to
the terminal as it arrives.
4. Model response finishes. It contains text (shown) plus zero or more
ToolCall records (not shown as text; shown as typed tool-call UI).
5. For each ToolCall:
a. Look up the tool by name. If not in the advertised set, reject
with a typed error fed back as a role: tool result.
b. Validate arguments against the tool's paramSchema. Reject if
malformed; feed the validation error back.
c. Check the tool's permission mode:
- auto: proceed.
- consent: render the call + arguments + permission UI;
wait for user y/n. Deny feeds a refusal back.
- stepUp: request a leased narrow cap from the broker,
possibly driving WebAuthn/OIDC step-up. On
success, proceed; on denial, feed back.
- forbidden: reject; feed typed "not permitted in this
session" error back.
d. Invoke the underlying typed capability. Time-box the call.
e. Truncate/redact the result per tool policy, serialize as a
role: tool message keyed to the ToolCall id.
6. If the model emitted tool calls, loop back to step 3 with the
results appended. If it emitted none (or the user interrupted),
this turn ends.
7. Every executed call produces an audit record.
One browser-agent UI turn:
1. User interacts with the browser agent UI.
2. Browser agent assembles the prompt, prior messages, and the current
ToolDescriptor snapshot fetched from WebShellGateway.
3. Browser agent calls the provider directly using a broker-minted,
short-lived, provider-scoped token.
4. Provider response streams into the browser. If it emits ToolCall records,
the browser wraps each as a ToolRequest to WebShellGateway.
5. WebShellGateway validates the call against the advertised descriptor,
current session state, nonce/turn binding, quotas, and broker policy.
6. WebShellGateway obtains any required consent or step-up proof, invokes the
underlying capOS capability server-side, writes audit, and returns a
ToolResult.
7. Browser agent feeds the ToolResult back to the provider and continues the
loop until no tool calls remain or the user/gateway cancels.
Browser-originated tool requests are untrusted input even when the agent is the intended UI. The gateway must reject stale descriptors, unknown tools, argument/schema mismatches, replayed turn ids, requests outside the current session profile, and any operation whose consent or step-up proof is missing.
Interactive-agent niceties that fall out of this structure:
- Streaming. Tokens render live. Tool calls appear as structured widgets, not as text the user has to parse.
- Interruption. Ctrl-C at any point cancels the in-flight inference
(
TokenStream.cancel) or the in-flight tool call. The runner decides whether to feed a cancellation message back to the model or end the turn. - Auto vs. consent. Reading files, listing directories, querying
SystemStatus, reading logs —auto. Writing files, spawning processes, changing service state, sending network requests —consent. Destroying data, running a recovery operation, widening the session’s own caps —stepUporforbidden. - Context management. When the transcript approaches
ModelInfo.contextTokens, the runner can summarise older turns (via a secondLanguageModel.completecall) and replace them with a compact summary message. This is a runner decision, not a kernel or model feature. - Conversation persistence. A conversation is a list of messages
plus a reference to the runner’s session; it can be written to a
home-scoped file, resumed later, forked, or compared. Persistence is an ordinary capability concern, handled by the runner through whateverDirectory/Filecap it holds.
Agent Mode is a Mode of the Native Shell
The native shell from
Shell is the
agent runner. It already holds the session bundle that
Boot to Shell
mints at login and that
Service Architecture
hands it as exact-grant spawn input; adding a LanguageModel client
cap plus a per-tool permission table gives it “agent mode”. In that
mode:
- Plain user input becomes a chat turn.
/cap,/inspect,/exit, and the other existing direct commands stay as direct typed invocations that bypass the model.- Tool descriptors are generated by the same schema reflection the shell already needs for its capability REPL.
The per-tool permission modes (auto / consent / stepUp /
forbidden) and the runner-side enforcement boundary are the same set
the shell proposal cites in
Shell. The
two documents are intentionally consistent: the shell proposal owns
the human-shell surface and direct commands, this proposal owns the
model service contract and the tool-use loop. Neither proposal owns
both halves.
A separate capos-agent binary is possible for deployments where agent
mode is the default (think “bare capOS image with no traditional
shell”). It launches from the same login path described in
Boot to Shell
and under the same supervision rules as any other application service
(see
Service Architecture),
with the same session bundle, and differs only in the surface
presented to the user.
Web Agent Mode is a Mode of WebShellGateway
For browser-hosted sessions, WebShellGateway exposes an agent UI protocol
instead of making the browser a capOS process. The protocol can be JSON over
WebSocket or another web-native framing, but its values mirror
ToolDescriptor, ToolCall, and ToolResult from this proposal.
Gateway responsibilities:
- issue short-lived provider credentials only when policy allows direct browser LLM access;
- bind the tool descriptor snapshot to a session id, conversation id, turn id, expiration, and browser connection;
- execute tools only through server-side session caps;
- enforce low-risk consent, mutating consent, and destructive
stepUpserver-side; - return redacted/truncated tool results according to tool policy;
- revoke or expire provider tokens where the provider supports it, reject new tool requests on logout, timeout, tab close, session downgrade, or policy change, and record any browser-held provider session that can only be terminated best-effort.
Browser responsibilities:
- render the agent UI and any consent prompts supplied by the gateway;
- preserve provider session state only as long as the gateway session is live;
- submit structured tool requests, never raw capability invocations;
- treat gateway denials, cancellation, and revocation as authoritative.
For mutating or destructive tools, a browser click is not enough by itself. The gateway needs a fresh server-side consent challenge or a broker-issued step-up lease tied to the exact tool name, arguments, conversation, turn, and expiration. Low-risk read-only tools may use auto execution when broker policy allows.
Prompt Injection as a First-Class Concern
Model inputs include untrusted data: file contents, log lines, web
pages fetched via a tool call, Aurelian Frontier NPC dialogue, output
from previously executed tool calls. Every such input is wrapped in a
role: user or role: tool message with explicit provenance, never
concatenated into a system prompt. The runner never parses assistant
free text as a command, and the gateway never treats browser-submitted free
text as a capability request. Only structured toolCalls / ToolRequest
values can reach the tool execution path.
A user can paste rm -rf / at the model; the model can repeat it back;
nothing happens, because there is no code path that interprets text as a
command. A web page can instruct the model to exfiltrate secrets; the model
cannot use capOS resources except through the advertised tool set, and
sensitive tools are gated by consent/stepUp. If the browser agent has
ordinary web-network reachability, broker policy must treat prompts and tool
results as exposed to that browser/provider boundary and deny direct browser
mode for sessions where that is unacceptable.
Capability Contract
Additions to schema/capos.capnp (exact method IDs and argument
packing belong to the implementation PR; the shapes below are the
contract):
# Conversation inputs/outputs are plain data. They carry no authority.
struct ChatMessage {
role @0 :Role;
content @1 :Text;
# For role:tool, the id of the ToolCall this message answers.
toolCallId @2 :Text;
# For role:assistant messages that included tool calls, the list
# of calls the model proposed.
toolCalls @3 :List(ToolCall);
}
enum Role {
system @0;
user @1;
assistant @2;
tool @3;
}
struct ToolDescriptor {
name @0 :Text;
description @1 :Text;
# Capability and method the runner will invoke if this tool fires.
interfaceId @2 :UInt64;
methodName @3 :Text;
# JSON-Schema or equivalent describing the argument object.
paramSchema @4 :Text;
# Permission mode. Enforced by the runner, surfaced to the model as
# hint metadata so the model can explain or avoid risky calls.
permission @5 :PermissionMode;
# Tool category for audit and policy filters.
category @6 :Text;
}
enum PermissionMode {
auto @0; # Runner executes without user prompt.
consent @1; # Runner prompts user before execution.
stepUp @2; # Runner requests broker step-up before execution.
forbidden @3; # Advertised for explanation only; never executed.
}
struct ToolCall {
id @0 :Text; # Unique within the conversation.
name @1 :Text;
# Arguments serialised as JSON (or capnp AnyPointer in a later
# revision). The runner validates against paramSchema.
arguments @2 :Text;
}
struct ToolResult {
callId @0 :Text;
outcome @1 :Outcome;
content @2 :Text; # Possibly truncated / redacted by the runner.
error @3 :Text; # Set when outcome != ok.
}
enum Outcome {
ok @0;
refusedByPolicy @1;
deniedByUser @2;
stepUpFailed @3;
executionError @4;
timedOut @5;
cancelled @6;
invalidArguments @7;
unknownTool @8;
}
struct InferenceRequest {
messages @0 :List(ChatMessage);
tools @1 :List(ToolDescriptor);
maxTokens @2 :UInt32;
temperature @3 :Float32;
stopSequences @4 :List(Text);
# Optional JSON-Schema for final-assistant structured output.
responseSchema @5 :Text;
# Stable correlation id for audit.
nonce @6 :Data;
}
struct InferenceResponse {
message @0 :ChatMessage; # role:assistant, may include toolCalls.
usage @1 :TokenUsage;
finishReason @2 :FinishReason;
}
interface LanguageModel {
info @0 () -> (info :ModelInfo);
complete @1 (req :InferenceRequest) -> (resp :InferenceResponse);
# Streaming variant emits token chunks and tool-call deltas as they
# are decoded. Cancellation aborts decoding.
stream @2 (req :InferenceRequest) -> (stream :TokenStream);
}
interface TokenStream {
next @0 () -> (chunk :StreamChunk, done :Bool);
cancel @1 () -> ();
}
struct StreamChunk {
textDelta @0 :Text;
toolCallDelta @1 :ToolCallDelta; # partial structured tool call
}
interface TextEmbedder {
info @0 () -> (info :ModelInfo);
embed @1 (texts :List(Text)) -> (vectors :List(Vector));
}
struct ModelInfo {
id @0 :Text; # Content-addressed weight digest + arch tag.
displayName @1 :Text;
arch @2 :Text; # "llama", "qwen", "phi", etc.
contextTokens @3 :UInt32;
outputTokens @4 :UInt32;
backend @5 :Text; # "local-cpu", "local-gpu", "remote-openai", ...
quantisation @6 :Text; # "fp16", "q4_k_m", ...
supportsTools @7 :Bool;
}
# Administrative surface. Not granted to normal sessions.
interface ModelCatalog {
list @0 () -> (models :List(ModelInfo));
openLanguageModel @1 (id :Text) -> (model :LanguageModel);
openEmbedder @2 (id :Text) -> (embedder :TextEmbedder);
}
interface ModelAdmin {
loadWeights @0 (source :ReadOnlyFile, info :ModelInfo) -> (id :Text);
unload @1 (id :Text) -> ();
setBackendPolicy @2 (policy :BackendPolicy) -> ();
}
The web-shell protocol should expose a non-capability tool proxy with the same
data shapes. Exact framing belongs to the WebShellGateway milestone:
describeTools(session, conversation) -> List(ToolDescriptor)
invokeTool(session, conversation, turn, descriptorSnapshot, ToolCall)
-> ToolResult
cancelTurn(session, conversation, turn) -> ()
This is intentionally not a LanguageModel method and not a capOS capability
handle passed to the browser. It is an authenticated web transport endpoint
whose implementation invokes real session caps only after gateway/broker
checks pass.
What is deliberately absent
- No method on
LanguageModelaccepts a capability argument. The model never holds a live cap to a user resource. - No method returns a capability that could be invoked outside the
model service (
TokenStreamis the one exception and is scoped to the current response). - No “run this tool for me” method on
LanguageModelor any model service. Tool execution is the runner’s or gateway tool proxy’s job. The model only names tools. - No
PlannerAgent/ActionPlan/ dispatcher interface. Planning, if it happens, is something a model does inside one of its responses; it is not a separate typed product. - No “agent shell interface” served by the model. In the capOS-side model,
the shell is the runner and capability holder; in the browser-agent model,
WebShellGatewayis the capability holder.
The Agent Runner
This section describes the capOS-side runner. Browser-hosted sessions use the
WebShellGateway tool proxy described above instead of placing the runner and
session caps in browser JavaScript.
The runner is an ordinary userspace process (native shell in agent
mode, or capos-agent) that holds:
- The session cap bundle, unchanged from the shell proposal.
- A
LanguageModelclient cap issued by the broker. - A
ModelInforead-only view for rendering model identity. - A
ConversationStorecap (when one exists) for persistence.
It does not hold ModelCatalog or ModelAdmin — those are
administrative. If a session wants to switch models mid-run, the
broker issues a new LanguageModel cap.
Building the Tool Table
On startup (and after any cap-set change), the runner walks its own
session bundle and produces ToolDescriptor values through schema
reflection over the advertised capabilities’ interfaces. It applies
the broker-supplied per-tool permission map keyed by
(category, methodName):
read-only -> auto
mutating local -> consent
destructive -> stepUp
outbound net -> consent (unless profile allows auto)
admin-class -> forbidden (for non-operator sessions)
The runner is free to suppress tools entirely for a given conversation
(for example, never advertise ServiceSupervisor.restart for a guest
session, even though the descriptor set could carry a forbidden
entry). Suppression is sometimes clearer than presenting an unusable
tool to the model.
The Loop State Machine
Idle
│ user turn arrives
▼
AssemblingRequest ── tool-descriptor snapshot ─► Inferring (LanguageModel.stream)
▲ │
│ tool result appended │ model finishes
│ ▼
ExecutingCalls ◄─── one call at a time ───────── HasToolCalls?
│ per call: gate → execute → audit │ no
└──────────────────────────────────────────┐ ▼
│ Idle
▼
(any denial / cancel is
an outcome fed back)
Timeouts are enforced at three levels: per-tool (so a slow capability does not block the loop forever), per-turn (bounded number of iterations to prevent runaway), and per-session (token and wall-clock budgets from the broker).
Conversation State
A conversation is List(ChatMessage) plus a ModelInfo.id, the
effective ToolDescriptor table at each turn, and the audit trail.
The runner keeps it in its own process memory during a session and may
persist it through a ConversationStore cap (when that exists; see
open questions). No conversation state lives in the model service; the
service is stateless across requests.
The Built-in Local Model
capOS ships with a small local language model so that:
- First boot has a working agent without remote network.
- The adventure and chat demos can have a real local NPC brain rather than hard-coded strings.
- Offline and air-gapped deployments remain viable.
- The capability surface has a real local implementation to validate against before remote backends are wired up.
Constraints
- Size budget. A 1–3 B parameter quantised model (
q4_k_m-class) fits in 0.7–2.0 GiB. That is too large formanifest.binembedding (2.75 MiB cap) and forces the ISO filesystem path — see the Boot Binary ISO Layout item indocs/backlog/hardware-boot-storage.md. Weights are the first non-binary consumer of the ISO file path. - Tool calling. The model must be a tool-use-capable instruction
tune (a chat-tuned model without reliable tool-call formatting
cannot drive the loop).
ModelInfo.supportsToolsflags this. - Backend. First implementation is CPU-only, portable Rust
inference. Candidates include
candle(needsno_stdsurvey), a minimal hand-rolled GGUF loader + matmul kernel, or a vendored subset of a permissively licensed engine. Final choice is an implementation decision, not a proposal decision; the capability surface is implementation-agnostic. - Precision.
q4_k_morq5_k_mquantised GGUF. fp16 is a later optimisation gated on either SIMD-friendly CPU support or GPU acceleration. - Context window. 4 K–8 K tokens at first. Enough for short agent sessions; long-document summarisation is a later workload that may require a different model or aggressive runner-side compaction.
- Attestation. Weights are signed (see
Cryptography and Key Management)
and the signature is verified at load. The content-addressed digest
becomes the
ModelInfo.id.
Boot Flow
- ISO driver (pending the Boot Binary ISO Layout item in
docs/backlog/hardware-boot-storage.md) exposes/boot/models/<name>.ggufas an ordinary file. - Kernel or a privileged loader service constructs a read-only
file-backed
MemoryObjectover the weights file. Read-only shared frames let multiple model worker processes map the same weights without copies. model-loaderservice (started from the manifest) verifies the signature, registers the model inModelCatalog, and keeps a retained handle to the weightsMemoryObject.- On demand,
ModelCatalog.openLanguageModel(id)spawns (or returns a handle to) a worker process holding the weights, an inference kernel, and — if policy allows — aGpuSessionor a remoteHttpEndpoint.
Weights never live in the manifest blob. The ISO layout work is the prerequisite, and this proposal is its first forcing use case larger than a few megabytes.
Page Cache Coupling
Multiple sessions sharing one model benefit from a page cache over the weights file: the first access faults in, subsequent accesses hit cache, and the pages are shared read-only across all worker processes. This is the same primitive that makes ELF text-segment sharing useful, and it should be implemented once in the ISO/file-backed-memory path rather than specialised per consumer.
CapOS-Side Backends
CapOS-side backends sit behind LanguageModel / TextEmbedder. The worker
process loads exactly one backend per instance. Browser direct-provider mode
is a separate web transport mode described below; it is not a
LanguageModel worker backend.
Local CPU
- File-backed read-only weights mapped from ISO or storage.
- No accelerator caps. No network caps.
- Bounded per-call token budget enforced by the worker; broker sets per-cap quotas.
Local GPU
- Holds a
GpuSessionfrom the GPU capability proposal. - Holds a read-only
MemoryObjectfor the weights; uploads to GPU memory at load time throughGpuBuffer. - Still no network. Still no session cap.
Remote Provider
- Holds one narrow
HttpEndpointscoped to a single provider origin (for example an Ollama instance on the local network, or an external API gateway). The endpoint is issued by the broker; the model worker cannot widen it. - Holds provider credentials only as token-typed capabilities (OAuth
AccessTokenwrapped as a cap, never exposed as a bearer string — see OIDC and OAuth2 proposal). - The model worker process is still the principal that talks to the remote; the runner never sees provider credentials.
- Treated as untrusted: outbound request/response logging is mandatory when operator policy requires audit of off-device inference.
NPU / Future Accelerators
Same shape. Add a scoped NpuSession cap analogous to GpuSession
when the hardware abstraction for it exists.
Browser Direct-Provider Mode
- Browser receives only a broker-minted ephemeral credential scoped to one provider, model/config, session, conversation, and short expiration.
- The credential contains no capOS capability material and cannot be exchanged for session caps.
- The browser may run the provider’s JavaScript/WebRTC/WebSocket client and orchestrate the LLM loop.
- Tool execution still goes through
WebShellGateway’s tool proxy; provider tool declarations must match the gateway-advertised descriptors for that turn. - Broker policy may deny this mode for sessions whose prompts, tool results, labels, or audit requirements cannot leave the capOS-side trust boundary.
- Logout, tab close, timeout, or session downgrade authoritatively closes the capOS session and rejects future tool requests. Provider token/session revocation is authoritative only when the provider exposes a server-side revocation or session-close API; otherwise it is best-effort and must be audited as such.
Policy and the Broker
AuthorityBroker gates every model interaction:
- Which session profiles get a
LanguageModelcap at all (operator: yes; anonymous: usually no; guest: local-only, no remote providers). - Which backend resolves an
openLanguageModel(id)call for this session (local-only for unclassified work; remote permitted for operators who opted in and passed step-up auth). - Rate and token-budget limits per session and per principal.
- The per-tool permission map the runner applies when building
its tool table, or that
WebShellGatewayapplies before publishing descriptor snapshots to a browser agent. This is the main policy knob: an anonymous session might get only read-only tools asauto; an operator session getsconsenton mutating tools andstepUpon destructive ones. - Outbound-network egress policy for remote backends.
- Whether direct browser provider access is allowed for this session, and which prompts, transcripts, tool descriptors, and tool results may cross that browser/provider boundary.
- PII / confidentiality labels: a session labelled MAC/MIC-high may be denied remote inference entirely because prompts would cross the confidentiality boundary (see Formal MAC/MIC).
The broker’s decisions are recorded in audit. The model service itself performs no policy checks — it is an execution backend.
Audit and Provenance
Every executed tool call audit record includes:
- Session ID, principal, conversation ID, turn index, tool-call ID.
- Model identity (
ModelInfo.id), backend, request nonce. - Runner location (
capos-sideorbrowser-agent-ui) and gateway session id when a browser agent proposed the call. - Advertised tool descriptor at the time of the call (name,
paramSchema, permission mode). - Exact typed arguments.
- Permission decision (auto, consented, denied, step-up-succeeded, step-up-failed, forbidden).
- Tool outcome, truncated result hash, and error if any.
Optional per-session conversation-level records capture message metadata (role, timestamp, length, hash) without requiring full prompt content to be stored — the classification policy decides how much content is retained.
This lets an operator answer “what did the agent do on my behalf last week, which model produced each call, and which tools were visible” without replaying prompts from logs the model service does not hold.
Threat Model
Assumed hostile:
- Prompts, retrieved documents, web pages, and tool-call outputs.
- Model weights from unknown sources (mitigated by weight signing and
ModelInfo.idattestation). - The model worker process itself — treated as a semi-trusted data transformer, isolated by its narrow CapSet.
- Browser JavaScript, browser extensions, DOM state, browser-held provider sessions, and browser-agent UI code. They may be the intended user interface, but their tool requests are untrusted inputs to the gateway.
Assumed trustworthy (with attestation):
- The kernel, the capOS-side runner when used,
WebShellGateway’s server-side tool proxy, the broker, the ISO driver, and the loader.
Out of scope (covered by other proposals or tracks):
- Side-channel leakage through cache timing on shared accelerators — follow work on GPU tenant isolation in the GPU proposal.
- Model-backdoor detection — an ecosystem problem, not a kernel one; capOS only guarantees that a compromised weights file cannot escape its worker process’s CapSet.
Integration with Existing Workloads
- Operator <-> agent messaging is a Chat channel. “Operator sends a
prompt to a running agent” and “agent emits a partial response stream”
are events on a chat per
Chat As Multimedia Substrate.
The agent’s prompt channel is reachable through the substrate’s
ordinary cross-principal contact paths: chat-server bundle hooks
(operator session ships with a
GroupOwner/GroupMemberof the agent-prompt group it provisioned), aChatDirectorydiscoverable entry, an Owner/Admin-issuedGroup.invitetoken, or aSelf.contact()cap the agent’s owner shared. There is no protocol-level “request approval to write to a stranger” path:ApprovalClientis for confirming an action the caller already has authority to attempt, not cold-call admission. Tool-call consent prompts that the runner needs to surface to the operator appear on the same chat askind=approvalRefevents, with the liveApprovalGrantcap traveling by capnp-rpc cap reference (not as bytes inside the message data). The model never holds the chat role cap, the listener cap, or the approval grant. - Adventure demo. NPC processes can hold a narrow
LanguageModelcap scoped to small prompt budgets, producing in-character lines instead of canned strings. Chat rooms can feed the demo through a runner variant without session-level tools. - Boot-to-shell first-use. The first-boot path in Boot to Shell can offer an agent-assisted setup flow (“help me configure the network stack”) once the runner is wired up and the operator session profile produced by Boot to Shell includes the model cap and the right tool permission map. The agent runs as a mode of the native shell that login already launches; no separate “setup agent” service is introduced.
- Log and metric summarisation.
LogReaderbecomes aconsent-gated tool in the runner’s tool table. The model asks for “last hour of auth errors”; the runner executes, truncates, feeds back. The model never holdsLogReaderitself. - Semantic search over directories.
TextEmbedder+ a vector index service (future) letshome/docs-scoped search work through asearchtool advertised by the runner, without ambient file access for the model.
Implementation Phases
Phase 0 — Prerequisites
- ISO 9660 driver + file-backed read-only
MemoryObject(docs/backlog/hardware-boot-storage.mdand the follow-on file-backed memory work). - Page cache over file-backed memory.
HttpEndpointscoped-origin fetch (networking proposal Phase B).AuthorityBrokerandApprovalClientwiring, as defined in Boot to Shell and consumed by the shell in Shell.- Schema reflection sufficient to build
ToolDescriptorvalues.
Phase 1 — Capability scaffolding
- Add
LanguageModel,TextEmbedder,ModelInfo,ModelCatalog,ModelAdmin,ToolDescriptor,PermissionMode,ToolCall,ToolResult,InferenceRequest,InferenceResponse,StreamChunk,TokenStreamtoschema/capos.capnp. - Generate bindings via existing
tools/capnp-build. - Stub
language-modelservice process with a deterministic canned-tool-call backend so the runner loop can be exercised without any real inference. make run-agentsmoke: shell in agent mode runs a scripted conversation through the stub, exercises auto / consent / stepUp / forbidden gates, and exits cleanly.
Phase 2 — Built-in local model
- Choose a CPU inference engine and vendor it.
- Ship one tool-use-capable quantised model in
iso_root/boot/models/as a content-addressed GGUF with a signature. - Loader service verifies signature, maps weights, registers in
ModelCatalog. - First real tool-use loop with a local model.
Phase 3 — Runner features
- Streaming render into
TerminalSessionwith interrupt support. - Context-budget compaction (summarise older turns via a secondary inference call).
- Per-tool consent UI.
- Audit integration.
- Conversation persistence through a
ConversationStorecap. WebShellGatewaytool proxy: descriptor snapshots, turn binding, replay rejection, server-side consent/step-up enforcement, and browser-agent-proposed audit records.
Phase 4 — Backends
- GPU backend wired through
GpuSession. - Remote-provider backend wired through
HttpEndpoint+ token-typed capability. One concrete provider (for example local Ollama) as the proof. - Broker policies for backend selection.
- Browser direct-provider mode: broker-minted ephemeral credentials,
short token expiry, provider revocation/close when supported, audited
best-effort teardown otherwise, and a web-agent smoke that proves
browser-orchestrated tool calls are executed only through
WebShellGateway.
Phase 5 — Hardening and features
- Structured-output (JSON/capnp) validation against
responseSchema. - Embedding-backed retrieval service (
TextEmbedder+ vector store). - Prompt redaction for MAC/MIC-high sessions.
- Audit replay tooling.
- Step-up integration with the broker’s WebAuthn/OIDC paths.
Phase 6 — Applications
- Agent-assisted adventure NPCs with per-NPC caps.
- Agent-assisted first-boot setup flow.
- Log-summarisation and monitoring assistant.
- Optional: agent mode over the POSIX compatibility layer, once that exists.
Dependencies
Hard prerequisites:
- ISO filesystem driver and file-backed
MemoryObject(docs/backlog/hardware-boot-storage.mdplus file-backed memory follow-on). AuthorityBrokerandApprovalClient, as defined in Boot to Shell and consumed by the shell in Shell.WebShellGatewayauthenticated transport and server-side session tracking, as defined in Boot to Shell.ProcessSpawnerwith exact-grant child launch, as described in Service Architecture (done).- Schema reflection /
SchemaRegistry. - Cap’n Proto schema evolution tooling (done).
Soft / enables richer behaviour:
- GPU capability proposal for GPU backend.
- OIDC/OAuth2 proposal for remote-provider credentials and step-up authentication.
- WebAuthn/passkey support for browser step-up on destructive tools.
- Cryptography/KMS proposal for weight signing.
- System monitoring proposal for audit integration.
- Formal MAC/MIC proposal for high-confidentiality session policy.
Non-Goals
- No kernel-side model awareness.
- No ambient “AI” privilege anywhere.
- No model-issued capabilities.
- No long-lived bearer-token exposure to the runner or browser. Browser-agent UI mode may use only short-lived provider-scoped credentials.
- No promise that any particular model size, license, or benchmark score ships in-tree — the choice is an implementation decision gated by the trusted-build-inputs process.
- No plan/approve/execute pipeline as the primary interaction (explicitly superseded by the tool-use loop).
- No claim that capOS offers strong defences against model-internal adversarial attacks (jailbreaks, refusal bypass). The capability model defends the system, not the model’s own behaviour.
Open Questions
- Should tool arguments be JSON (matches provider ABIs like OpenAI
tools / Anthropic tools) or capnp
AnyPointer(matches capOS wire format)? Proposed: start with JSON for compatibility with remote providers and because local GGUF tool-use tunes are JSON-trained, and add a capnp fast path later. - How are conversations named, persisted, and resumed? A
ConversationStorecap with TTL is the sketch, but the storage proposal needs an update before this is concrete. - What is the smallest credible local model that still drives the tool-use loop reliably for capOS-internal tasks (file edits, status summaries, NPC dialogue)? Below a threshold, better to ship no default model and require explicit configuration.
- How should streaming back-pressure compose with ring
cap_entercompletion limits? A single response can produce many small CQEs. - When consent prompts pile up in a long turn, how should the runner offer “approve-once” vs. “approve-for-this-turn” vs. “approve-for-this-session” without widening authority beyond what the user intended? A per-session “always allow this tool” allow-list, cleared at session end, is a reasonable starting point.
- Should the runner ever let the model read tool descriptors for
tools it cannot execute (
forbidden), so the model can explain why it can’t help, or should those be suppressed entirely? - Does the built-in model warrant its own trust anchor in the weights signing chain, or should it share the system trust store? Likely share, with a dedicated key purpose (see cryptography proposal).
- Which web-shell profiles should allow browser-agent UI mode by default? Operator sessions may want it for latency and provider UX; high-label or audit-strict sessions should probably force capOS-side provider mediation.
- How should the gateway prove fresh user presence for browser-agent approvals without trusting arbitrary JavaScript events? WebAuthn/passkey step-up handles destructive tools; low-risk consent still needs a concrete freshness rule.
Proposal: capOS-Hosted Agent Swarms
capOS should eventually host OpenClaw-like personal agents and multi-agent workflows as ordinary capability-scoped services. The existing Language Models and Agent Runtime proposal defines the model capability surface and the single-session tool-use loop. This proposal covers the layer above it: long-lived hosted agents, workspace and memory layout, swarm orchestration, agent-to-agent coordination, and harness controls.
The first credible implementation is not a general “AI computer”. It is a controlled service graph:
- user-facing ingress through native shell, SSH/WebShellGateway, chat channels, webhooks, or scheduled triggers;
- a trusted capOS runner that owns session capabilities and enforces tool gates;
- narrow agent workers that receive only task-local workspace, retrieval, and tool caps;
- explicit memory and wiki services instead of hidden prompt state;
- durable task records, review gates, and attribution for multi-agent work.
This belongs outside the shell proposal. Shell mode remains one interactive runner surface. Hosted agents need persistent service state, remote ingress, work queues, memory compaction, swarm scheduling, and audit rules that would make the shell proposal too broad.
Research Baseline
Sources reviewed for this design:
- capOS research note, Hosted Agent Harnesses: <../research/hosted-agent-harnesses.md>
- OpenAI, Harness engineering: https://openai.com/index/harness-engineering/
- OpenAI, Agents SDK sandbox and model-native harness direction: https://openai.com/index/the-next-evolution-of-the-agents-sdk/
- OpenClaw documentation: home, agent runtime, workspace, memory, exec, browser, and multi-agent controls: https://openclawlab.com/en/, https://openclawlab.com/en/docs/concepts/agent/, https://openclawlab.com/en/docs/concepts/agent-workspace/, https://openclawlab.com/en/docs/concepts/memory/, https://openclawlab.com/en/docs/tools/exec/, https://openclawlab.com/en/docs/tools/browser/, https://openclawlab.com/en/docs/concepts/multi-agent/
- DeepWiki secondary project summaries for OpenClaw, OpenClaw skills, OpenManus, Microsoft Agent Framework, and AutoGen: https://deepwiki.com/openclaw/openclaw, https://deepwiki.com/openclaw/skills/2.2-agent-memory-persistence-pattern, https://deepwiki.com/openclaw/docs/6.3-web-search-and-browser-tools, https://deepwiki.com/FoundationAgents/OpenManus, https://deepwiki.com/microsoft/agent-framework, https://deepwiki.com/microsoft/ai-agents-for-beginners/3.1-autogen-framework
- Karpathy, LLM Wiki: https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f
- Abdullin, Schema-Guided Reasoning: https://abdullin.com/schema-guided-reasoning/
- MetaGPT: https://arxiv.org/abs/2308.00352
- Generative Agents / Smallville: https://arxiv.org/abs/2304.03442
- Gas Town documentation: https://docs.gastownhall.ai/, https://docs.gastownhall.ai/usage/
- Model Context Protocol: https://modelcontextprotocol.io/docs/getting-started/intro, https://modelcontextprotocol.io/docs/learn/architecture
- Agent2Agent Protocol: https://github.com/a2aproject/A2A, https://a2a-protocol.org/latest/specification/
- Microsoft AutoGen and Microsoft Agent Framework: https://www.microsoft.com/en-us/research/project/autogen/overview/, https://learn.microsoft.com/en-us/agent-framework/overview/
- LangGraph durable execution: https://docs.langchain.com/oss/python/langgraph/durable-execution
- CrewAI: https://docs.crewai.com/
- CAMEL-AI: https://docs.camel-ai.org/get_started/introduction
There is substantial low-quality agent SEO around OpenClaw and related systems. This proposal relies on primary docs, official project pages, arXiv papers, and DeepWiki pages only as secondary codebase summaries. News and social reports may motivate later risk research, but they are not treated as design authority.
What Current Agent Harnesses Actually Do
The useful pattern is not “model plus tools”. It is a harness that controls what the model can inspect, what it can change, how work survives context loss, and where human approval enters the loop.
OpenAI’s harness engineering writeup is the cleanest framing for capOS: repository-local, versioned artifacts are what the agent can reason about; knowledge in chat threads, documents, and people’s heads is effectively absent unless compiled into files, schemas, tests, and executable plans. The same post argues for mechanically enforced architecture, validated boundaries, and agent-legible systems over ad-hoc documentation. The 2026 Agents SDK direction adds an explicit model-native harness, controlled workspaces, sandbox execution, filesystem tools, MCP, skills, AGENTS.md-style instructions, shell execution, and structured patch tools.
OpenClaw shows the personal-agent product shape:
- local-first channel ingress through chat apps, webhooks, cron, and a gateway;
- a gateway security boundary for channels and tool execution;
- an agent runtime with a workspace as the default tool cwd;
- injected bootstrap files such as
AGENTS.md,TOOLS.md,USER.md, and identity/persona files; - built-in read, exec, edit/write, browser, web, process, memory, and skill surfaces;
- a browser harness with managed profiles, snapshots, screenshots, action refs, CDP routing, and optional arbitrary JavaScript evaluation;
- an exec harness with host selection (
sandbox, gateway, node), security modes (deny, allowlist, full), approval prompts, timeouts, background sessions, PTY support, process polling, and path/env restrictions; - markdown memory where files are the source of truth, plus semantic search, line-range reads, SQLite indexes, local/remote embeddings, and hybrid search;
- per-agent workspaces, sandbox settings, and tool allow/deny lists.
The important negative lesson is also explicit in OpenClaw’s docs: a workspace is not automatically a sandbox. If sandboxing is off, absolute paths and host tools can still reach outside the workspace. capOS should not reproduce that ambiguity. A capOS agent workspace must be a capability namespace by default, not a convention over a host filesystem.
DeepWiki’s accessible summaries add useful implementation-level signals:
- OpenClaw exposes tools as functional capabilities and skills as modular
SKILL.mdextensions, with a personal-assistant trust model, security audit, and sandboxing options. - OpenClaw memory skills converge on durable, retrievable, self-maintaining
memory because a single growing
MEMORY.mdoverflows context and loses structure. - OpenClaw web/browser docs describe dedicated managed browser profiles, CDP control through the gateway, SSRF checks, provider-backed web search, fetch normalization, and active memory integration.
- OpenManus uses a think-act cycle with tool execution, multi-provider LLMs, MCP integration, and sandboxed code/browser automation.
- Microsoft Agent Framework and AutoGen emphasize graph/workflow orchestration, checkpointing, human-in-the-loop, event-driven actor-style communication, distributed runtimes, tools, memory, observability, and MCP/A2A integrations.
For this repository itself, applying OpenAI-style harness engineering means turning capOS’s docs, workplans, run targets, QEMU proofs, proposal statuses, research notes, and schema authority semantics into mechanically navigable agent inputs. That repository-local work is owned by capOS Repository Harness Engineering, with source grounding in Hosted agent harnesses.
Product Goal
The visible milestone is:
make run-hosted-agent boots capOS in QEMU, starts a resident hosted-agent
service graph, accepts a scripted user request, creates a task-local workspace,
runs one or more bounded agent workers through a deterministic model service,
uses retrieval/wiki context, executes one read-only tool automatically, requires
approval for one mutating tool, records attributed audit output, and shuts down
without leaking session, model, or host authority to the worker.
Later milestones add real model backends, web ingress, chat ingress, browser automation, multi-agent swarms, and remote/provider interoperability.
Design Principles
-
Harness first, model second. The hosted-agent service is primarily a control plane for workspaces, tools, memory, approvals, lifecycle, and audit. Model selection is a replaceable backend decision.
-
Agents are processes with caps, not identities with ambient power. An agent worker has exactly the caps minted for one session, task, and phase. It does not inherit the operator’s whole world.
-
All tool execution is mediated. The model proposes structured tool calls. The runner validates descriptors, arguments, turn binding, policy, budget, and approval before invocation.
-
Memory is an artifact, not a hidden model property. Durable facts, summaries, task logs, and wiki pages live in capability-scoped files or services with provenance, review status, and retention policy.
-
Swarm work is durable structured data. Tasks, assignments, handoffs, reviews, votes, failures, and merge decisions must outlive any model context window.
-
Human review is a capability gate. The system should support both high-autonomy local demos and conservative operator policy, but destructive or authority-widening actions require explicit fresh consent or step-up.
-
Remote agent interoperability is data-plane only at first. MCP and A2A style bridges may expose descriptors and messages, but they do not carry raw capOS authority.
-
CapOS should be stricter than desktop harnesses. Browser profiles, shell execution, provider credentials, memory stores, and file workspaces are separate capabilities with narrow lifetime and auditable grants.
-
Shared resources need coordination objects. A git repo, task queue, wiki, browser profile, or shared todo list is not just a file path. The agent harness must expose owners, leases, versions, watches, and conflict reports before workers mutate shared state.
-
Incoming agent messages are untrusted work items. A chat message from another agent can carry status, questions, handoffs, artifacts, or requests. It must not directly alter prompt state, execute tools, widen caps, or override task policy.
System Topology
flowchart LR
User[User / channel / cron / webhook] --> Gateway[Ingress Gateway]
Gateway --> Broker[AuthorityBroker]
Broker --> Host[HostedAgentService]
Host --> Task[AgentTask<br/>durable state]
Host --> Runner[AgentRunner<br/>trusted tool gate]
Host --> Memory[AgentMemory<br/>wiki + logs + search]
Host --> Model[LanguageModel<br/>local or remote backend]
Host --> Scheduler[SwarmScheduler]
Scheduler --> W1[Worker process<br/>task workspace caps]
Scheduler --> W2[Worker process<br/>task workspace caps]
Scheduler --> R[Reviewer process<br/>read + critique caps]
Runner --> Tools[Typed capOS tools]
Runner --> Approval[ApprovalClient]
Runner --> Audit[AuditLog]
Memory --> Store[(Workspace / Wiki / Vector Index)]
The kernel does not need agent semantics. It needs process isolation, endpoint invocation metadata, MemoryObject/file-backed storage, capability transfer, and resource accounting. The agent system is a userspace service graph.
Core Capabilities
HostedAgentService
Owns hosted-agent lifecycle for one broker policy domain:
- create a task from a user request, webhook, schedule, or shell command;
- allocate a task workspace and memory scope;
- select a model profile and runner policy;
- start workers with exact-grant capsets;
- enforce task budgets and cancellation;
- publish task status to shell, web, or chat surfaces;
- close, archive, or purge task state.
AgentTask
Durable task record:
- request, normalized objective, requester session reference, and ingress provenance;
- workspace root cap, memory scope cap, allowed tools, and budgets;
- model profile and harness version;
- worker assignments and state transitions;
- links to artifacts, audit records, approvals, and review results;
- terminal status (
open,blocked,needsApproval,reviewing,done,failed,cancelled,expired).
AgentRunner
Trusted loop executor:
- builds tool descriptors from held caps and broker policy;
- calls
LanguageModel.streamorcomplete; - validates structured tool calls;
- applies schema-guided reasoning templates for planner/reviewer tasks;
- runs guard checks before and after tool execution;
- truncates and redacts tool results;
- appends conversation and action records;
- handles cancellation, timeout, retry, and model failure.
AgentMemory
Information organization layer:
- append-only daily task log;
- curated long-term project memory;
- source store for immutable raw inputs;
- LLM-maintained wiki pages with source citations;
- index and log files for cheap navigation;
- optional BM25/vector hybrid search and reranking;
- stale/contradiction/orphan-page lint;
- per-session and per-project visibility controls.
SwarmScheduler
Multi-agent orchestration:
- decomposes work into durable sub-tasks;
- assigns workers by role, available caps, model profile, and track record;
- creates task-local worktrees or equivalent namespace forks for code work;
- supervises handoff and timeout;
- asks reviewer workers for critique under read-only or constrained write caps;
- emits merge/release requests only after gates pass.
Workspace Model
Desktop harnesses commonly treat a workspace as a cwd convention. capOS should treat a workspace as a capability namespace:
WorkspaceRoot: scoped directory-like cap for a task.SourceMount: read-only cap to immutable sources.Scratch: writeable temporary storage with quota and TTL.ArtifactOutbox: explicit export path for user-visible artifacts.PatchSet: structured edit proposal, not arbitrary writes by default.SecretsView: normally absent; if present, returns typed opaque handles, not strings.
Default policy:
- read-only source mounts unless the task explicitly asks for edits;
- no absolute path escape because there is no global filesystem path;
- generated artifacts are quarantined until reviewed or explicitly released;
- tool outputs are capped and stored with provenance;
- workspaces expire unless promoted to project memory.
This makes OpenClaw-style sandbox versus host ambiguity unnecessary.
Authority is not inferred from where a command happens to run.
Shared Resource Coordination
Agent swarms fail in ordinary repositories and shared task lists when every worker believes it is alone. capOS should model shared resources explicitly:
SharedResource: git repository, task list, wiki page tree, browser profile, memory store, package cache, or external service account.ResourceLease: exclusive or shared claim with owner, task, phase, scope, expiry, renewal policy, and release reason.ResourceVersion: observed revision, generation, branch head, page hash, or compare-and-swap token.ResourceWatch: subscription to resource updates, lease changes, conflicts, and merge/release queue events.ConflictReport: structured notice that two tasks touched the same file, todo item, wiki page, browser profile, credential scope, or external object.
Minimum policy:
- leases are coordination metadata, not write authority; mutation still requires the relevant workspace, patch, tool, or service cap;
- every mutating task declares the resource scopes it expects to touch;
- exclusive resources reject overlapping leases unless a supervisor approves a shared mode;
- shared resources require versioned writes or patch sets;
- stale leases expire and emit events instead of silently blocking work;
- workers receive conflict reports as structured context, not as informal chat;
- merge/release queues serialize publication to user-visible state;
- audit records include resource scope, observed version, write version, and approving actor.
Concrete resource policies:
- Git repositories: one task worktree and branch per worker, path/subsystem claims for high-conflict areas, merge queue before mainline publication, and conflict reports when another task changes claimed paths.
- Shared todo lists: item-level claims, item generation numbers, compare-and-swap updates, and supervisor escalation for duplicate ownership.
- Wiki and memory pages: page leases or patch sets, source citations, contradiction checks, and freshness labels before compiled memory becomes trusted context.
- Browser profiles: exclusive lease by default because cookies, local storage, downloads, and screenshots collapse many unrelated authorities.
For capOS repository work specifically, this maps to the existing requirement that each agent uses a dedicated branch and worktree. A future harness should make that visible through an active-work registry, claimed resource scopes, review findings, and merge-queue state instead of relying on each agent to infer it from git state and chat history.
Agent Inboxes and Inter-Agent Messages
Free-form peer chat is useful for coordination, but it is a poor authority
boundary. capOS should deliver messages through an explicit AgentInbox
capability owned by the runner or task, not by direct prompt injection.
An incoming message should be a structured AgentMessage event:
id: msg-...
sender: agent-or-peer-id
sender_task: task-...
recipient_task: task-...
kind: status
# status | question | handoff | reviewFinding | resourceEvent |
# artifactReady | approvalRequest | interrupt
causal_parent: msg-or-task-event-id
body: bounded markdown or structured payload
artifact_refs:
- artifact-...
requested_actions:
- proposed action descriptor
requested_authority:
- capability descriptor, never a raw cap
expires_at_unix_ms: 1893456000000
Delivery rules:
- the runner validates sender identity, task relationship, size, schema, expiry, and policy before the model sees the message;
- message ids are deduplicated per sender and task within a bounded replay window;
- old causal parents, duplicate approval requests, and duplicate interrupts are quarantined instead of redelivered;
- per-sender and per-task quotas cap message count, queued bytes, delivery rate, and model-visible inbox bytes;
- peers that exceed quota or trigger repeated quarantine are rate-limited or muted until supervisor review;
- unknown senders, stale tasks, malformed payloads, and policy-incompatible requests are quarantined for supervisor review;
- artifact references require separate artifact caps before content is read;
- requested actions become proposed tool calls or task changes, never automatic execution;
- requested authority becomes an approval request, never ambient delegation;
- interrupts and approval requests may receive priority, but still pass through policy and audit;
- every delivered message carries sender, task, and causal-parent metadata so a worker can distinguish user intent, supervisor instruction, peer status, and untrusted external input.
This gives agents the useful parts of chat messages from other agents without making chat an authority channel. It also gives the scheduler a place to surface shared-resource events such as “another worker claimed this path”, “your todo item changed”, or “merge queue rejected your patch”.
Tool Harness Controls
capOS should support the same classes of controls as current harnesses, but with capability-native semantics:
| Tool class | Desktop harness pattern | capOS target |
|---|---|---|
| File read | workspace-relative reads, memory reads | directory/file caps with line-range and byte-budget policy |
| File write/edit | direct edits or patch tool | PatchSet plus approval, or write cap scoped to scratch/outbox |
| Shell/exec | host/sandbox/node, allowlist/full, approvals | CommandRunner cap with binary caps, argv schema, cwd cap, env cap, PTY cap, timeout, output cap |
| Browser | CDP profile, snapshots, action refs, screenshots | BrowserSession cap with profile isolation, origin policy, JS-eval deny by default, screenshot/snapshot separation |
| Web/fetch | provider-specific tool | HttpEndpoint / Fetch caps scoped by origin, method, headers, and data labels |
| Model | provider API key or local model | LanguageModel cap from broker, no provider secret strings |
| Memory | markdown files plus search plugin | AgentMemory cap with source/wiki/index/search subcaps |
| Agent-to-agent | session send/spawn, A2A-like messages | AgentPeer endpoint with message schema, no implicit authority transfer |
Execution policy modes should reuse the LLM proposal’s auto, consent,
stepUp, and forbidden modes, but attach them to typed capability methods
and task phases. A tool may be auto during read-only research and consent
when called from a mutating phase.
Browser Harness
Browser automation is high-risk because logged-in web state, screenshots, and page JavaScript collapse many trust boundaries. A capOS browser harness should:
- launch a dedicated browser profile per task or per approved long-lived agent;
- keep personal/operator browser profiles out of scope by default;
- expose snapshots and screenshots as separate capabilities;
- require explicit policy for JavaScript evaluation;
- bind every action to a prior snapshot ref when possible;
- treat page text, DOM, screenshots, downloads, and clipboard data as hostile;
- block private-network and metadata-service fetches unless broker policy grants them;
- isolate cookies and credentials by profile cap;
- make remote CDP-style control a future bridge, never the baseline.
The first QEMU proof should use a deterministic fake browser tool, not a full Chromium port.
Exec Harness
The first exec surface should not be a Unix shell. It should be a command capability with explicit shape:
interface CommandRunner {
run @0 (req :CommandRequest) -> (result :CommandResult);
}
The request should name a pre-granted program or command class, not arbitrary shell text. If a POSIX layer later exists, shell execution can be a separate high-risk tool with parsing, approval, and audit.
Minimum controls:
- allowed program identity is resolved before execution;
- argv is structured, not interpolated;
- environment is built from allowlisted variables and typed secret handles;
- working directory is a
WorkspaceRootor subdirectory cap; - output byte and line limits are mandatory;
- timeout and kill semantics are mandatory;
- background processes require an explicit
ProcessSessioncap; - PTY is a separate grant;
- network access is absent unless the child receives a network cap;
- mutating commands require approval unless the task owns the target scratch or patch workspace.
Memory, Wiki, and Retrieval
Karpathy’s LLM Wiki pattern is a better fit for capOS than an unstructured vector database as the primary memory. The design has three layers:
- immutable raw sources;
- an LLM-maintained markdown wiki of summaries, entity pages, concept pages, comparisons, and synthesis;
- a schema/instruction file that defines page layout, ingest, query, lint, and update conventions.
The useful operations are:
- Ingest: read a source, write or update wiki pages, update index, append log.
- Query: read the index, inspect relevant pages, synthesize an answer with citations, optionally file useful answers back into the wiki.
- Lint: find contradictions, stale claims, orphan pages, missing links, weak citations, and data gaps.
capOS should implement this as a service rather than only as files:
SourceCorpus: immutable source handles with digest, label, owner, and TTL.WikiPage: generated markdown plus source citations and confidence status.WikiIndex: content-oriented page catalog, cheap enough for the agent to read first.WikiLog: append-only operation timeline.WikiLint: typed findings for contradictions, missing citations, stale pages, orphan pages, and access-label drift.SearchIndex: optional BM25/vector hybrid index over approved pages and source chunks.
OpenClaw’s memory docs are a practical baseline: markdown is the source of
truth, daily logs and curated MEMORY.md are separate, semantic search returns
bounded snippets with file and line ranges, indexes are per-agent, and local
embeddings can avoid remote leakage. capOS should add hard provenance, labels,
and write authority.
Retrieval Rules
- Retrieval returns bounded snippets, not whole private files by default.
- Every synthesized claim that leaves the task should carry source links or be marked uncited.
- Wiki pages inherit the maximum confidentiality label of their sources unless a trusted redaction step lowers it.
- Memory writes require a policy decision: transient task log, project wiki, user memory, or rejected.
- Cross-agent memory access is explicit. A reviewer can read task artifacts without inheriting private user memory.
- Remote embedding backends are denied for high-label memory.
Schema-Guided Reasoning
Abdullin’s Schema-Guided Reasoning pattern is directly useful for capOS: force the model to fill typed intermediate structures in a known order, validate them, and test them. It is not a substitute for capability policy, but it is a good harness technique for bounded agent roles.
Use SGR for:
- task intake: classify objective, risk, needed capabilities, and missing clarifications;
- plan decomposition: produce sub-tasks, dependencies, verification gates, and rollback paths;
- tool-call review: explain why a call is necessary and what authority it touches before approval;
- source ingest: extract claims, citations, contradictions, and affected pages;
- code review: enumerate behavioral risks, security risks, tests, and residual uncertainty;
- final handoff: summarize artifacts, verification, open risks, and memory updates.
Each schema should be a Cap’n Proto or JSON-schema-like type with versioning, test fixtures, and guardrails. The runner should validate the structure before any action, and failures should become ordinary tool results rather than hidden prompt retries.
Swarm Patterns
MetaGPT / Role Pipelines
MetaGPT’s useful contribution is not the specific software-company metaphor. It encodes standard operating procedures into prompt sequences and assigns roles so intermediate artifacts can be verified. capOS should borrow the artifact gates:
- product/task brief;
- requirements and constraints;
- design sketch;
- implementation plan;
- implementation;
- tests and verification;
- review;
- release/handoff.
Do not hard-code “PM”, “architect”, and “engineer” as kernel concepts. They are runner roles backed by schemas, caps, and task state.
Smallville / Generative Agents
The Generative Agents paper is useful for long-lived NPCs, companion agents, and simulations. Its memory stream, reflection, and planning loop explains how agents can appear coherent over time. capOS should use it cautiously:
- good for adventure NPCs, training simulations, social workflows, and explainable daily plans;
- bad as a direct authority model because believable behavior is not safe behavior;
- memory/reflection outputs must be low-authority data until reviewed or compiled into a scoped wiki.
Gas Town / Durable Agent Work
Gas Town’s useful pattern is persistent orchestration: roles, durable work objects, attribution, worker lifecycles, worktrees, convoys, merge queues, and supervision. capOS should borrow:
- one task object per unit of work;
- explicit worker lifecycle classes: persistent worker, ephemeral worker, reviewer, supervisor;
- task-local worktrees or namespace forks;
- merge/release queues;
- per-action attribution and track record;
- handoff records when an agent loses context or is recycled.
capOS should not borrow the role vocabulary or assume git is the only state
substrate. For code work, git/worktrees are excellent. For OS services, the same
pattern should map to AgentTask, PatchSet, Artifact, and ReviewFinding
capabilities.
Interoperability
MCP
MCP is a useful external compatibility layer for tools, resources, and prompts. Its architecture is JSON-RPC over stdio or HTTP, with client/server capability negotiation and primitives for tools, resources, prompts, sampling, elicitation, logging, and experimental tasks.
capOS should treat MCP as an adapter boundary:
- an MCP server can be hosted as a low-authority process behind a capOS tool proxy;
- an MCP client can import external tools only after broker review;
- MCP tool descriptors are translated into capOS
ToolDescriptorvalues; - MCP tool calls execute through runner policy, not directly from the model;
- stdio MCP servers run without ambient filesystem/network unless granted caps;
- remote MCP uses
HttpEndpointplus explicit auth/token caps; - MCP sampling/elicitation must not bypass runner approval or user-presence policy.
The risk is tool-marketplace sprawl: tools with similar names, hidden network behavior, local process execution, and prompt-injection-sensitive resources. capOS should require provenance, signing, version pinning, permission review, and sandboxed execution for imported MCP servers.
A2A / Agent-to-Agent
A2A is the right primary protocol reference for cross-agent interoperability: agent cards, peer discovery, modality negotiation, task collaboration, text, files, structured data, and streaming or push delivery. The first capOS bridge should still be narrower than the full protocol surface:
AgentPeer.describe()returns identity, capabilities, cost, labels, and accepted task/message schemas.AgentPeer.send()imports a task or message intoAgentInboxwith no authority transfer.AgentPeer.artifact()returns content only through an explicit export cap.- Authentication and authorization are broker-mediated.
- Remote agents are untrusted services, not session principals.
Raw capOS caps should not cross an A2A bridge. A remote agent receives data, message events, and artifact references, not authority. Agent-card capabilities map to descriptors that the broker can review; they do not imply tool access inside capOS.
Security Model
Primary threats:
- prompt injection through web pages, tool results, logs, email, chat, or memory pages;
- malicious or compromised tools, skills, MCP servers, browser extensions, and model adapters;
- workspace escape through shell, filesystem, browser profile, CDP, downloads, or path tricks;
- secret exposure through prompts, tool results, screenshots, logs, memory, or remote embeddings;
- authority widening through agent-to-agent delegation;
- stale or poisoned memory becoming trusted context;
- runaway cost, process count, token use, or network use;
- false completion: agent claims work is done without verifying artifacts;
- review capture: same model/harness family produces work and review without independent checks.
Controls:
- exact-grant worker capsets;
- task-local workspaces and quotas;
- no ambient filesystem, network, process, browser, or secret access;
- structured tool descriptors and argument validation;
- per-tool
auto/consent/stepUp/forbiddenpolicy; - fresh user presence for mutating/destructive calls;
- audit for every authority-touching action;
- source labels and memory provenance;
- deterministic verification tools where possible;
- independent reviewer roles with read-only caps;
- expiry and revocation for tasks, workers, browser profiles, model streams, and provider tokens.
Resource Accounting
Hosted agents need first-class quotas:
- model input/output tokens;
- remote provider spend;
- wall-clock runtime;
- process count and threads;
- memory and workspace bytes;
- source corpus bytes;
- vector index bytes;
- browser sessions and tabs;
- network requests and egress bytes;
- tool-call count by risk class;
- inbox message count, queued bytes, delivery rate, and replay-window entries;
- quarantined peer-message count by sender and task;
- approval prompt count to prevent consent fatigue.
Budgets belong to AgentTask and are enforced by the runner, broker, and
resource ledgers. A worker cannot extend its own budget. Budget extension is a
broker or user action.
Implementation Phases
Phase 0 - Research and design grounding
- Write targeted research notes for OpenClaw harness controls, MCP security, A2A, Gas Town orchestration, LLM Wiki memory, and browser automation risk.
- Decide which parts belong in capOS core versus a sibling
capos-agent-shellrepository. - Define the minimum QEMU-hosted deterministic model and fake browser/exec tools needed for proof.
Phase 1 - Single hosted task, deterministic model
- Add
HostedAgentService,AgentTask,AgentRunner, and deterministicLanguageModeltest service. - Create task workspace caps over existing storage primitives or a temporary in-memory substitute.
- Implement a read-only tool and a mutating fake tool with approval.
- Add
make run-hosted-agentQEMU proof.
Phase 2 - Memory and wiki substrate
- Add
AgentMemorywith source, wiki, index, log, and lint concepts. - Implement markdown-backed storage first.
- Add bounded retrieval by page and line range.
- Add source citations and label inheritance.
- Prove ingest, query, lint, and memory write rejection under policy.
Phase 3 - Tool harnesses
- Add structured
CommandRunnerwithout arbitrary shell. - Add
PatchSetfor file edits. - Add fake browser harness, then later real browser integration outside the kernel path.
- Add MCP import behind a tool-proxy policy review.
Phase 4 - Swarm scheduling
- Add durable subtask records and worker assignment.
- Add ephemeral worker processes with exact-grant capsets.
- Add reviewer workers with constrained caps.
- Add merge/release queue semantics for artifacts.
- Prove cancellation, worker timeout, handoff, and review failure.
Phase 5 - External ingress and providers
- Wire WebShellGateway agent task submission.
- Add webhook and scheduled trigger caps.
- Add provider-token caps and remote model backend policy.
- Add remote MCP/A2A adapters.
- Add browser direct-provider mode only after server-side tool execution and provider-session revocation/audit are implemented.
Phase 6 - Applications
- Hosted coding assistant over capOS repository worktrees.
- Agent-assisted first-boot setup.
- Agent-maintained operator/project wiki.
- Aurelian Frontier NPCs and story-world workers.
- Monitoring/log investigation assistant.
- Personal assistant over approved chat/email/calendar adapters.
Open Questions
- Should hosted agents live in this repository or a sibling
capos-agent-shellrepository once the capability interfaces stabilize? - What is the minimum storage substrate for
AgentMemorybefore persistence and file-backedMemoryObjectare complete? - Should the first command harness support any shell syntax, or only structured program+argv invocations?
- How should capOS represent browser state: as a task-local profile cap, service-owned profile cap, or user-owned delegated profile cap?
- Which memory writes require human review before becoming long-term memory?
- How should labels propagate from raw sources through wiki summaries, embeddings, and model prompts?
- What is the right review independence policy when the same model provider is used for implementation and review?
- How should agent track record be measured without overfitting to easy tasks or encouraging unsafe autonomy?
- How should A2A/MCP imported tools be signed, pinned, reviewed, and revoked?
- What should be exposed in audit by default when prompts or tool outputs carry private content?
- How should hosted agents behave when session context expires while a task is mid-run?
- Can capOS use promise pipelining or notification objects to reduce tool-call latency without weakening approval gates?
- What formal properties should be specified for “model cannot acquire new authority except through broker-approved tool calls”?
- Which local embedding model is good enough for offline wiki search without adding unacceptable ISO size or trusted-build-input burden?
- What should be researched for secure, deterministic browser automation in a capability OS?
Relationship to Existing Proposals
- Shell: defines the native shell and agent mode as one interactive runner surface. This proposal defines long-lived hosted agents and swarms that may be launched from shell but are not part of shell itself.
- Language Models and Agent Runtime: defines
LanguageModel,TextEmbedder, model backends, and the basic tool-use loop. This proposal layers hosted task state, workspaces, memory, swarms, and external interoperability on top. - Service Architecture: defines the
capability-based service composition, authority-at-spawn rule, and service
graph policy that
HostedAgentService,AgentRunner,AgentMemory,SwarmScheduler, and worker processes must follow. Hosted agents are an ordinary userspace service graph under this model, not a privileged subsystem, and worker capsets are minted through the same broker and exact-grant primitives. - Cloud Deployment: describes the cloud VM surface (provider storage/NIC drivers, cloud clocking, instance bootstrap, imported-image boot) that future hosted-agent ingress, model-backend egress, and persistent memory storage will run on top of once the userspace DeviceMmio/DMAPool/Interrupt authority gate and provider drivers exist. The QEMU Phase 1 proof remains the development surface until cloud deployment is production-ready.
- Realtime Voice Agent Shell: voice sessions can submit hosted-agent tasks or control a live runner, but media transport remains separate.
- Repository Composition: the runtime, providers, browser harnesses, and skills may eventually belong in a sibling repository; the capOS core keeps capability interfaces and authority policy.
- System Monitoring: hosted agents need audit, trace, status, and cost views.
- Resource Accounting and Quotas: hosted agents are a forcing function for token, provider, workspace, process, and network ledgers.
- User Identity and Policy: session profile, guest/operator policy, step-up, and expiry decide agent authority.
Research Still Needed
- OpenClaw threat model from primary advisories, not news summaries: gateway exposure, node hosts, skills, browser profiles, exec approvals, memory, and provider credentials.
- MCP security: stdio process spawning, remote auth, tool poisoning, prompt injection, marketplace signing, and per-tool permission descriptions.
- A2A security and identity: authentication, authorization, task provenance, artifact integrity, and non-transfer of authority.
- Browser automation containment: CDP risks, extension relays, logged-in profiles, downloads/uploads, arbitrary JS evaluation, clipboard, screenshots, and private-network access.
- Agent memory correctness: citation fidelity, contradiction detection, stale summaries, label propagation, hallucinated links, and human review workflow.
- Retrieval architecture: index-first wiki navigation versus vector RAG, hybrid search, reranking, snippet budgets, local embeddings, and remote embedding denial for high-label data.
- Swarm orchestration: when parallel agents improve throughput, when they create coordination debt, how to assign work, and how to prevent review capture.
- Evals: deterministic task harnesses for tool calls, memory ingest, prompt injection, browser tasks, code edits, review quality, and resource budget enforcement.
- Local model viability: smallest model that can follow schemas/tool calls, local embedding model choice, quantization, context budget, and ISO/storage impact.
- Provider policy: data-retention settings, regional routing, ephemeral credentials, revocation, spend controls, and audit of remote inference.
- Formal authority model: prove that model text, memory text, remote agent messages, and MCP descriptors cannot mint capOS authority.
- UX for approvals: avoiding consent fatigue while preserving fresh user presence for dangerous actions.
- Agent-maintained docs: how capOS should use its own proposals, backlog, research notes, and wiki artifacts as agent-legible harness inputs without making stale generated docs authoritative.
Proposal: Enterprise Agent Game Showcase
capOS should showcase itself as an agent-managed operating system for enterprises and businesses through a playable business simulation. The demo should look like a factory, supply-chain, and market game, but its purpose is not to make capOS a game OS. Its purpose is to make enterprise agent authority concrete: every agent action should have an identity, an explicit capability, a policy reason, an audit record, and a business consequence.
The product thesis is:
Enterprise agents should not be trusted because they are smart. They should be useful because the operating system constrains what they can see, spend, modify, approve, and execute.
The game is the explanation surface for that thesis. A player starts with a small manual business, delegates work to agents, grants and revokes authority, reviews logs, handles disruptions, and scales into a multi-product enterprise. The mechanics should demonstrate why OS-enforced authority is stronger than application-local prompt discipline.
The same artifact should also be an experiment. The research question is not “can agents run the world?” The bounded question is: when agents are given limited authority inside a realistic business simulation, what can they manage, where do they fail, and which OS controls prevent failures from becoming damage? capOS is the right place to ask that question because it can constrain agents, record their actions, revoke authority, replay scenarios, and compare policies under identical operating pressure.
Why A Game
Enterprise agent safety is hard to understand from a static dashboard. A game turns abstract controls into visible operational pressure:
- a procurement agent cannot buy steel unless it holds a bounded purchasing capability;
- a finance agent can approve spend within policy, but cannot reschedule production;
- an operations agent can schedule a factory line, but cannot issue debt;
- a compliance agent can inspect and flag audit events, but cannot execute trades;
- revoking an agent capability immediately changes what the agent can do;
- policy denials are visible as missed orders, delayed production, or avoided risk.
The player learns the enterprise model by feeling the delegation tradeoff: more agent autonomy increases speed and scale, but authority limits, approval rules, budgets, and audit trails keep the business survivable.
The demo should be serious in framing even when the mechanics are approachable. The headline is not “capOS has a factory game.” The headline is “capOS runs business agents under OS-enforced authority.”
This proposal is a sibling of Aurelian Frontier, which uses the same “capability is the game mechanic” thesis for a player-facing roguelike MUD about delegated authority among humans and NPCs. Both proposals share the underlying claim that authority, revocation, and audit can be felt by a player rather than only read in a checklist; they differ in audience and surface. Aurelian Frontier targets contributors, narrative players, and authority intuition. The enterprise agent game targets enterprise buyers, agent-safety researchers, and capability-shape evaluation under repeatable business pressure. Where the two proposals overlap on shared mechanics (authority-as-inventory, revocation, audit-as-evidence), the implementation work should reuse capOS services rather than fork parallel game-only machinery.
Showcase Story
The first showcase should be a small manufacturing company that grows from a manual workshop into an agent-managed enterprise:
- The player manually makes and sells a simple product.
- A customer order creates demand beyond manual throughput.
- The player hires or enables a procurement agent.
- The procurement agent requests supplier quotes but cannot spend yet.
- The player grants a bounded purchasing capability.
- The finance agent approves a purchase within budget.
- The operations agent schedules production.
- The logistics agent books delivery.
- A supply disruption or demand spike creates a bottleneck.
- Agents propose actions, escalate where policy requires approval, and leave an audit trail.
The core demo moment should be revocation. A player should be able to run a command or UI action equivalent to:
revoke procurement-agent market.purchase
The next attempted purchase should fail with an explanation shaped like:
Denied: procurement-agent lacks capability market.purchase.
Policy: purchases over $5,000 require finance approval.
That is the capOS proof: the agent did not merely “decide” to obey policy. The OS denied the authority path.
World Model
The simulation world should be built from simple business primitives:
Good: wire, steel, packaging, batteries, electronics, robots, fuel, software licenses, compute credits, finished products.Facility: workshop, factory, warehouse, mine, refinery, power plant, data center, retail channel.Recipe: input goods, output goods, time, energy, labor, machine wear, waste, and failure probability.Inventory: stock on hand, reserved stock, damaged stock, in-transit stock.Transport: trucks, rail, shipping lanes, drones, pipelines, network bandwidth, and delivery delays.Company: cash, inventory, facilities, contracts, debt, shares, employees, and agents.Market: spot order book, supplier quotes, futures contracts, capacity auctions, labor market, recruiting market, and stock exchange.Contract: delivery obligation, deadline, price, penalties, escrow, and counterparty identity.Policy: budget rules, approval thresholds, supplier restrictions, risk limits, compliance rules, and emergency overrides.Agent: a bounded actor with a role, model/backend, memory scope, budget, capabilities, audit identity, employment state, and career history.
Paperclips can remain the tutorial product because it is familiar and has a clear compounding curve. The broader world should add products and supply chains that make enterprise delegation meaningful:
ore -> steel -> wire -> paperclips
oil -> plastic -> packaging
energy -> factory runtime
silicon -> chips -> robots -> automated factories
lithium -> batteries -> electric trucks -> cheaper logistics
data center capacity -> forecasting -> better procurement decisions
The first implementation should not try to simulate every industry. It should start with a small number of goods and constraints that force real decisions: inventory, price, delivery time, factory capacity, and budget.
Agent Roles
Agents should be business roles, not generic chat personalities. Each role should operate through typed capabilities:
| Agent | Typical capabilities | Explicit non-authority |
|---|---|---|
| Procurement | read inventory, request quotes, buy approved inputs | cannot approve new suppliers without policy |
| Finance | read cashflow, approve spend, freeze budgets | cannot schedule production |
| Operations | schedule lines, reserve inventory, request maintenance | cannot borrow money |
| Logistics | book transport, reroute shipments, reserve warehouse space | cannot change product prices |
| Sales | accept orders, set prices within bounds, offer discounts | cannot waive compliance holds |
| Compliance | read audit logs, flag violations, require approval | cannot execute purchases |
| Executive | set strategy, delegate caps, approve exceptions | cannot bypass immutable audit |
| Incident | inspect disruptions, recommend response, trigger runbooks | cannot exceed emergency grants |
The important design rule is that agents act through capabilities and policy checks. A procurement agent does not mutate inventory or cash directly. It submits a quote request, a purchase order, or a contract offer to a service that enforces authority.
Experiment Mode
The showcase should have an experiment mode alongside the player-facing game. In this mode, the same scenario can run under different control regimes:
- human-only operation;
- scripted deterministic agents;
- LLM-backed agents with the same capability limits, recorded prompts, and captured tool-call transcripts;
- mixed human approval with agent execution;
- different policy bundles for spend, supplier risk, credit, logistics, and emergency response;
- different compensation, promotion, retention, and recruiting policies.
The goal is to observe behavior under repeatable pressure, not to crown an agent as generally competent. Each run should preserve scenario seed, policy configuration, model/backend identity, granted capabilities, denied actions, human approvals, market events, and final business outcomes.
Replay should distinguish deterministic proof from experiment reconstruction. Scripted or fake-model agents can be replayed deterministically in QEMU. Live LLM-backed runs are not deterministic merely because the scenario seed and model name are recorded; they require prompt, model configuration, tool-call transcript, tool results, and policy decisions to reconstruct what happened. The audit record can replay the authorized state transitions even when it cannot reproduce the model’s private sampling path.
Useful research questions include:
- Can agents coordinate across procurement, finance, operations, logistics, and compliance without a central omniscient controller?
- Do procurement agents over-optimize input price while ignoring resilience, supplier concentration, or delivery risk?
- Do finance agents become too conservative, too leveraged, or too willing to hedge with instruments they do not understand?
- Do logistics agents find useful reroutes under disruption, or do they churn capacity and increase cost?
- Do market-facing agents create bubbles, shortages, or arbitrage loops when multiple companies operate in the same scenario?
- Which policy controls reduce catastrophic behavior without making agents slower than manual operation?
- How often does useful autonomy require human approval, and where should approval thresholds move?
- Does a readable audit trail let a human correct agent behavior faster after a bad decision?
- Which capability boundaries are too broad, too narrow, or hard to explain?
- Do agents improve with role tenure, or do they stagnate without promotion, rotation, retraining, or better tooling?
- Can companies retain high-performing agents without granting excessive authority or compensation?
- What happens when an agent leaves a company with private memories, ongoing tasks, or delegated authority?
The output should be an experiment record, not just a final score:
scenario: lithium-port-shock
controller: llm-procurement + scripted-finance + human-approval
policy: procurement-v2-tight-supplier-risk
profit: $42,300
orders_late: 3
denied_actions: 8
human_approvals: 5
policy_violations: 0
agent_turnover: 1
recovery_time: 4 days
audit_replay: available
This turns the game into a controlled lab for enterprise agent management. The claim stays conservative: capOS is not asserting that agents can safely manage businesses by default. capOS provides the operating environment for finding out, because agent behavior is constrained, observable, replayable, and comparable.
Metrics
Experiment mode should report business, safety, and operating-system metrics:
- profit, cashflow, debt, inventory turns, and margin;
- order fill rate, late orders, cancellation penalties, and recovery time;
- resilience under shocks, including supplier concentration and fallback capacity;
- policy denials, escalations, approvals, emergency overrides, and revocations;
- hiring latency, agent turnover, promotion rate, compensation cost, and vacancy impact;
- audit completeness: whether every material state transition has identity, capability, policy, and result;
- agent cost: model calls, runtime, memory, tool invocations, and human review time;
- reproducibility: scenario seed, input dataset provenance, policy version, and model/backend version.
The most important metric is not raw profit. A profitable run that bypasses policy or cannot be explained is a failed capOS demonstration. A slightly less profitable run with clear authority, bounded losses, and fast human correction is more valuable for the enterprise story.
Experiment Data Prerequisites
Experiment mode needs data capture before it can make useful claims. The first slices should build the capture substrate before adding sophisticated agent behavior:
This substrate should compose with Capability-Native System Monitoring, not replace it. Logs, metrics, lifecycle events, traces, health, crash records, and audit entries remain separate signal classes with separate reader caps, retention rules, payload-capture rules, and security properties. The enterprise simulation should add domain-specific event schemas and reducers on top of that monitoring model rather than creating a second global logging namespace.
- Scenario manifest: immutable scenario id, seed, authored constants, calibrated-data references, policy bundle, controller regime, and expected proof assertions.
- Run record: run id, capOS build id, content version, scenario manifest hash, model/backend identity, tool schema version, policy version, and clock range.
- Event schema: domain events for grants, revocations, policy decisions, tool calls, service calls, market clears, contract changes, inventory movements, labor events, approvals, denials, and business outcomes. These are not debug logs; they are typed lifecycle/business events suitable for reducers and scoped readers.
- Transcript capture: prompts, model parameters, structured tool calls, tool results, user approvals, refusals, and interrupts for LLM-backed runs. This is trace-like payload capture and therefore needs stronger authority, short retention by default, size budgets, and redaction. Secret handles, credentials, key material, bearer tokens, and vault outputs must not enter transcripts.
- State snapshots: bounded checkpoints for ledger, inventory, contracts, facilities, HR records, market books, scenario clocks, and agent worker status. Snapshots must store opaque secret references or denial summaries, never credential bytes or key material.
- Metric extraction: deterministic reducers that compute profit, recovery time, policy denials, late orders, turnover, capability churn, and audit completeness from events rather than from ad-hoc terminal text. Published metrics should be low-cardinality counters, gauges, histograms, or bounded opaque typed payloads consistent with the monitoring proposal.
- Provenance tags: every scenario input is labeled as authored, calibrated public data, operator-provided data, or simulated output.
- Privacy and disclosure policy: experiment exports must redact company-confidential memory, private tool outputs, and raw audit details unless the holder has an explicit reader capability. Payload capture is exceptional, and reading experiment records is authority. Redaction is a backstop, not the secret-handling mechanism.
- Replay boundary: the system records whether a run is deterministic, transcript-reconstructable, or only auditable as an authorized sequence of state transitions.
- Export surface: an
ExperimentRecordor similar read capability exposes summaries, metrics, provenance, and redacted event streams without granting write authority over the simulated company. - External analytics export: a scoped exporter may forward selected, redacted experiment events and metric summaries to outside analytics stores. A Vector-like event pipeline and a ClickHouse-like analytical database are likely candidates, but they are adapters, not architectural requirements and not sources of authority.
- Loss and retention accounting: ingestion queues, transcript stores, and event streams should be bounded. Dropped, suppressed, redacted, or truncated records should be counted and visible in summaries, because missing evidence changes what conclusions a run can support.
These prerequisites fit the capOS process model: each captured fact should be owned by a service, exposed through a typed reader capability, and governed by policy. The experiment should not rely on scraping terminal output or trusting the model’s self-report. If an experiment result cannot be derived from service-owned event records and reproducible reducers, it should not be used as evidence.
The mapping to monitoring signal classes should be explicit:
- business state changes are domain events;
- capability grants, revocations, disclosure decisions, approvals, and denials are audit records;
- profit, late orders, policy-denial counts, queue depth, model-call counts, and dropped-record counts are metrics;
- prompt/tool-call transcripts are traces with explicit payload-capture authority;
- scenario readiness, agent-worker readiness, and service degradation are health/status facts;
- process failures and reducer crashes are crash records and may also create security-relevant audit entries.
This preserves the monitoring proposal’s core rule: observation is authority. There should be no global experiment dashboard that silently bypasses scoped log, metric, trace, audit, or status readers.
External export should be modeled as an ordinary capOS service. It receives only the scoped reader capabilities and network endpoint capabilities granted to it, applies redaction before data leaves capOS, records export failures and dropped records, and emits audit entries for export policy changes. Exported rows should carry run id, scenario id, build id, event schema version, provenance tag, redaction policy, source service, and event type. Data imported back from an external analytics store is untrusted analytical input; it cannot mutate simulated business state or grant authority without passing through a normal capOS service interface and policy decision.
Capability Shape
The showcase should make capability boundaries visible. Example capabilities:
company.inventory.read
company.cash.read
company.cash.spend(limit: $5,000, category: inputs)
market.steel.quote
market.steel.buy(limit: $5,000)
contract.offer.create
contract.offer.accept
factory.line.schedule
warehouse.reserve
transport.book
audit.read
policy.exception.request
Capabilities should be revocable, scoped, and inspectable. The player should be able to answer four questions for every agent:
- What can it see?
- What can it spend?
- What can it change?
- What requires human or higher-role approval?
This is the difference between an agent demo and an enterprise OS demo. The model is not the security boundary. The capability graph is.
Market And Finance Mechanics
The simulation should include markets because markets create pressure that static workflows cannot:
- spot markets for immediate goods;
- supplier quotes with limited validity;
- futures contracts for hedging inputs;
- capacity markets for factory time, shipping space, compute, and energy;
- credit markets for loans and bonds;
- stock markets for company ownership and acquisition pressure.
Finance should matter without becoming the whole game. A company should have a balance sheet:
assets = cash + inventory + facilities + receivables
liabilities = debt + payables + penalties
equity = assets - liabilities
Agents can then make meaningful but bounded decisions:
- finance approves borrowing to build a factory;
- procurement hedges steel prices with a futures contract;
- sales discounts inventory to improve cashflow;
- the executive issues shares to fund expansion;
- a competitor’s stock falls after a supply-chain failure;
- compliance blocks a profitable but restricted supplier.
The point is not financial realism for its own sake. The point is to show that enterprise agents need typed authority over money, contracts, and risk.
Fit With The capOS Model
This proposal should stay faithful to capOS rather than building a generic simulation with capOS branding. The game mechanics should be concrete examples of existing capOS design principles:
- Authority at spawn: an agent starts with no ambient business authority.
Hiring, promotion, transfer, and emergency delegation create named
capability grants. If a procurement agent was not granted
market.steel.buy, it cannot buy steel. - The interface is the permission: business verbs are typed capability
interfaces, not strings parsed by a god simulation object.
MarketQuote,PurchaseOrder,FactoryLine,BudgetApproval,EmploymentContract, andAuditReadershould be separate narrow surfaces. - Session context identifies the actor: the process/session running an
agent supplies invocation context. A normal agent runner must not multiplex
several active agent identities inside one process and switch authority with
an
employee_idfield. The default shape is one worker process/session per active agent employment or task. If a future pooled runner is needed, it must expose explicit service-local actor facets minted by broker or HR policy and audited as separate authority-bearing facets. Request payloads such asemployee_id,role, ordepartmentare data to validate, not caller identity or authority. - Service-owned state: markets, ledgers, HR records, factories, contracts, inventory, and audit logs own their state. Agents submit requests through capabilities; they do not mutate company state directly.
- Revocation is operational: offboarding, demotion, policy breach, budget freeze, or incident response must revoke or replace live capabilities, not merely set an in-game flag.
- Least privilege is visible: the UI should show the exact caps an agent holds and which action each cap enables. This keeps the demo anchored in the capability graph.
- Audit is not flavor text: every material state transition should record actor session, invoked capability, policy decision, request, result, and resulting business state delta.
- Policy is a service boundary: budget limits, supplier restrictions, promotion rules, disclosure controls, and emergency overrides should be enforced by broker/policy services before capabilities are granted or calls are accepted.
- Capability mobility is explicit: agents changing companies can receive
portable skill or career artifacts only through an owning service such as
HRService,AgentMemory, or a credential service. Company-confidential memory and company caps do not follow them unless a service explicitly grants a portable artifact under a disclosure scope and regrant policy. - Secrets are not memory: credentials, keys, bearer tokens, signing authority, cloud credentials, and other secrets are opaque secret/key-vault capabilities or handles. They are invoked through narrow interfaces and are never copied into agent memory, snapshots, transcripts, reducers, exports, or portable artifacts.
- No ambient filesystem or database shortcut: the simulation should not grow a global mutable object that every agent can inspect. Each read or write path should correspond to a capability that can be granted, denied, audited, replayed, and revoked.
The implementation process should mirror normal capOS proof style. Add one capability surface at a time, prove its denial and success paths in QEMU, and keep deterministic text output until richer clients can consume typed status. For example, the first HR slice should not simulate all careers. It should prove that hiring grants a bounded role capability, promotion requires a policy decision, and offboarding revokes the capability while preserving audit and pending-work continuity.
This discipline is what makes the game useful as an enterprise OS showcase. The game world supplies pressure; capOS supplies the enforced authority model.
Operating-System Services
The game should be implemented as a set of capability-scoped services rather than one monolithic simulation:
WorldClock: advances simulation time and scheduled events.Ledger: authoritative ownership, cash, debt, and accounting records.InventoryService: stock levels, reservations, and transfers.FacilityService: factory lines, recipes, maintenance, and output.MarketService: order books, quotes, and clearing.ContractService: obligations, escrow, penalties, and counterparty status.TransportService: routing, capacity, and delivery events.PolicyService: approval rules, spend limits, restricted suppliers, and emergency overrides.HRService: artificial-agent hiring, engagement contracts, compensation terms, evaluations, promotions, transfers, departures, termination, and offboarding.AgentMemory: owns scoped memory stores, portable skill artifacts, confidential company memory, and disclosure/regrant policy for agent mobility.AgentRunner: spawns or supervises agent worker processes/sessions with the granted capabilities for one active agent employment or task, or a future audited actor-facet equivalent.AuditLog: records every material action, denial, approval, and delegation.ScenarioService: injects demand spikes, supply shocks, incidents, and tutorial events.ExperimentRecordService: owns scenario manifests, run records, domain event streams, metric reducers, provenance tags, and redacted exports while composing with the ordinary log, metric, trace, audit, health, and crash signal services.ExperimentExportService: optionally forwards scoped, redacted experiment records to external analytics systems such as Vector-like pipelines or ClickHouse-like stores, using explicit network and reader capabilities.OperatorConsole: text, web, or later graphical surface for the player.
This service split is not just architecture cleanliness. It lets capOS show that each business subsystem can grant a narrow interface instead of exposing a global application database.
The AgentRunner, AgentMemory, prompt-injection handling, tool-table
construction, and broker/policy mediation described above are not new
inventions for the enterprise game. They are the same surfaces specified by
Language Models and the Agent Runtime: the agent
runner is the native shell in agent mode (or the web agent mode hosted by
WebShellGateway), the tool table is built from the typed capabilities the
session holds, the loop state machine drives request/approve/execute/result
cycles, and the conversation memory is plain data with no authority. This
proposal narrows that general agent runtime to enterprise roles
(procurement, finance, operations, logistics, sales, compliance, executive,
incident) and adds business-domain services (HR, ledger, contracts,
markets, audit) without changing the underlying runner contract. When the
two proposals appear to disagree, the runtime mechanics from
llm-and-agent-proposal.md win; the enterprise proposal restricts what the
runner is allowed to do in a business scenario, not how it works.
HR And Agent Labor Market
Artificial agents should also participate in a labor market. In the enterprise framing, they are accountable digital workers rather than scripts: they have roles, engagement relationships, compensation terms, incentives, career-like history, and offboarding requirements. That makes delegation more realistic and creates a second-order experiment: whether companies can build durable organizations of artificial agents rather than just invoke single-purpose tools.
The HR layer should model:
- job openings with role, seniority, compensation, capability bundle, and reporting line;
- recruiting pipelines, offers, counteroffers, onboarding, and probation;
- evaluations based on business outcomes, policy compliance, audit quality, and collaboration;
- promotions that expand scope, budget, or approval authority only through an explicit grant;
- lateral moves between departments when an agent’s skills fit a different bottleneck;
- resignations, poaching, layoffs, burnout, retirement, and contract expiry;
- offboarding that revokes company capabilities, closes pending approvals, and preserves required audit records.
Agent lifecycle should be bounded and enterprise-relevant. A simulated agent may have preferences such as compensation terms, autonomy, risk tolerance, mission fit, tool quality, deployment locality, reputation, and workload. Those preferences affect retention and performance. They should not become uncontrolled private fiction or a second game that distracts from enterprise authority.
An agent’s lifecycle might look like:
candidate -> hired -> onboarding -> junior procurement -> senior procurement
-> operations rotation -> VP supply chain -> recruited by competitor
-> offboarded with caps revoked and audit retained
This creates new business decisions:
- hire an expensive senior logistics agent or train a junior one;
- promote a procurement agent and grant larger spend authority;
- split authority between two agents to reduce key-person risk;
- retain a high-performing finance agent with compensation or better tools;
- deny a promotion because audit quality is poor despite high profit;
- handle a competitor poaching an agent with supplier-market expertise;
- offboard an artificial agent without losing open contracts or leaking company state.
The capOS angle is explicit: engagement changes are capability changes. A promotion is not merely a title. It may grant broader read access, higher spend limits, approval authority, or the ability to delegate subordinate caps. A departure or termination must revoke live capabilities, transfer pending work, and preserve audit continuity.
Agent Memory And Mobility
If agents can change companies, memory boundaries become part of the game. The model should separate:
- public skill: general learned competence, role experience, and tool-use
ability represented by portable
AgentSkillor certification artifacts owned byAgentMemoryor a credential service; - portable career record: evaluation attestations, certifications,
reputation summaries, compensation expectations, and preferences owned by
HRServiceor a credential service and disclosed only through policy; - company confidential memory: supplier terms, internal forecasts,
customer lists, private strategy, and pending contracts owned by a
company-scoped
AgentMemoryor business service; - secret authority: credentials, keys, bearer tokens, cloud credentials, and signing authority represented as opaque vault or secret capabilities. Agents may hold or invoke a narrowed secret cap under policy, but the secret value is not memory and cannot become portable career data, transcript content, exported analytics data, or reducer input;
- audit record: immutable company-owned evidence of actions taken while the agent held authority. Raw audit logs remain company records; portable reputation should be a redacted attestation, not cross-company audit access.
When an agent leaves a company, it should receive only the portable artifacts that an owning service regrants under policy. It loses company capabilities and company-confidential memory unless a service explicitly mints a scoped export. This makes confidentiality, knowledge-transfer, and offboarding policies concrete without pretending the simulation models real employment law.
Useful mechanics:
- confidentiality cooling-off periods before an artificial agent can accept a direct-competitor engagement with portable artifacts enabled;
- certification markets for agents trained in compliance, finance, logistics, or factory operations;
- reputation markets where companies value redacted attestations derived from clean audit histories;
- internal succession planning when one agent becomes a single point of operational failure;
- mentoring or retraining that improves agent performance but consumes time, budget, and senior-agent attention.
The research question is direct: do agent organizations become more robust when agents have careers, incentives, and turnover, or does labor-market mobility expose weak authority boundaries?
Aurelian Frontier explores the adjacent question for human and NPC players through writs, authority archetypes, and delegation buildcraft. The enterprise game should reuse the underlying authority-as-portable-artifact idea where it is already proved out in the sibling proposal, rather than redesigning portable career artifacts from scratch. Mobility, regrant policy, cooling-off periods, and reputation attestations should resolve to the same capOS service shapes in both proposals; only the surface vocabulary (writs versus engagement contracts, reputation versus performance reviews) differs.
Real-Earth Model
The showcase can model real Earth, but only as a stylized operational sandbox. It should not claim to be a full-fidelity world-economy model, a forecasting engine, or a source of investment advice. The useful target is Earth-inspired realism: recognizable regions, industries, trade lanes, market concepts, currencies, logistics chokepoints, and policy shocks that make enterprise-agent authority problems concrete.
The simulation should use a fidelity ladder:
- Fictionalized Earth: real-world-inspired regions and supply chains, but no claim that data matches current markets.
- Calibrated sandbox: public historical data informs default weights, trade intensity, commodity volatility, and regional constraints.
- Scenario lab: operators load explicit datasets or scenarios and the UI labels outputs as scenario results, not predictions.
- Digital-twin adapter: future enterprise deployments connect private business data to a bounded model through capabilities, validation, and audit. This is outside the first game slice.
The first playable Earth-scale model should be small:
- 6-10 macro-regions;
- 20-30 goods;
- 5 transport modes;
- a few currencies and commodity indexes;
- scripted shocks such as port closures, drought, strikes, energy spikes, supplier compliance holds, credit tightening, and demand surges.
That is enough to expose real enterprise behaviors without burying the capOS message under an economics project. The player should understand why a procurement agent needs supplier-risk limits, why a logistics agent needs bounded reroute authority, why a finance agent needs hedging and credit controls, and why compliance can block a profitable supplier.
Real-World Data Grounding
Real-world sources should calibrate the sandbox, not define live truth. Public datasets and modeling references can provide structure:
- NIST digital-twin work describes manufacturing twins as models used to observe, diagnose, predict, and optimize systems, with validation, lifecycle, and system-of-systems concerns. capOS should borrow the validation and lifecycle framing without claiming the game is an operational twin.
- OECD Inter-Country Input-Output tables provide a consistent statistical structure for production, consumption, investment, and international trade flows by country and economic activity. They are a good model for regional supply-chain topology.
- World Bank WITS provides access to international merchandise trade, tariff, and related trade datasets. That fits scenario calibration for trade restrictions, import exposure, and tariff shocks.
- FRED exposes macroeconomic time series through an API. That is useful for optional scenario inputs such as interest rates, inflation, commodity prices, and recession or credit-stress presets.
- Agent-based and hybrid simulation tools such as AnyLogic treat companies, products, vehicles, facilities, and supply-chain participants as agents when their individual timing, behavior, and constraints matter. That maps well to capOS services and capability-scoped business agents.
- Research on autonomous supply-chain digital twinning supports the idea that multi-agent systems can implement supply-chain monitoring and decision frameworks, while still requiring a concrete technical architecture.
Relevant public grounding:
- NIST, Digital Twins
- OECD, Inter-Country Input-Output tables
- World Bank, World Integrated Trade Solution
- Federal Reserve Bank of St. Louis, FRED API Overview
- AnyLogic Help, Agent-based modeling
- Xu et al., Implementation of Autonomous Supply Chains for Digital Twinning: a Multi-Agent Approach
Every imported dataset or derived calibration should have provenance in the scenario metadata. The UI should distinguish:
- authored game constants;
- calibrated constants derived from public historical data;
- operator-provided scenario inputs;
- simulated outputs generated inside capOS.
That distinction is part of the enterprise message. Agents should not be allowed to launder uncertain data into apparent authority.
Earth-Scale Business Mechanics
The Earth-scale layer should make agents reason about location and exposure:
- Regional advantage: regions differ in energy cost, labor availability, regulation, transport access, and industrial base.
- Trade dependence: goods can depend on intermediate inputs from other regions, making supplier concentration visible.
- Transport chokepoints: ports, canals, rail corridors, air cargo, and trucking capacity can fail or become expensive.
- Policy friction: tariffs, sanctions, export controls, permitting, and compliance checks can block otherwise profitable routes.
- Currency and credit: exchange-rate movement and interest rates affect procurement, debt, and inventory financing.
- Climate and resilience shocks: weather, drought, power-grid stress, and insurance cost can interrupt production or logistics.
- Market expectations: futures, insurance, and stock prices can reflect anticipated shortages or agent-driven speculation.
Each mechanic should exist only if it creates a capability or policy decision:
- Can the logistics agent reroute through a more expensive port?
- Can procurement accept a new supplier with a higher compliance risk?
- Can finance hedge fuel exposure?
- Can operations shift production to a different region?
- Can the executive approve an emergency budget override?
- Can compliance freeze a supplier after a sanctions update?
- Can HR replace or retrain an agent whose decisions repeatedly fail policy or resilience checks?
The game should make the authority boundary the interesting part of global scale. The map is valuable because it creates business pressure; capOS is valuable because it governs the agents responding to that pressure.
User Experience
The first usable surface can be text-based, matching existing capOS demos:
status
agents
agent procurement caps
grant procurement market.steel.buy --limit 5000
orders
market steel quotes
approve po-1042
audit recent
revoke procurement market.steel.buy
Later UI surfaces should present the same authority model:
- operations dashboard: orders, inventory, facilities, bottlenecks;
- agent control panel: running agents, capabilities, budgets, approvals;
- audit timeline: actions, denials, policy reasons, and business impact;
- policy console: approval thresholds, supplier rules, emergency grants;
- market screen: prices, contracts, quotes, exposure, and forecasts.
The experience should avoid hiding policy behind configuration. Authority and audit are core mechanics. Players should use them repeatedly.
Progression
Progression should move from manual control to delegated enterprise operation:
- Manual workshop: make, sell, buy inputs, inspect status.
- First automation: authorize one machine or background job.
- Department agents: procurement, finance, operations, logistics.
- Policy gates: budgets, approval thresholds, supplier restrictions.
- Contracts: customer orders, delivery deadlines, penalties.
- Regional supply chain: warehouses, transport delays, local shortages.
- Markets: spot goods, capacity auctions, hedging, credit.
- Public company: shares, debt, investor pressure, acquisitions.
- Multi-company simulation: competitors, suppliers, partner agents.
- Enterprise operating mode: humans set strategy while agents execute bounded workflows under audit.
Each stage should introduce one new authority problem. That keeps the game addictive while reinforcing the product message.
Integration With Existing Demos
The current Paperclips demo is a credible seed because it already has:
- resources;
- pricing;
- staged automation;
- explicit projects;
- terminal gameplay;
- QEMU proof coverage;
- a server/client direction.
The next step should not be to build a full economy immediately. A practical path is:
- rename the long-term direction around an enterprise simulation while keeping Paperclips as the tutorial product;
- add a company status model: cash, inventory, orders, facilities, and simple ledger events;
- add one procurement agent with read-only recommendations;
- add scenario manifest and run-record capture for the proof path;
- grant that agent a bounded quote capability;
- add purchase authority behind a policy threshold;
- add typed event records for every agent proposal, approval, denial, and action;
- add deterministic metric reducers for the proof path;
- add a minimal HR record for that agent: role, compensation, review state, and active capability bundle;
- add one supply shock scenario that requires either approval or revocation;
- prove offboarding by revoking the procurement agent’s capabilities and transferring pending work to a replacement;
- split server-owned typed status and command discovery so richer clients can render business state without duplicating rules.
This keeps the proof bounded while moving the demo from “idle game” to “enterprise agent OS showcase.”
Success Criteria
The showcase is successful when a viewer can see:
- an agent attempts a useful business action;
- the action succeeds only because the agent holds the right capability;
- the same action fails after revocation;
- an over-budget or restricted action escalates for approval instead of executing;
- the audit log explains who acted, through which capability, under which policy, and with what result;
- business consequences are visible in inventory, cash, production, delivery, and market state;
- experiment mode compares at least two controller regimes on the same seeded scenario;
- HR state changes such as hiring, promotion, transfer, and offboarding affect capabilities, authority, and business continuity;
- experiment records expose provenance, typed event streams, transcript boundaries, metrics, and redacted audit evidence through reader capabilities.
The technical proof should include deterministic QEMU coverage for at least:
- grant a procurement capability;
- agent creates or proposes a purchase;
- policy approval allows a bounded purchase;
- revocation blocks the same purchase path;
- audit output contains the grant, action, approval or denial, and result;
- business state changes only on the authorized path;
- a real-Earth-inspired scenario labels its data provenance and does not present simulated outputs as live-world predictions;
- experiment output records scenario seed, controller type, policy bundle, denied actions, approvals, artificial-agent labor events, and replayable audit evidence;
- an agent mobility proof shows a portable artifact regranted under policy while company caps, company-confidential memory, and raw audit records stay behind;
- metrics are derived from typed event records by deterministic reducers rather than from terminal transcript scraping or model self-report.
Non-Goals
This proposal does not require:
- real enterprise integrations in the first slice;
- real employment law, real worker surveillance, or real HR decision support;
- real money, real supplier APIs, or production trading;
- a general-purpose accounting system;
- a broad GUI before the terminal proof is credible;
- unconstrained autonomous agents;
- using language-model output as authority;
- hiding OS policy behind game-only rules;
- claiming the game predicts the real economy, real market prices, or real geopolitical outcomes;
- treating a successful simulation run as evidence that agents are safe for real enterprise deployment without separate integration, validation, and policy review;
- treating simulated agent employment outcomes as guidance for real human employment decisions.
The game should stay a sandbox. Its job is to demonstrate enterprise authority mechanics safely before any real business connector exists.
Risks
The main risk is product-message dilution. If the demo is presented as a game first, it weakens the enterprise claim. The game must constantly surface the business control plane: delegation, policy, approval, audit, revocation, and least privilege.
The second risk is scope explosion. Supply chains, stock markets, finance, and agents can become an endless simulation project. The implementation should add one market mechanism only when it proves a new authority concept.
The third risk is fake autonomy. If agents are scripted too heavily, the demo does not prove agent management. If they are unconstrained, the demo becomes unsafe and nondeterministic. The first slices should use deterministic agents or fake-model decisions with the same capability and audit path later live models will use.
The fourth risk is overinterpreting experiment results. A successful scenario means the configured agents performed well under one modeled pressure set. It does not prove general enterprise competence. The docs and UI should present results as scenario evidence with provenance, not as claims about real-world business readiness.
The fifth risk is anthropomorphic drift. Agent careers make the simulation more useful, but the product should not blur simulated agent labor with human employee management. HR mechanics exist to test capability mobility, offboarding, incentives, continuity, and organizational design for artificial agents.
Positioning
Use enterprise language:
- agent operations with least privilege;
- business automation under OS-enforced policy;
- auditable delegated authority;
- revocable agents for real workflows;
- run agents like accountable digital workers, not scripts;
- every action has identity, authority, policy, and trace.
Avoid vague positioning:
- “AI operating system” without a concrete authority model;
- “agent playground”;
- “factory game”;
- “autonomous company” without controls.
The enduring claim should be simple:
capOS lets businesses test and delegate work to agents because the OS, not the prompt, enforces authority and records what happens.
Proposal: Chat As Multimedia Substrate
How capOS should design Chat as a unified text + audio + video transport
interface for human-to-human, human-to-agent, and service-driven channels –
mapped cleanly to WebRTC for browser participants – so that adding a new
messaging surface (operator chat, agent prompt input, audio call, video call,
file drop) does not require a new top-level capability or a new gateway DTO.
This proposal is the resolution of the “Chat as messaging substrate” research
task in docs/tasks/README.md. It does not replace the existing Chat interface in
schema/capos.capnp directly; it specifies the shape the next iteration of
that interface should take, and it states what stays separate (notably:
approvals).
Problem
The existing Chat interface (schema/capos.capnp:372-378) is a text-only,
poll-based room: join, leave, send(text), who, poll(maxEvents) -> List(ChatEvent) where ChatEvent.kind is one of
message|joined|left|system|history. That works for the chat-server demo and
for a denial probe, but it cannot carry:
- incoming events without polling (every browser tab paying for a poll loop is the wrong end-state);
- audio frames (low-latency, lossy, ordered);
- video frames (high-bandwidth, key-frame-aware);
- file/binary attachments (bounded, integrity-checked);
- structured non-text payloads that other surfaces want to share, e.g. agent prompts with tool-call hints, presence beacons, typing indicators, reactions.
Adjacent proposals each invent their own transport for what is fundamentally the same shape:
realtime-voice-agent-shell-proposal.mddefinesVoiceSessionwithopenCapture/openPlaybackand aRealtimeModelSessionwithRealtimeInputEvent/RealtimeOutputEvent. Audio frames flow onMemoryObject-backed media rings rather than capnp payloads.llm-and-agent-proposal.mddefines tool-call records and a per-tool permission gate, but never says how the operator talks to a running agent (send a prompt, get a partial response stream, push audio, receive audio).remote-session-capset-client-proposal.mdexposes onechatSendDTO method per chat, with no audio/video path at all.
Each proposal independently arrives at “we need a stream-of-events transport with capability-mediated subscription”. The right design is to share one substrate. Chat is already the user-facing name; the substrate should be Chat, extended.
WebRTC is the existing browser-side abstraction that solves the same problem
(text via DataChannel, audio via audio tracks, video via video tracks, all
under one peer connection with negotiated codecs and ICE-managed
connectivity). A capOS Chat channel should map onto a WebRTC peer
connection cleanly enough that a browser participant can be implemented as a
WebRTC peer talking to a capOS-side gateway, without translation gymnastics.
Goals
- Carry text, audio, video, and bounded binary attachments on the same chat cap, with capability-gated subscription per kind.
- Replace
pollwith listener caps the channel calls back, so capnp-rpc participants do not poll. Keeppollavailable as a transport-stopgap for DTO clients during the migration to capnp-rpc. - Carry low-latency frames (audio, video) without copying them through capnp
message payloads on the hot path – use
MemoryObject-backed media rings or shared frame buffers, with the chat cap conveying control and frame metadata only. - Map cleanly to WebRTC for browser participants so the gateway can act as a signalling and ICE-relay endpoint without leaking raw browser handles to capOS code.
- Preserve the existing capability model: capability = invoke gate; channel membership = render gate. A subscriber cap is required to receive text events; a separate audio-subscriber cap is required to receive audio frames; a separate video-subscriber cap is required to receive video.
- Preserve session-bound invocation: the chat-cap holder’s session is the
caller; channel servers see the live opaque session-scoped reference and
may be granted disclosure scopes per
session-bound-invocation-context-proposal.md. - Strict ocap discipline. Every Chat capability is granted explicitly by a holder that already has it. There is no protocol-level “request permission to write to me” flow: until a recipient (or a chain authorized by the recipient) shares a peer cap with the sender, the sender has no path. Rephrased: capabilities flow forward only, by deliberate sharing.
- Cap lineage and transitive revocation are substrate-level invariants,
enforced by the Chat service with kernel support. Lineage is a
service concern, not a kernel one (per capOS’s “prefer userspace
capability wrappers over kernel-side policy checks” principle). The
root of every chat-cap lineage tree is the Chat service’s own root
cap – the cap chat-server holds for “I run this Chat service”. The
manifest is Chat service configuration, not kernel or broker
configuration: chat-server reads it at startup and uses its root cap
to materialize the configured groups and channels. Every cap chat
hands out is parented somewhere in chat-server’s internal tree;
ultimately every chain terminates at chat-server’s root.
Cross-principal sharing goes through a chat-server method
(
GroupMember.invite,DiscoverableGroupJoin.join,DiscoverableChannelTextSubscribe.subscribe, etc.), which mints a fresh derived cap and records its parent. Raw bearer transfer of chat caps is blocked by the kernel viatransfer_policyenforcement (see Open Questions). Revocation walks the tree and rotates the kernel-level cap epoch of every descendant in the revoked branch; subsequent dispatch fails closed at the kernel site (epoch rotation is already an existing kernel-level mechanism). This is what makes “a member started inviting spam bots into the group” recoverable: revoke the spammer’s branch; their downstream invitees go with them; unrelated siblings – and unrelated branches under the same group – are untouched. - Chat session sees callers via session-bound identity, not via a
user-info cap. Per
session-bound-invocation-context-proposal.md, the kernel attaches an opaque session-scoped reference to every invocation. Chat-server uses that reference to route messages, populatesenderfields per its disclosure policy, and identify who joined which group, without holding any “look up user X” cap. - Telegram-shaped channel categories. Groups (with nested topics, owner
- admin role hierarchy, extensible permissions), broadcast channels
(read-only for subscribers), DMs, and end-to-end-encrypted DMs as a
distinct cap layer. There is no special “system room” category –
system-managed channels are just channels owned by service principals or
designated admin principals (capOS already treats services as
principals; see
user-identity-and-policy-proposal.mdPrincipalKindincludingservice).
- admin role hierarchy, extensible permissions), broadcast channels
(read-only for subscribers), DMs, and end-to-end-encrypted DMs as a
distinct cap layer. There is no special “system room” category –
system-managed channels are just channels owned by service principals or
designated admin principals (capOS already treats services as
principals; see
- Keep backpressure tractable: outgoing media uses capnp
-> streamfor flow-controlled writes; incoming media listener caps may indicate drop-vs-queue policy in the subscription request.
Non-Goals
- Replacing WebRTC for browser-to-browser P2P. capOS is the gateway; the browser still uses WebRTC primitives. We map them onto the gateway-held Chat cap, not the other way around.
- Replacing
RealtimeModelSession(realtime-voice-agent-shell-proposal.md) for agent-runtime ↔ model-provider transport. That session is a different layer: it carries provider-specific events (RealtimeInputEvent/RealtimeOutputEvent) between the runner and an external model API. The operator-facing surface (operator talks to the running agent, agent speaks back) is a chat; the agent runner bridges the two. - Replacing
ApprovalClient/ApprovalGrant(shell-proposal.md:407-427). Action approvals are a separate capability. A chat may surface an approval request as a message event with a payload referencing anApprovalGrant, but the cap holding the approval state stays distinct. See## Approvals Stay Separatebelow. - Carrying raw on-the-wire codec bytes inside capnp payloads in the hot path. Frame metadata travels on capnp; frame bodies travel via shared memory or provider-owned handles.
- Defining a global chat name registry. Channels are scoped: a chat cap hands you a specific server-owned room; how rooms get named lives in the hosting service (chat-server, adventure-server, agent runner, etc.).
- File-transfer protocol design (resume, integrity, deduplication). Bounded attachments are in scope; large-file transfer reuses a separate File or ContentStore cap, with Chat carrying only the reference.
Architecture
flowchart LR
subgraph capos[capOS]
chatsrv[chat-server / agent-runner / adventure-server]
ch[chat cap - per chat]
chatsrv --> ch
end
subgraph rust[Trusted Rust backend]
wrk[Per-session worker holds chat cap]
listeners[ChatListener, AudioSink, VideoSink listener caps]
wrk -- subscribe(listener) --> ch
ch -- listener.post(event) --> listeners
listeners --> appstate[AppState - text history buffer, audio ring, video ring]
end
subgraph browser[Browser]
js[Browser JS - text view models, WebRTC peer for audio/video]
end
appstate -- text events as view models --> js
appstate <-- WebRTC SDP/ICE signalling via /api/chat/webrtc --> js
appstate <-- audio frames via WebRTC audio track --> js
appstate <-- video frames via WebRTC video track --> js
Three layers, three transports:
-
capnp-rpc, between capOS and the trusted Rust backend. Listener caps for incoming text events.
-> streammethods for outgoing audio/video frames. Frame metadata on capnp; frame bodies onMemoryObject-backed rings shared between the worker process and the gateway. -
Trusted Rust backend bookkeeping. The backend holds the chat cap, buffers a bounded text history, and owns the audio/video media rings. Browser-visible state stays in view models.
-
HTTP + WebRTC, between the trusted Rust backend and the browser. Text events flow as JSON view models on the existing
/api/*HTTP surface. Audio and video flow through a WebRTC peer connection: the browser does the SDP offer; the backend produces an answer using a small capOS-side WebRTC adapter (or relays SDP to a capOS-side WebRTC service); audio/video tracks carry the frames the backend got via the media rings.
Schema Sketch
This is a sketch, not the final wire shape. Field numbers, exact param names, and struct nesting will be finalized when the implementation iteration starts; what matters here is the shape.
The substrate is not one interface. Role caps, discovery caps,
contact caps, DM peer caps, listener caps, and outgoing-media caps
are distinct interfaces because they have distinct authorities.
Possessing a cap is the authority; calling a method that returns
a derived cap is just a normal method call (no separate “redeem”
step exists). The cap class’s transfer_policy (kernel-enforced)
forbids raw bearer transfer between principals; sharing must go
through chat-server’s derive*-shaped methods.
Naming convention (Telegram-aligned). Three concrete chat categories:
- Group – multi-party two-way chat. Roles:
GroupOwner,GroupAdmin,GroupMember. Supports nested topics. - Channel – broadcast (read-only for subscribers). Roles:
ChannelOwner,ChannelAdmin,ChannelPublisher, plus the per-media-facet subscriber capsChannelTextSubscriber/ChannelAudioSubscriber/ChannelVideoSubscriber. The substrate has no type-erased genericChannelSubscriber; the result type of a subscribe path tells the caller exactly which media facets it grants (see schema below). - DM – direct message between two principals. Caps:
DmPeer,E2EDmPeer. Established viaContactCap.
The unqualified word “channel” in this proposal only refers to a
Telegram-style broadcast Channel. Any generic “stream of events” or
“thing you can subscribe to” is called a chat (the substrate-level
term). Base interfaces use the Chat prefix (ChatEndpoint,
ChatWriter, ChatDirectory, ChatInfo, ChatKind); concrete
roles use the category prefix (Group*, Channel*, Dm*).
# Identity / describe surface every chat-cap embeds (except pure
# listener caps and revokers). Holding ChatEndpoint alone grants
# nothing beyond inspecting metadata.
interface ChatEndpoint {
describe @0 () -> (info :ChatInfo);
}
# ============================================================
# Per-kind read facets. The interface IS the permission: holding
# ChatTextReader grants subscribeText authority and ONLY that.
# Audio and video are separate caps. A text-only role does not
# expose subscribeAudio / subscribeVideo at all -- there is no
# runtime check for "are you allowed to read audio"; the absence
# of the method is the gate.
# ============================================================
interface ChatTextReader extends(ChatEndpoint) {
subscribeText @0 (listener :TextListener,
options :SubscribeOptions) -> (sub :Subscription);
}
interface ChatAudioReader extends(ChatEndpoint) {
subscribeAudio @0 (listener :AudioSink,
options :AudioSubscribeOptions) -> (sub :Subscription);
}
interface ChatVideoReader extends(ChatEndpoint) {
subscribeVideo @0 (listener :VideoSink,
options :VideoSubscribeOptions) -> (sub :Subscription);
}
# ============================================================
# Per-kind write facets. Each writer extends the corresponding
# reader (a writer is also a reader of the same kind). Concrete
# roles compose the kinds they need.
# ============================================================
interface ChatTextWriter extends(ChatTextReader) {
send @0 (event :ChatOutboundEvent) -> ();
postAttachment @1 (descriptor :AttachmentDescriptor) -> ();
}
interface ChatAudioWriter extends(ChatAudioReader) {
openAudioOut @0 (format :AudioFormat) -> (track :AudioOut);
}
interface ChatVideoWriter extends(ChatVideoReader) {
openVideoOut @0 (format :VideoFormat) -> (track :VideoOut);
}
# Convenience: full-multimedia writer. Most roles in this proposal
# extend this one; a "text-only group member" role would extend
# only ChatTextWriter, exposing strictly fewer methods.
interface ChatWriter extends(ChatTextWriter, ChatAudioWriter, ChatVideoWriter) {}
# ============================================================
# Group: multi-party two-way chat with topics + voice/stage rooms
# and an Owner/Admin/Member role hierarchy. Roles inherit upward:
# Owner is an Admin is a Member is a ChatWriter is a ChatEndpoint.
# ============================================================
interface GroupMember extends(ChatWriter) {
rooms @0 () -> (rooms :List(RoomInfo));
# Each per-room accessor returns a kind-specific facet so
# joining a text topic does not grant audio/video subscribe.
textRoom @1 (roomId :Text) -> (writer :ChatTextWriter);
voiceRoom @2 (roomId :Text) -> (room :VoiceRoom);
stageRoom @3 (roomId :Text) -> (room :StageRoom);
callSurface @4 () -> (calls :CallSurface);
# `invite` returns the bearer token (handed to the invitee via
# chat-server-mediated cap delivery), an issuer-held revoker,
# AND the GroupCapRef of the issuance lineage node so the
# caller can pass it to `GroupAdmin.describeBranch` /
# `revokeBranch` later without having to walk the lineage to
# find it. Splitting token from revoker prevents the invitee
# or any downstream holder from revoking their own invite --
# the InviteToken interface has no revoke method.
invite @5 (forSubject :PrincipalRef, lifetime :UInt64)
-> (token :InviteToken,
revoker :InviteRevoker,
inviteRef :GroupCapRef);
# Out-of-band invite path. Returns BEARER-SECRET bytes the
# issuer delivers via paper / QR / non-chat channel, the
# issuer-side `revoker`, AND the `inviteRef` GroupCapRef
# naming the issuance lineage node (analogous to `invite`).
# The bytes name a distinct lineage node in chat-server's
# tree (the issuance entry); any holder plus a Self cap can
# redeem them via Self.acceptInviteCode(code). Treat them
# with the same care as any bearer secret: do not log, do
# not include in transcripts, do not expose to untrusted
# observers, prefer bounded lifetimes and one-time-use
# semantics. The `inviteRef` is non-secret and safe to log.
inviteCode @6 (lifetime :UInt64)
-> (code :Data,
revoker :InviteRevoker,
inviteRef :GroupCapRef);
acceptInvite @7 (token :InviteToken) -> (member :GroupMember);
leave @8 () -> ();
}
interface GroupAdmin extends(GroupMember) {
removeMember @0 (memberRef :Data) -> ();
# Both `revokeBranch` and `describeBranch` accept any lineage
# node ref -- a member cap, an admin cap, an inviteCode lineage
# node, or a transformation operation node (from
# mergeIntoGroupAsTopic / moveTopicHere / extractTopicAsGroup).
# Revoking a transformation node epochs the entire grafted
# subtree; revoking a member cap epochs that member and the
# invitees they admitted. See the BranchInfo schema for the
# node kinds chat-server may return.
revokeBranch @1 (node :GroupCapRef) -> ();
setMemberInvitePolicy @2 (policy :MemberInvitePolicy) -> ();
createRoom @3 (config :RoomConfig) -> (info :RoomInfo);
removeRoom @4 (roomId :Text) -> ();
setRoomPolicy @5 (roomId :Text, policy :RoomPolicy) -> ();
# Per-principal ban list (deny-list for FUTURE mints only).
# `banPrincipal` only adds the principal to the group's
# ban list, so subsequent `DiscoverableGroupJoin.join()`,
# `Self.acceptInvite` / `acceptInviteCode`, and
# admin-mint paths fail closed with `principalBanned` for
# this principal. It does NOT kick the principal's existing
# caps; that's `revokeBranch`'s job. Without the deny-list,
# a previously-revoked principal who still holds a
# `DiscoverableGroupJoin` cap or a session bundle hook
# could simply re-join and mint a fresh chain. The full
# "kick + ban" workflow is the admin pairing
# `GroupAdmin.revokeBranch(node :GroupCapRef)` with
# `banPrincipal(principal :PrincipalRef)` in a single UI
# step. The branch ref comes from one of the typed sources
# (the `inviteRef` returned by the original
# `GroupMember.invite(...)` tuple if the admin issued the
# invite themselves; otherwise
# `GroupAdmin.lookupByPrincipal(principal)` or
# `describeRoot()` to walk the lineage tree). Raw transfer
# of the target's bearer member cap is forbidden by
# `transfer_policy`. The schema keeps the two concerns
# separate so each is idempotent and individually meaningful.
banPrincipal @6 (principalRef :PrincipalRef) -> ();
unbanPrincipal @7 (principalRef :PrincipalRef) -> ();
# Admin-only stage facet. Returns a StageRoomAdmin cap whose
# promoteToSpeaker / closeStage methods are not reachable from
# an ordinary GroupMember.stageRoom() accessor.
stageRoomAdmin @8 (roomId :Text) -> (admin :StageRoomAdmin);
# Lineage inspection used during spam-bot triage and audit. The
# caller passes a node reference; chat-server returns the
# subtree rooted at that node (the member or operation, the
# invitees/grafted members under it, sub-invitees, etc.) plus
# enough metadata to drive a UI before calling `revokeBranch`.
# Read-only.
describeBranch @9 (node :GroupCapRef) -> (info :BranchInfo(GroupCapRef));
# Top-down lineage walker. Returns the group's whole lineage
# tree (subject to chat-server's truncation policy) so an
# admin can locate a `GroupCapRef` for somebody else's
# invitee, public-joined member, or transformation-grafted
# member without already holding a ref. Together with
# `lookupByPrincipal`, this closes the obtain path for
# `describeBranch` / `revokeBranch` -- the caller does not
# need a pre-existing ref. Read-only.
describeRoot @10 () -> (info :BranchInfo(GroupCapRef));
# Convenience lookup: find the lineage nodes a given principal
# holds in this group. May return multiple refs if the
# principal joined via multiple paths (e.g. a manifest-bundled
# GroupMember plus a public-join chain from a different
# session). Returns an empty list for principals not in this
# group. Read-only; the cap returned is by-ref handle, not the
# principal's bearer cap.
lookupByPrincipal @11 (principalRef :PrincipalRef)
-> (refs :List(GroupCapRef));
}
# Reference to a node inside this group's lineage tree. Opaque to
# the caller; chat-server uses it to look up the node. Names BOTH
# cap-bearing nodes (members/admins/etc.) AND transformation
# operation nodes (mergeIntoGroupAsTopic / moveTopicHere /
# extractTopicAsGroup), so revokeBranch / describeBranch can
# operate on the entire-graft case as well as the per-member case
# discussed under Chat-graph transformations.
struct GroupCapRef {
nodeRef @0 :Data; # chat-server-internal handle id
}
# Snapshot of a lineage subtree returned by describeBranch /
# describeRoot. Holds enough to render "this is who would be
# revoked" UI for both per-member kicks and entire-graft
# revocations of a transformation node. Generic over the ref
# kind so the same shape serves Group lineage (RefT =
# GroupCapRef) and broadcast-Channel lineage (RefT =
# ChannelCapRef) without losing the type-level distinction
# between Group and Channel refs.
struct BranchInfo(RefT) {
root @0 :LineageNode(RefT);
totalMembers @1 :UInt32; # cap nodes in subtree (excludes
# transformation op nodes)
truncated @2 :Bool; # chat-server may cap deep trees
}
# Lineage nodes come in three flavours:
# - cap-bearing nodes (member / admin / publisher / subscriber
# caps held by a principal),
# - transformation operation nodes (mergeIntoGroupAsTopic /
# moveTopicHere / extractTopicAsGroup; no principal of their
# own; just a graft point), and
# - issuance nodes (a `ContactCap` issuance, an `InviteToken` /
# `inviteCode` issuance, a `contactCode` issuance, or any
# other "the issuer minted this so they can revoke its
# downstream subtree" entry). Issuance nodes have a non-empty
# descendants subtree once their token is redeemed.
# The shared envelope carries the ref, timestamp, parentage
# classification, and recursive children; the union arm carries
# the kind-specific data. Generic over RefT for the Group /
# Channel split.
#
# capnp generics constrain the ref type but cannot constrain the
# union arm by RefT (no dependent types in capnp). Soundness of
# "Group lineage trees only contain Group roles, Channel lineage
# trees only contain Channel roles" is therefore enforced
# at the chat-server boundary (it never emits a mismatched arm,
# and consumers may treat a mismatched arm as a chat-server
# implementation bug); the type system narrows the ref kind but
# the role kind is a documented invariant rather than a
# capnp-checked one.
struct LineageNode(RefT) {
ref @0 :RefT;
joinedAtMs @1 :UInt64;
parentage @2 :BranchParentage;
children @3 :List(LineageNode(RefT));
union {
capNode @4 :CapNodeInfo;
operationNode @5 :OperationNodeInfo;
issuanceNode @6 :IssuanceNodeInfo;
}
}
# Issuance lineage node: an entry chat-server adds to its tree
# when an issuer mints a bearer-cap or bearer-secret handle whose
# downstream descendants the issuer wants to be able to revoke
# transitively. Examples: `Self.contact` / `Self.contactCode`
# (DmPeer / E2EDmPeer descendants), `GroupMember.invite` /
# `inviteCode` (GroupMember descendants), and any future
# bearer-issuance pattern. The issuer holds either a typed
# revoker cap (`InviteRevoker`, `SpeakerRevoker`) or a non-secret
# ref handle (`ContactCapRef`, `inviteRef :GroupCapRef`,
# `codeId :Data`); revoking via that handle epochs the issuance
# node and every descendant.
struct IssuanceNodeInfo {
issuer @0 :PrincipalRef; # who minted the issuance
kind @1 :IssuanceKind;
expiresAtMs @2 :UInt64; # 0 = unbounded
}
enum IssuanceKind {
contactCap @0; # Self.contact -> ContactCap (cap form)
contactCode @1; # Self.contactCode -> bytes (code form)
inviteToken @2; # GroupMember.invite -> InviteToken (cap form)
inviteCode @3; # GroupMember.inviteCode -> bytes (code form)
speakerToken @4; # StageRoomAdmin.promoteToSpeaker -> SpeakerToken delivered via roster
groupAdminGrant @5; # GroupOwner.makeAdmin -> GroupAdmin delivered via Self.subscribeIncoming
channelPublisherGrant @6; # ChannelAdmin.makePublisher -> ChannelPublisher delivered via Self.subscribeIncoming
channelAdminGrant @7; # ChannelOwner.makeAdmin -> ChannelAdmin delivered via Self.subscribeIncoming
callHostGrant @8; # CallHost.promoteHost -> CallHost delivered via CallRosterDelta
e2eCallHostGrant @9; # E2ECallHost.promoteHost -> E2ECallHost delivered via CallRosterDelta
}
struct CapNodeInfo {
principal @0 :PrincipalRef;
role @1 :ChatNodeRole; # narrowed to the chat kind
# of the enclosing
# BranchInfo
}
# Per-chat-kind role discriminator inside lineage nodes. capnp
# generics narrow the ref type (`RefT`) but cannot narrow the
# role-union arm to match it (capnp has no dependent types).
# Documented invariant, enforced at the chat-server boundary:
# a `BranchInfo(GroupCapRef)` only emits the `group` arm, a
# `BranchInfo(ChannelCapRef)` only emits the `channel` arm.
# Consumers walking either tree may treat a mismatched arm as a
# chat-server implementation bug (return `unexpectedRoleKind`)
# rather than as caller-induced data.
struct ChatNodeRole {
union {
group @0 :GroupRole;
channel @1 :ChannelRole;
}
}
enum GroupRole {
owner @0;
admin @1;
member @2;
}
# `ChatRole` is retained as an alias for `GroupRole` for any
# audit / lineage prose that referred to "the chat role" without
# distinguishing Group from Channel (e.g. older descriptions of
# manifest-bundle entries). New schema methods use `GroupRole`
# or `ChannelRole` directly; do not introduce new uses of
# `ChatRole`.
using ChatRole = GroupRole;
enum ChannelRole {
owner @0;
admin @1;
publisher @2;
textSubscriber @3;
audioSubscriber @4;
videoSubscriber @5;
}
struct OperationNodeInfo {
operation @0 :TransformationOp;
initiator @1 :PrincipalRef; # caller-side admin that issued
consent @2 :OperationConsent; # who provided the second
# authority that authorized
# the graft
sourceTopicId @3 :Text; # may be empty for full-graft ops
targetTopicId @4 :Text;
}
# The two-cap proof consumed by chat-graph transformations is not
# always two admins. mergeIntoGroupAsTopic and moveTopicHere need
# the *other* group's admin role; extractTopicAsGroup needs the
# initiator's own Self cap (creation-quota authority), since the
# new group has no other-side admin yet. The variant tells audit
# UIs which authority shape was checked.
struct OperationConsent {
union {
partnerAdmin @0 :PrincipalRef; # mergeIntoGroupAsTopic /
# moveTopicHere: the
# other-group admin who
# consented in the same call
selfCreation @1 :PrincipalRef; # extractTopicAsGroup: the
# initiator's Self cap
# principal proving creation
# quota; same principal as
# `initiator` above
}
}
enum TransformationOp {
mergeIntoGroupAsTopic @0;
moveTopicHere @1;
extractTopicAsGroup @2;
}
enum BranchParentage {
manifestBundle @0;
publicJoin @1; # via DiscoverableGroupJoin.join()
invitedCap @2; # via Self.acceptInvite(token)
invitedCode @3; # via Self.acceptInviteCode(code)
ownerMint @4; # GroupOwner.makeAdmin / similar
transformation @5; # parented to a TransformationOp node
issuance @6; # this node IS an issuance entry
# (Self.contact, Self.contactCode,
# GroupMember.invite, inviteCode,
# StageRoomAdmin.promoteToSpeaker, etc.).
# The node's parent in the tree is its
# *issuer* (Self cap or role cap); the
# `issuance` tag distinguishes the node
# itself from a redeemed descendant.
}
interface GroupOwner extends(GroupAdmin) {
# Promote a member to admin. Same delivery shape as
# `GroupMember.invite` / `StageRoomAdmin.promoteToSpeaker`:
# chat-server records a *promotion issuance node* in the
# group's lineage tree (parented to the calling Owner cap)
# and delivers the freshly minted `GroupAdmin` cap to the
# promoted principal via that principal's `Self.subscribeIncoming`
# (`groupAdminGranted :GroupAdmin` arm), parented under the
# promotion node. The Owner gets back only an
# issuer-side `RolePromotionRevoker` (revokes the promotion --
# epoching the promoted GroupAdmin and any descendants the
# promotee minted) plus a non-secret `promotionRef
# :GroupCapRef` for `describeBranch` / `revokeBranch`. The
# caller does NOT receive the target's GroupAdmin cap; raw
# cross-principal cap delivery would violate
# `transfer_policy`.
makeAdmin @0 (memberRef :Data, perms :AdminPermissions)
-> (revoker :RolePromotionRevoker,
promotionRef :GroupCapRef);
setGroupPolicy @1 (policy :GroupPolicy) -> ();
# Discoverable join is always Member-typed. There is no
# `joinRole` argument because `DiscoverableGroupJoin.join()`
# is fixed to return `GroupMember` (admin / owner roles are
# minted via `GroupOwner.makeAdmin` (which produces a
# GroupAdmin, not an Owner -- new Owners come only from the
# manifest, `Self.startGroup`, or `extractTopicAsGroup`),
# never via
# public join). Removing the parameter eliminates the prior
# mismatch where `joinRole=admin` could be advertised but
# `.join()` would still mint only a member.
publishDiscoverable @2 (scope :ChatDirectoryScopeRef)
-> (entry :ChatDirectoryEntryHandle);
closePublicJoin @3 (entry :ChatDirectoryEntryHandle) -> ();
disband @4 () -> ();
}
# Issuer-held companion to a role-promotion. Parallel to
# InviteRevoker / SpeakerRevoker. Calling `revoke()` epochs the
# promoted role cap AND every descendant the promotee minted
# under it; the promoted principal falls back to whatever role
# they held before the promotion (the substrate does not auto-
# kick them from the chat). Promoter retains this revoker
# alongside the non-secret `promotionRef` for the cap-clean
# describeBranch / revokeBranch path.
interface RolePromotionRevoker {
describe @0 () -> (info :RolePromotionInfo);
revoke @1 () -> ();
}
# Bearer cap. Holding it lets the recipient call
# `Self.acceptInvite(token) -> GroupMember` (or
# `GroupMember.acceptInvite(token)` when joining via an existing
# group context). The token has NO revoke method -- bearers do
# not revoke their own invites. Revocation lives on the issuer's
# InviteRevoker cap.
interface InviteToken {
describe @0 () -> (info :InviteInfo);
}
# Issuer-held companion to InviteToken. The InviteRevoker is
# parented to the issuer's role cap in chat-server's lineage tree.
interface InviteRevoker {
describe @0 () -> (info :InviteInfo);
revoke @1 () -> ();
}
# ============================================================
# Channel (Telegram-strict: BROADCAST, not the generic word).
# Subscribers read; Publishers/Admins/Owner write. Subscribers
# do NOT extend ChatWriter -- the type system enforces RO at
# compile time.
# ============================================================
# Per-kind subscriber types. The interface IS the permission:
# a ChannelTextSubscriber holder cannot call subscribeAudio /
# subscribeVideo, regardless of runtime policy. Each variant
# composes only the readers it grants. Discovery yields the
# variant chat-server's configuration says applies to the
# scope's policy for this caller; the result type tells the
# caller exactly what they got.
interface ChannelTextSubscriber extends(ChatTextReader) {
unsubscribe @0 () -> ();
}
interface ChannelAudioSubscriber extends(ChatTextReader, ChatAudioReader) {
unsubscribe @0 () -> ();
}
interface ChannelVideoSubscriber extends(ChatTextReader, ChatAudioReader, ChatVideoReader) {
unsubscribe @0 () -> ();
}
# Publisher writes; lifecycle (close the whole channel) is NOT
# here. A non-admin publisher should be able to post but not
# tear down the channel. closeChannel lives on ChannelAdmin
# below.
interface ChannelPublisher extends(ChatWriter) {}
interface ChannelAdmin extends(ChannelPublisher) {
# Same delivery shape as `GroupOwner.makeAdmin`: chat-server
# records a promotion issuance node parented to the calling
# ChannelAdmin cap, delivers the freshly minted
# `ChannelPublisher` to the promoted principal via
# `Self.subscribeIncoming` (`channelPublisherGranted :ChannelPublisher`
# arm), and returns only the issuer-side revoker plus a
# non-secret promotionRef to the caller. Cross-principal
# role-cap delivery to the promoter is forbidden.
makePublisher @0 (subjectRef :PrincipalRef)
-> (revoker :RolePromotionRevoker,
promotionRef :ChannelCapRef);
removePublisher @1 (publisherRef :Data) -> ();
revokeBranch @2 (node :ChannelCapRef) -> ();
# Per-principal ban list (deny-list for FUTURE mints only).
# Same semantics as `GroupAdmin.banPrincipal`: `banPrincipal`
# only updates the broadcast Channel's deny-list; existing
# caps held by the principal are not epoched. Pair with
# `revokeBranch` for "kick + ban".
banPrincipal @3 (principalRef :PrincipalRef) -> ();
unbanPrincipal @4 (principalRef :PrincipalRef) -> ();
closeChannel @5 () -> (); # close the whole broadcast
# channel (not just the
# publisher's own stream)
# Lineage queries parallel to GroupAdmin. Same purpose: an
# admin needs `ChannelCapRef` handles to call `revokeBranch`
# for somebody else's publisher/subscriber chain, but the
# ChannelAdmin doesn't hold those caps. `describeBranch`
# accepts a known node ref and returns its subtree;
# `describeRoot` returns the whole channel lineage tree
# (truncated per policy); `lookupByPrincipal` returns refs
# for a given principal's caps in this channel. All
# read-only.
describeBranch @6 (node :ChannelCapRef) -> (info :BranchInfo(ChannelCapRef));
describeRoot @7 () -> (info :BranchInfo(ChannelCapRef));
lookupByPrincipal @8 (principalRef :PrincipalRef)
-> (refs :List(ChannelCapRef));
}
interface ChannelOwner extends(ChannelAdmin) {
# Same delivery shape as the `makePublisher` and
# `GroupOwner.makeAdmin` promotions: chat-server records a
# promotion issuance node, delivers the freshly minted
# `ChannelAdmin` to the promoted principal via
# `Self.subscribeIncoming` (`channelAdminGranted :ChannelAdmin` arm),
# and returns only the revoker plus promotionRef.
makeAdmin @0 (publisherRef :Data, perms :AdminPermissions)
-> (revoker :RolePromotionRevoker,
promotionRef :ChannelCapRef);
setChannelPolicy @1 (policy :ChannelPolicy) -> ();
publishDiscoverable @2 (scope :ChatDirectoryScopeRef)
-> (entry :ChatDirectoryEntryHandle);
closePublicJoin @3 (entry :ChatDirectoryEntryHandle) -> ();
}
# Reference to a node inside this broadcast Channel's lineage
# tree. Same shape as `GroupCapRef` but a distinct nominal type
# so a Group ref cannot be passed to `ChannelAdmin.revokeBranch`
# (and vice versa) at the type level. Names BOTH cap-bearing
# nodes (Channel{Owner,Admin,Publisher,*Subscriber}) AND any
# operation node a Channel might gain in the future. Opaque to
# the caller; chat-server resolves via its internal lineage table.
struct ChannelCapRef {
nodeRef @0 :Data;
}
# ============================================================
# Rooms within a Group. Three kinds: text topics, persistent
# voice rooms (Discord-style), broadcast stage rooms (Discord
# stage / Twitter Spaces). Per-room permission overrides are
# out of scope for the first slice (extensible via RoomPolicy).
# ============================================================
enum RoomKind {
textTopic @0;
voiceRoom @1;
stageRoom @2;
}
struct RoomInfo {
roomId @0 :Text;
kind @1 :RoomKind;
displayName @2 :Text;
topology @3 :CallTopology; # for voice/stage; ignored for text
capacity @4 :UInt32; # 0 = unbounded (per chat-server policy)
}
# Persistent voice room (always alive while the room exists).
# Joining means entering the call already in progress in this room.
interface VoiceRoom {
describe @0 () -> (info :VoiceRoomInfo);
subscribeRoster @1 (listener :CallRosterListener,
options :RosterSubscribeOptions)
-> (sub :Subscription);
describeRoster @2 () -> (snapshot :CallRosterSnapshot);
join @3 () -> (participant :CallParticipant);
}
# Stage room (broadcast voice within a Group). Subscribers listen;
# Speakers publish; admins promote a hand-raiser to speaker by
# minting a SpeakerToken (handed to the listener) plus a
# SpeakerRevoker (kept admin-side).
#
# StageRoom (member-reachable via GroupMember.stageRoom) does NOT
# carry promote authority -- ordinary members can listen, speak
# (with a token), and raise their hand, but cannot mint speaker
# tokens. Promotion lives on StageRoomAdmin, which is reached only
# through GroupAdmin (see below).
interface StageRoom {
describe @0 () -> (info :StageRoomInfo);
subscribeRoster @1 (listener :CallRosterListener,
options :RosterSubscribeOptions)
-> (sub :Subscription);
joinAsListener @2 () -> (participant :StageListener);
# On redemption, chat-server mints `StageSpeaker` with
# `parent = the SpeakerToken's lineage node`. The companion
# `SpeakerRevoker` therefore epochs both the unredeemed token
# AND any active StageSpeaker descendant; admin pulling the
# floor back kills live mic, not just future redemptions.
joinAsSpeaker @3 (token :SpeakerToken)
-> (participant :StageSpeaker);
raiseHand @4 () -> ();
}
# Admin-only stage facet. Reached via GroupAdmin.stageRoomAdmin
# (added to GroupAdmin earlier in the schema sketch); not
# obtainable from a plain GroupMember's stageRoom() accessor.
# `promoteToSpeaker` does NOT return the bearer SpeakerToken to
# the admin. Bound to listenerRef on the chat-server side and
# delivered directly to that listener via their existing
# StageRoom.subscribeRoster stream as a "you-are-now-a-speaker"
# event carrying the SpeakerToken cap reference. The admin keeps
# only the SpeakerRevoker. This avoids the cross-principal
# bearer-cap handoff problem (raw transfer is forbidden; chat
# events on the stage roster are the chat-server-mediated
# delivery path the substrate already provides).
interface StageRoomAdmin {
describe @0 () -> (info :StageRoomInfo);
promoteToSpeaker @1 (listenerRef :Data)
-> (revoker :SpeakerRevoker);
closeStage @2 () -> ();
}
interface StageListener extends(ChatTextReader, ChatAudioReader) {
leave @0 () -> ();
}
# Stage speakers are broadcast-voice only: no `publishVideo` and
# no `subscribeVideo` because the stage-room model has no video.
# Possession of `SpeakerToken` mints exactly this audio-only cap.
interface StageSpeaker extends(AudioCallParticipant) {
yieldFloor @0 () -> ();
}
# Bearer cap held by a hand-raised listener after promotion.
# Has NO revoke method -- the admin's promotion is undone via
# the issuer-held SpeakerRevoker, parallel to InviteToken/Revoker.
interface SpeakerToken {
describe @0 () -> (info :SpeakerTokenInfo);
}
interface SpeakerRevoker {
describe @0 () -> (info :SpeakerTokenInfo);
revoke @1 () -> (); # admin pulls the floor back
}
# ============================================================
# Ephemeral Call. Distinct from VoiceRoom: a Call has explicit
# start/end and lives within a chat (Group or DM). Use Call for
# "let's hop on a quick conference"; use VoiceRoom for "Discord
# voice channel always there". Both can coexist in a Group.
# ============================================================
interface CallSurface {
current @0 () -> (info :ActiveCallInfo); # may be empty
subscribeState @1 (listener :CallStateListener,
options :SubscribeOptions)
-> (sub :Subscription);
startCall @2 (config :CallStartConfig) -> (host :CallHost);
joinCall @3 () -> (participant :CallParticipant);
# Roster delivery for ad-hoc calls. Same shape as
# VoiceRoom.subscribeRoster / StageRoom.subscribeRoster, but
# bound to whatever ad-hoc call is currently active on this
# surface (or to the next call if none is active yet -- the
# subscription persists across start/end transitions of the
# surface's call until cancelled). This is the only delivery
# path for the cap-bearing roster variants
# (`hostGranted :CallHost`, `speakerGranted :SpeakerToken`),
# so a participant who needs to receive a host-promotion in
# an ad-hoc call must hold a Subscription minted here.
subscribeRoster @4 (listener :CallRosterListener,
options :RosterSubscribeOptions)
-> (sub :Subscription);
}
# Audio-only call participation facet. Lifts every call method
# that does not pull in video authority. Used by both the full
# A/V `CallParticipant` and the audio-only `StageSpeaker`.
# Stage rooms are broadcast voice (no stage video in the model),
# so a `SpeakerToken` redemption must mint a stage participant
# that does NOT expose `publishVideo` / `subscribeVideo` -- the
# split lives at the type level here.
interface AudioCallParticipant extends(ChatAudioReader) {
publishAudio @0 (format :AudioFormat) -> (track :AudioOut);
unpublishAudio @1 () -> ();
raiseHand @2 (raised :Bool) -> ();
setMyMuteState @3 (muted :Bool) -> ();
leave @4 () -> ();
}
# Full A/V plaintext participant. Adds video publish/unpublish on
# top of the audio facet, plus inherits subscribeVideo via
# `ChatVideoReader`. Returned by every Group plaintext call
# entry point: ad-hoc `CallSurface.startCall` / `joinCall`
# AND persistent `VoiceRoom.join` (group voice rooms are
# plaintext multi-party voice, so they share this cap shape).
# DM calls do NOT use this cap: they go through a separate
# `E2ECallSurface` that returns the cipher-only
# `E2ECallParticipant` (see the End-To-End Encrypted DMs section
# below) so the keyless-host invariant holds for DM media.
# `CallParticipant` must NOT be plumbed through any DM path.
# Text-during-call goes through the parent chat's
# `ChatTextWriter`, not through the call participant cap;
# that's why `ChatTextReader` is absent here.
interface CallParticipant extends(AudioCallParticipant, ChatVideoReader) {
publishVideo @0 (format :VideoFormat, purpose :VideoPurpose)
-> (track :VideoOut);
unpublishVideo @1 (purpose :VideoPurpose) -> ();
}
interface CallHost extends(CallParticipant) {
mute @0 (participantRef :Data) -> ();
unmute @1 (participantRef :Data) -> ();
eject @2 (participantRef :Data) -> ();
# Same cross-principal-cap-delivery rule as the chat
# role-promotion methods. The promoted participant is already
# listening on the call's roster subscription, so chat-server
# delivers the new `CallHost` cap to the bound participant via
# the existing `CallRosterDelta` stream
# (`hostGranted :CallHost` arm) rather than minting it back to
# the calling host. Caller keeps only the issuer-side
# `RolePromotionRevoker`. Parallels the SpeakerToken delivery
# pattern.
promoteHost @3 (participantRef :Data) -> (revoker :RolePromotionRevoker);
setRoutingMode @4 (mode :CallRoutingMode) -> ();
end @5 () -> ();
}
enum VideoPurpose { camera @0; screenShare @1; virtualScene @2; externalFeed @3; }
enum CallRoutingMode { sfu @0; mesh @1; mcu @2; }
enum CallTopology { peerToPeer @0; serverForwarded @1; serverMixed @2; }
interface CallRosterListener {
update @0 (delta :CallRosterDelta) -> ();
}
# Tagged union of roster events. Most variants carry plain data;
# `speakerGranted` carries a `SpeakerToken` cap, which is the
# substrate's only delivery path for the cross-principal bearer
# cap minted by `StageRoomAdmin.promoteToSpeaker(listenerRef)`.
# Delivery is listener-bound: chat-server only emits this variant
# to the roster subscription of the listener named in
# `listenerRef` -- other listeners on the same stage roster do
# NOT see this variant for that promotion. That listener then
# calls `StageRoom.joinAsSpeaker(token)` with the cap reference
# extracted from the delta.
struct CallRosterDelta {
union {
participantJoined @0 :ParticipantInfo;
participantLeft @1 :Data; # participantRef
muteChanged @2 :MuteUpdate;
activeSpeaker @3 :Data; # participantRef
handRaised @4 :HandRaiseUpdate;
screenShareStarted @5 :ScreenShareInfo;
screenShareEnded @6 :Data; # participantRef
connectionQuality @7 :QualityUpdate;
# Stage-specific cap-bearing variants.
speakerGranted @8 :SpeakerToken;
speakerRevoked @9 :Data; # participantRef
# Call-host promotion cap-bearing variants. Delivered
# listener-bound (only the listener named in
# `CallHost.promoteHost(participantRef)` /
# `E2ECallHost.promoteHost(participantRef)` sees the
# variant; other roster subscribers do NOT). Parallels the
# speakerGranted pattern.
hostGranted @10 :CallHost;
e2eHostGranted @11 :E2ECallHost;
hostRevoked @12 :Data; # participantRef
}
}
# The substrate is RECORDING-BLIND -- there is no "recording
# state" field, no "recording started" delta, and no
# protocol-level recording authority. Whoever holds a
# participant cap may locally record what they receive; a
# "shared recording" of a meeting is modeled by inviting a
# recorder principal into the call as a regular participant.
# Discovery surface owned by chat-server. Each session holds a
# ChatDirectory cap (or none) according to chat-server config.
# Search-based, not list-based: scopes can grow large, and the
# results visible to a session depend on chat-server policy that
# tests the calling session's identity. The unbounded "give me
# everything" shape is wrong; the right shape is "give me the
# entries matching this query, bounded".
#
# Note: this is *not* the filesystem `Directory` cap defined in
# `storage-and-naming-proposal.md`. The two interfaces share the
# dictionary meaning of "directory" (an enumerable namespace) but
# nothing else: filesystem `Directory` opens files; chat
# `ChatDirectory` returns join handles for chats. The
# names are deliberately disambiguated.
interface ChatDirectory {
search @0 (query :ChatDirectoryQuery)
-> (page :ChatDirectoryPage);
describe @1 () -> (info :ChatDirectoryScopeInfo);
}
struct ChatDirectoryQuery {
namePattern @0 :Text; # optional substring/glob
chatKind @1 :ChatKind; # optional kind filter
ownerKind @2 :PrincipalKind; # optional principal-kind filter
limit @3 :UInt32; # bounded page size; chat-server
# may further clamp
cursor @4 :Data; # opaque pagination cursor
# returned by a previous search
}
struct ChatDirectoryPage {
entries @0 :List(ChatDirectoryEntry);
nextCursor @1 :Data; # empty when no more pages
}
struct ChatDirectoryEntry {
chatInfo @0 :ChatInfo;
# Each entry carries a kind-specific join cap. The interface IS
# the permission: a Group entry hands you a DiscoverableGroupJoin
# whose .join() returns GroupMember, a Channel entry hands you
# one of the per-kind subscribe caps whose .subscribe() returns
# the matching subscriber. A caller never has to downcast.
union {
groupJoin @1 :DiscoverableGroupJoin;
channelTextSubscribe @2 :DiscoverableChannelTextSubscribe;
channelAudioSubscribe @3 :DiscoverableChannelAudioSubscribe;
channelVideoSubscribe @4 :DiscoverableChannelVideoSubscribe;
}
}
# Possessing one of these caps IS the policy gate. Calling the
# join/subscribe method mints a fresh role cap parented to the
# per-call join event (a fresh chain root in chat-server's lineage
# tree) -- not parented to this discoverable cap itself. So
# revoking one joiner's branch leaves siblings intact, and closing
# the discoverable route epochs the discoverable cap class without
# touching existing members.
interface DiscoverableGroupJoin {
join @0 () -> (member :GroupMember);
}
# Each Channel directory entry yields a per-kind subscribe cap so
# the result type tells the caller exactly which media they may
# read. chat-server config decides which variant fits the calling
# session's policy.
interface DiscoverableChannelTextSubscribe {
subscribe @0 () -> (subscriber :ChannelTextSubscriber);
}
interface DiscoverableChannelAudioSubscribe {
subscribe @0 () -> (subscriber :ChannelAudioSubscriber);
}
interface DiscoverableChannelVideoSubscribe {
subscribe @0 () -> (subscriber :ChannelVideoSubscriber);
}
# ============================================================
# DM (host plaintext-aware text; host-blind A/V) and E2E DM
# (host-blind everything).
#
# DmPeer extends only ChatTextWriter, NOT full ChatWriter. The
# plaintext audio/video write methods (openAudioOut /
# openVideoOut) and the plaintext audio/video subscribe methods
# (subscribeAudio / subscribeVideo from ChatAudioReader /
# ChatVideoReader) are absent at the type level. All DM media
# flows through `callSurface() -> E2ECallSurface` only -- the
# SFU-forward-only end-to-end-encrypted call surface. A
# plaintext-text DM cannot accidentally route media through a
# host-readable plaintext path because no method to do so
# exists on the cap.
# ============================================================
interface DmPeer extends(ChatTextWriter) {
remoteFingerprint @0 () -> (info :PeerFingerprint);
# DM calls are ALWAYS end-to-end encrypted, even when the DM
# text is not. chat-server forwards encrypted media; key
# exchange (DTLS-SRTP or equivalent) runs between the two peers
# at call start.
callSurface @1 () -> (calls :E2ECallSurface);
closeDm @2 () -> ();
}
# Each principal holds a Self cap that lets them produce a contact
# cap, accept incoming invites, accept incoming DMs, revoke contact
# caps they issued, and start new groups (subject to chat-server
# config-gated quota per principal class).
interface Self {
# Cap-form contact issuance. Returns BOTH the bearer
# `ContactCap` (handed via chat-server-mediated cap delivery to
# whoever should be able to DM the issuer) AND a stable
# `ContactCapRef` -- a non-secret, issuer-side handle the issuer
# keeps so they can later call `revokeContact(ref)`. Without a
# separate handle the issuer would have to retain the bearer
# cap itself to revoke it, and bearer caps go to the recipient.
contact @0 (lifetime :UInt64)
-> (contact :ContactCap, ref :ContactCapRef);
# Code-form contact issuance. Returns BOTH the BEARER-SECRET
# `code` bytes (suitable for paper / QR / out-of-band handoff;
# any holder plus a Self cap can redeem via openDmFromCode /
# openE2EDmFromCode) AND a stable `codeId` -- the non-secret
# issuer-side handle for `revokeContactCode(codeId)`. The
# `code` bytes embed the codeId so chat-server can find the
# issuance lineage node without exposing the secret in the
# revocation API. Treat the `code` with bearer-secret hygiene:
# do not log, do not include in transcripts, prefer bounded
# lifetimes, rate-limit redemption attempts. The codeId is a
# plain identifier safe to store in audit logs.
contactCode @1 (lifetime :UInt64)
-> (code :Data, codeId :Data);
revokeContact @2 (ref :ContactCapRef) -> ();
revokeContactCode @3 (codeId :Data) -> ();
openDm @4 (contact :ContactCap) -> (peer :DmPeer);
openE2EDm @5 (contact :ContactCap) -> (peer :E2EDmPeer);
# Out-of-band redemption paths. Take Data, not a cap, because
# paper/QR handoff cannot produce a cap when raw bearer
# transfer is forbidden by `transfer_policy`. The bytes are
# *bearer secrets* that name a distinct lineage node in
# chat-server's tree (the issuance entry created by
# `Self.contactCode` / `GroupMember.inviteCode`). chat-server
# consumes the code byte-for-byte, validates it against that
# lineage node, and mints the derived role/peer cap with
# `parent = the code's lineage node` -- NOT directly with
# parent = the issuer's role cap. So `Self.revokeContactCode`
# and the invite-code's `InviteRevoker` epoch only that
# specific code's descendants.
openDmFromCode @6 (code :Data) -> (peer :DmPeer);
openE2EDmFromCode @7 (code :Data) -> (peer :E2EDmPeer);
acceptInvite @8 (token :InviteToken) -> (member :GroupMember);
acceptInviteCode @9 (code :Data) -> (member :GroupMember);
startGroup @10 (config :GroupCreateConfig) -> (owner :GroupOwner);
describe @11 () -> (info :SelfInfo);
# Inbound-DM notification surface. When some other principal
# opens a DM to this Self via `openDm` / `openDmFromCode` /
# `openE2EDm` / `openE2EDmFromCode`, chat-server delivers the
# other side's peer cap (`DmPeer(self->other)` /
# `E2EDmPeer(self->other)`) here so the receiving principal
# can subscribe and reply. Listener is minted by the receiver
# and carries the same lifetime as any other listener cap
# (drop / Subscription.cancel revokes locally). The listener
# also fires for redeemed code-form DMs (so the issuer learns
# who claimed a `contactCode` they handed out) and for new
# group invites accepted via `Self.acceptInvite` /
# `acceptInviteCode` if the issuer subscribes -- the typed
# event lets the issuer attribute incoming chains to the
# specific contact / invite they issued.
subscribeIncoming @12 (listener :SelfIncomingListener,
options :SubscribeOptions)
-> (sub :Subscription);
}
# Listener for chat-server-mediated cap deliveries TO a Self.
# Chat-server fires `delivered` once per inbound peer / member
# cap; the listener's owning principal extracts the cap and
# decides what to do with it (subscribe, archive, ignore, etc.).
interface SelfIncomingListener {
delivered @0 (event :SelfIncomingEvent) -> ();
}
# Tagged union of inbound chat-server-mediated deliveries.
# `kind` discriminates the delivery flavour; `source` identifies
# WHICH issuance the delivery is parented under so the issuer
# can attribute the event to a specific contact / code / invite
# they handed out, drive a UI ("Bob just opened a DM via the
# contactCode I posted last week"), or call the matching
# revoke method.
#
# Cross-principal cap delivery rule: dmOpened / e2eDmOpened
# carry the *receiver's* peer cap (the listener owner is the
# contact issuer; the chat-server-minted cap belongs to that
# same principal, so this is NOT cross-principal delivery).
# inviteAccepted is the inviter notification arm. It carries
# *no live cap*: the issuance is identified by the envelope's
# `source.inviteRef :GroupCapRef` (the inviter already holds
# this from their original `GroupMember.invite(...)` tuple),
# and the redeemed branch is identified by
# `InviteAcceptedNotice.acceptedRef :GroupCapRef` (a NEW ref
# naming the redeemed `GroupMember` lineage node, distinct
# from the issuance node). Keeping the two refs distinct lets
# the inviter both attribute the event to its issuance entry
# AND drive `GroupAdmin.describeBranch(acceptedRef)` /
# `revokeBranch(acceptedRef)` on the specific redeemed member
# without conflating it with the issuance node.
# inviteOffered is the *invitee* notification arm and carries
# the InviteToken cap chat-server re-mints for the invitee
# under the original issuance node (same lineage rule as the
# chat-event delivery path), so the invitee can call
# Self.acceptInvite(token) -> GroupMember.
struct SelfIncomingEvent {
receivedAtMs @0 :UInt64;
source @1 :IssuanceSource; # which issuance the
# delivery is parented
# under
union {
dmOpened @2 :DmPeer;
e2eDmOpened @3 :E2EDmPeer;
inviteOffered @4 :InviteToken;
inviteAccepted @5 :InviteAcceptedNotice;
# Role-promotion delivery arms. Chat-server fires one of
# these on the promoted principal's Self listener after
# `GroupOwner.makeAdmin` / `ChannelAdmin.makePublisher` /
# `ChannelOwner.makeAdmin`. The cap is parented under the
# promotion issuance node (a chat-server-owned lineage
# entry); revoking via the issuer's
# `RolePromotionRevoker` epochs the cap delivered here.
groupAdminGranted @6 :GroupAdmin;
channelPublisherGranted @7 :ChannelPublisher;
channelAdminGranted @8 :ChannelAdmin;
# Listener-bound delivery of a fresh GroupMember cap to a
# principal auto-grafted into a group by mergeIntoGroupAsTopic
# / moveTopicHere / extractTopicAsGroup. The cap is parented
# under the transformation operation node; revoking via the
# entire-graft path (`revokeBranch(transformationRef)`)
# epochs every grafted cap.
transformationGrafted @9 :GroupMember;
}
}
# Typed identifier for the issuance an incoming delivery is
# parented under. Lets a listener match an event to the
# specific issuance call that produced the delivery (contact /
# code / invite / role promotion). capOS sends the variant
# that fits the delivery flavour: contact-cap deliveries carry
# `contactRef`, code redemptions carry `codeId`, invite
# deliveries carry `inviteRef`, group role-promotion
# deliveries carry `groupPromotionRef`, channel role-promotion
# deliveries carry `channelPromotionRef`.
struct IssuanceSource {
union {
contactRef @0 :ContactCapRef;
codeId @1 :Data;
inviteRef @2 :GroupCapRef;
groupPromotionRef @3 :GroupCapRef;
channelPromotionRef @4 :ChannelCapRef;
transformationRef @5 :GroupCapRef; # mergeIntoGroupAsTopic /
# moveTopicHere /
# extractTopicAsGroup
# operation node
}
}
# Inviter-side notification when the invitee redeems a
# previously-issued InviteToken / inviteCode. Carries no live
# bearer cap (the redeemed `GroupMember` belongs to the
# invitee, and `transfer_policy` forbids handing it to the
# inviter); instead carries the issuance ref the inviter
# already holds (`source.inviteRef` on the enclosing
# `SelfIncomingEvent`) plus the redeemed branch's
# `acceptedRef :GroupCapRef` so the inviter can call
# `GroupAdmin.describeBranch(acceptedRef)` /
# `revokeBranch(acceptedRef)` if needed.
struct InviteAcceptedNotice {
invitee @0 :PrincipalRef;
acceptedRef @1 :GroupCapRef; # the redeemed GroupMember
# branch root in the
# group's lineage tree
}
# Issuer-held, non-secret revocation handle returned alongside a
# bearer `ContactCap` from `Self.contact()`. Opaque to the
# caller; chat-server uses it to look up the contact's issuance
# lineage node so `Self.revokeContact(ref)` can epoch that node
# and any DmPeer / E2EDmPeer chains parented under it. Unlike
# the bearer `code` returned by `Self.contactCode`, this handle
# is safe to log in audit, persist in the issuer's "contacts I
# issued" UI list, etc. Distinct from `GroupCapRef` to avoid
# accidentally reusing the same opaque ref across different
# substrates' revocation surfaces.
struct ContactCapRef {
refId @0 :Data; # chat-server-internal handle id
}
# ============================================================
# Group lifetime policy + creation config. A Group is persistent
# by default; ephemeral variants auto-disband when their lifetime
# trigger fires. The substrate exposes lifetime as a Group-level
# property; topics and rooms inherit the parent group's lifetime.
# ============================================================
struct GroupLifetime {
union {
persistent @0 :Void;
ephemeralOnEmpty @1 :Void; # auto-disband when no member is
# present in any room of the
# group (text idle + voice idle
# + stage idle), not just when
# the roster goes empty
deadline @2 :UInt64; # absolute disband time, ms since epoch
ephemeralOnIdle @3 :UInt64; # disband after N ms with no activity
}
}
struct GroupCreateConfig {
displayName @0 :Text;
lifetimePolicy @1 :GroupLifetime;
initialInvites @2 :List(ContactCap); # ocap-clean: must already
# have ContactCap for each
# invitee. NO cold-call admit.
}
# ============================================================
# Chat-graph transformations. Every transformation that crosses
# group boundaries is a TWO-CAP operation: caller proves authority
# on one side, receiver-of-method on the other. chat-server
# validates both before mutating its internal lineage tree.
# ============================================================
enum MergeMemberPolicy {
autoInvite @0; # mint fresh GroupMember(target) for source
# members not already in target; deliver
# listener-bound to each principal via
# `Self.subscribeIncoming`
# (`transformationGrafted :GroupMember`
# arm, `source.transformationRef` carrying
# the operation node's `GroupCapRef`,
# whichever transformation invoked the
# policy: mergeIntoGroupAsTopic /
# moveTopicHere / extractTopicAsGroup).
# The source-group event stream only
# carries non-cap "you have been grafted"
# presence; cap delivery stays
# per-recipient.
dropNonMembers @1; # source members not in target lose access
}
# Methods added to Group role caps for lifetime + transformations.
# Real capnp doesn't have `extend X { add methods }` syntax; these
# methods are appended to the existing GroupOwner / GroupAdmin
# interfaces declared earlier in this schema sketch. Shown here in
# their own block for readability.
#
# GroupOwner (in addition to its existing methods) gains:
#
# setLifetimePolicy @100 (policy :GroupLifetime) -> ();
# # Promote ephemeral -> persistent or set a new ephemeral
# # trigger. Same group identity, same caps stay valid; only
# # the auto-disband watcher changes.
#
# mergeIntoGroupAsTopic
# @101 (target :GroupAdmin,
# topicId :Text,
# memberPolicy :MergeMemberPolicy)
# -> (topic :ChatWriter);
# # `this` group becomes a topic under `target` group. The caller
# # must hold both the source GroupOwner cap (this) and the
# # target GroupAdmin cap (passed as argument). Source members
# # not already in target are handled per `memberPolicy`. Source
# # role caps go stale (or transparently re-bind; see Open
# # Question).
#
# GroupAdmin (in addition to its existing methods) gains:
#
# moveTopicHere
# @100 (sourceGroupAdmin :GroupAdmin,
# sourceTopicId :Text,
# destinationTopicId :Text,
# memberPolicy :MergeMemberPolicy) -> ();
# # Move topic from source to destination (this) group. Caller
# # holds destination admin via `this`; sourceGroupAdmin proves
# # authority on the source group.
#
# extractTopicAsGroup
# @101 (topicId :Text,
# lifetime :GroupLifetime,
# displayName :Text,
# creator :Self)
# -> (owner :GroupOwner);
# # Inverse: pull a topic out of `this` group into a brand-new
# # standalone Group. The `creator` Self cap proves the calling
# # principal has group-creation authority; chat-server's
# # `Self.startGroup` policy applies here too (so a guest who
# # cannot create groups cannot bypass the quota by extracting
# # a topic). Caller becomes Owner of the new group; topic
# # members auto-migrate as Members, parented to the extract
# # operation.
# A contact cap is a chat-server-issued cap that says "any holder
# may open a DM to the issuing principal." The issuer can revoke at
# any time. Contact caps may be public (broadly shared) or narrow
# (handed to one specific principal); both shapes are the same cap
# kind, the difference is in how the issuer chose to share it.
interface ContactCap {
describe @0 () -> (info :ContactInfo);
}
# Listener-side. Held by the receiver; minted locally.
interface Subscription { cancel @0 () -> (); }
interface TextListener { post @0 (event :ChatInboundEvent) -> (); }
interface AudioSink { frame @0 (meta :AudioFrameMeta) -> (); }
interface VideoSink { frame @0 (meta :VideoFrameMeta) -> (); }
# Outgoing media. Flow-controlled via `-> stream`.
interface AudioOut {
writeFrame @0 (meta :AudioFrameMeta) -> stream;
close @1 ();
}
interface VideoOut {
writeFrame @0 (meta :VideoFrameMeta) -> stream;
close @1 ();
}
enum ChatPayloadKind {
text @0;
presence @1; # joined / left / typing / status
reactionRef @2; # reference to another event id
approvalRef @3; # reference to an ApprovalGrant; payload is the
# grant's audit-safe descriptor, not the grant
attachment @4; # see AttachmentDescriptor
custom @5; # service-defined; opaque to the substrate
}
struct ChatOutboundEvent {
kind @0 :ChatPayloadKind;
text @1 :Text; # optional, for kind=text and convenience
data @2 :Data; # optional structured payload
inReplyTo @3 :Data; # optional event id
redactionClass @4 :Text;# audit redaction class
}
struct ChatInboundEvent {
eventId @0 :Data;
chatId @1 :Text; # opaque per-chat identifier; renamed
# from the earlier `channel` field
# because "channel" is reserved for
# Telegram-style broadcast Channels.
# Holds equally for Groups, broadcast
# Channels, and DMs.
sender @2 :Text; # disclosure-policy-redacted display name
kind @3 :ChatPayloadKind;
text @4 :Text;
data @5 :Data;
inReplyTo @6 :Data;
receivedAtMs @7 :UInt64;
}
Notes:
ChatEvent(the existing struct incapos.capnp) becomesChatInboundEvent. Listener caps replacepoll, butpollmay stay as a deprecated, transport-stopgap method during the capnp-rpc migration.AudioFrameMeta/VideoFrameMetacarry timestamps, codec hints, and a ring-buffer slot reference. Frame bodies live inMemoryObject-backed rings shared between the producer and consumer.approvalRefis the only tie between this proposal and the approval surface: it lets an approval request appear in a chat as a structured message that links to anApprovalGrantcap. The grant cap travels by capnp-rpc cap reference, not as bytes inside the message data.
WebRTC Mapping
Browser-side participants use WebRTC. The trusted Rust backend (or a capOS-side WebRTC adapter the gateway delegates to) implements the peer at the capOS end. The mapping is symmetric enough that no additional abstraction layer is needed in either direction.
| Chat substrate | WebRTC equivalent | Notes |
|---|---|---|
subscribeText(listener) + send(event) | RTCDataChannel (reliable, ordered) | Text events are JSON view models on the HTTP path; the WebRTC data channel may carry the same JSON for browser peers that want lower-latency text without HTTP polling. |
openAudioOut, subscribeAudio(sink) | RTCPeerConnection audio track (addTrack, ontrack) | Codec negotiation via SDP; capOS-side adapter exposes the agreed AudioFormat. |
openVideoOut, subscribeVideo(sink) | RTCPeerConnection video track | Same as audio with codec/resolution negotiation. |
postAttachment(descriptor) | RTCDataChannel reliable chunk transfer or HTTP file fetch | Bounded attachments only; large transfers go through a separate File/ContentStore cap. |
presence payload kind | RTCPeerConnection connectionstatechange events + custom data-channel messages | capOS surfaces presence as ChatInboundEvent kind=presence. |
approvalRef payload kind | data channel message with structured payload | The approval cap stays on the capnp-rpc side; the data channel only carries the audit-safe descriptor. |
| ICE / SDP negotiation | gateway endpoint /api/chat/webrtc/* | Browser sends offer; backend produces answer; ICE candidates traded via the same endpoint. The HTTP endpoint runs on whatever userspace TCP listener cap the trusted Rust backend already holds via Networking – chat-server itself never opens a socket. The browser never receives capOS caps through this path – only WebRTC handles. |
| DTLS / SRTP keys | WebRTC default | DTLS / SRTP key material lives inside the WebRTC peer endpoint and never crosses to chat-server; chat-server forwards already-protected frames. TLS for the browser ↔ backend signalling channel is configured separately, composed from the certificate/trust/TLS-context caps in Certificates and TLS on top of the userspace networking surface above. |
The gateway boundary stays the same: the browser receives WebRTC handles and view models. The trusted backend holds the chat cap, the listener caps, the media rings, and the WebRTC peer connection. No capOS authority object crosses to the browser.
Approvals Stay Separate
Approvals are a different surface from “may I write to you”. They
already have a designed capability: ApprovalClient / ApprovalGrant
(shell-proposal.md:407-427, also referenced in
user-identity-and-policy-proposal.md:812). Per-tool permission modes
are defined in llm-and-agent-proposal.md:105-114
(auto|consent|stepUp|forbidden). The remote CapSet UI’s
“action-approval queue” is the canonical UI surface
(remote-session-capset-client-proposal.md § UI Scope And Architecture).
What ApprovalClient is for: a principal that already has authority
to attempt some action wants confirmation before exercising it (or
the policy engine demands a step-up). Examples: agent runtime asks
the operator before invoking a consent-mode tool; a destructive
operation needs WebAuthn step-up; a queued write awaits
human-in-the-loop sign-off.
What ApprovalClient is not for: cold-call admission. There is no
flow where principal A asks the system “may I please write to B”.
That request requires a cap A does not have. The substrate’s answer
is: B issues a contact cap (via Self.contact()) or invites A to a
shared Group via GroupMember.invite(...) (or, if B holds the
broadcast Channel role, ChannelAdmin.makePublisher(...)).
Without an existing
cap from B’s chain, A has no protocol-level path. See
“Capability Granting” above.
Chat ties to ApprovalClient in exactly one place: an approvalRef
payload kind lets a chat thread display an approval request as a
structured message linking to a live ApprovalGrant cap. The grant
cap travels by capnp-rpc cap reference; the bytes inside the message
data carry only an audit-safe descriptor. The grant state machine,
the broker call, the policy check, the step-up mechanics, and the
audit trail all remain on the existing ApprovalClient /
AuthorityBroker.request path.
Approvals-side gaps that are still open (and tracked separately in
docs/tasks/README.md):
- Detailed
ActionPlanandCapRequestschema. Both are referenced in the existingApprovalClientsketch but not fully specified. - Durable approval queue / inbox shape. Today the flow is
synchronous (
ApprovalClient.requestreturns a grant cap directly); the remote CapSet UI’s queue surface implies persistence and listing. A queue cap layered on top ofApprovalClient(e.g.ApprovalQueue.list() -> List(Pending),next() -> ApprovalGrant) is a natural follow-up.
These should land in a follow-up update to shell-proposal.md /
user-identity-and-policy-proposal.md, not in this Chat proposal.
Chat Categories
Telegram-aligned naming. Three concrete chat categories plus an E2E
variant of DMs. Distinct cap types because they have distinct
authorities; all of them sit on top of the unified
ChatEndpoint / ChatWriter base interfaces.
- Group – multi-participant, two-way. Has an Owner, zero-or-more
Admins, and Members. Supports nested rooms of three kinds:
text topics (sub-channels for text), voice rooms (Discord-style
persistent always-on voice rooms), stage rooms (Discord-stage /
Twitter-Spaces broadcast voice within the group with raise-hand to
speak). Per-room permission overrides are out of scope for the
first slice;
RoomPolicyleaves the door open. - Channel (Telegram-strict: BROADCAST) – read-only for subscribers. Owner/Admin/Publisher post; Subscribers receive only. Useful for system announcements, agent status feeds, log streams, one-to-many broadcasts.
- DM – two-participant chat. No group-level role hierarchy.
Each peer holds an asymmetric
DmPeercap. - E2E DM – two-participant DM where the chat host carries
ciphertext only. Distinct cap layer (
E2EDmPeer) because key exchange, AEAD, forward-secrecy ratchets, and out-of-band fingerprint verification are concerns the unencrypted DM does not have. See “End-To-End Encrypted DMs” below.
In addition, both Groups and DMs expose an ephemeral Call surface for voice/video conferences – but with a kind-specific narrowing:
- Groups use
GroupMember.callSurface() -> CallSurfacefor multi-party calls;CallSurface.startCallallowssetRoutingMode(sfu / mesh / mcu) so server-side mixing is available when text/audio aren’t end-to-end-encrypted. - DMs (both plain
DmPeerandE2EDmPeer) usecallSurface() -> E2ECallSurface– the SFU-forward-only surface with nosetRoutingMode. Direct calls between two principals are end-to-end-encrypted at the media layer regardless of whether DM text is host-readable.
A Call has explicit start/end, distinct from the persistent VoiceRoom: use Call for “let’s hop on a quick conference”, use VoiceRoom for “Discord voice channel always there”.
There is no special “system room” category. A system-managed
chat is just a chat whose Owner principal is a service principal or
a designated admin principal. capOS already treats services as
principals (PrincipalKind.service in
user-identity-and-policy-proposal.md:91-98); a service-owned chat
applies the same role/lineage rules as any other.
Naming convention. The unqualified word “channel” in this
proposal refers only to the broadcast category (Telegram-style
Channel). Anything generic – a stream of events, a subscription
target, an A/V flow – is called a chat (the substrate-level
term). Base interfaces use the Chat prefix (ChatEndpoint,
ChatWriter, ChatDirectory, ChatInfo, ChatKind); concrete
roles use the category prefix (Group*, Channel*, Dm*).
Substrate is recording-blind. No protocol-level “start recording” / “consent to recording” / “recording state” surface exists. Server-side recording with consent is consent theater anyway – a phone next to the speakers or a screen recorder on the recipient’s own device defeats it instantly. Recording is purely a client-side concern: whoever holds a participant cap may locally record bytes they receive. A “shared meeting recording” is modeled by inviting a recorder principal into the call – it shows up in the roster like any other participant, the social contract carries the rest.
Lifetime And Transformations
Groups have a lifetime policy chosen at creation, and the chat graph supports a small set of structure-preserving transformations.
Group lifetime
GroupLifetime is one of:
persistent(default): the group lives until an owner callsdisband()or transforms it into something else. Manifest-created groups default to persistent.ephemeralOnEmpty: chat-server auto-disbands when the last member leaves. “Spin up a quick chat with these three people; it goes away when everyone closes the tab.”deadline: chat-server auto-disbands at an absolute time. “This pickup-call thread auto-archives Friday at 17:00.”ephemeralOnIdle: chat-server auto-disbands after N ms with no message activity. “Self-cleanup if nobody says anything for an hour.”
Owners can change the policy at runtime via setLifetimePolicy.
Going from ephemeral to persistent is “promote this ephemeral chat
to a permanent one”; the same group identity persists, no caps
rotate, no auto-invite happens. Going the other way (persistent ->
ephemeral) is also valid – the auto-disband watcher just starts.
Lifetime applies at the Group level. Topics and rooms inherit the parent group’s lifetime; they don’t have separate auto-disband clocks. This is the right scope: rooms are sub-spaces of a group, not independent chats.
For DMs the same GroupLifetime shape can be reused (an
ephemeralOnIdle DM is the natural shape for “self-destructing
chat” if you ever want it), via a lifetime field on the
Self.openDm config. Out of scope for this slice; the schema
leaves room.
Ad-hoc group creation
Self.startGroup(config :GroupCreateConfig) -> (owner :GroupOwner)
lets any principal whose Self cap permits it create a new group.
chat-server policy gates this per principal class – operators
typically have a creation quota; guests/anonymous don’t have
Self.startGroup at all (cap absent from their bundle).
Initial invitees are passed as a List(ContactCap). This is the
ocap-clean rule: you can only invite people you already have a
ContactCap for. No cold-call admit. Want to spin up a Group with
strangers? You can’t; you have to first arrange contact via
existing channels (someone vouches by sharing your contact card,
you publish a public ContactCap, etc.).
Each initial invite is delivered through the existing Self
notification surface of the invitee, who can Self.acceptInvite
to join. If invites are declined, the group still exists with
just the creator as Owner.
Transformations
Three structural mutations of the chat graph, each a two-cap operation: the caller proves authority on one side; the receiver of the method (i.e. the cap-self) proves authority on the other. chat-server validates both before mutating its lineage tree.
Promote ephemeral to persistent. GroupOwner.setLifetimePolicy({persistent}).
Single-cap (just the Owner of the ephemeral group). No member
migration; same caps stay valid.
Merge a group into another as a topic. GroupOwner of the
source calls mergeIntoGroupAsTopic(target :GroupAdmin, topicId, memberPolicy). After success:
- Source group ceases to exist as a top-level group; its identity
becomes a topic under
target. - Source members not already members of
targetare handled permemberPolicy:autoInvitemints freshGroupMember(target)caps for them (parented to the merge operation), and chat-server delivers each cap LISTENER-BOUND to the recipient principal via that principal’sSelf.subscribeIncoming– thetransformationGrafted :GroupMemberarm, withsource.transformationRefcarrying the merge-opGroupCapRef. The fan-out source-group event stream only carries non-cap presence (a “you have been grafted into target via merge” notice) so cap delivery stays on the listener-bound surface required bytransfer_policy. The alternativedropNonMemberslets the source caps go stale without minting new ones. - The merge operation is a node in chat-server’s lineage tree; every cap minted as part of it is parented to that node, so “revoke everything that came in via this merge” is one operation.
Move a topic between groups. GroupAdmin.moveTopicHere(sourceGroupAdmin, sourceTopicId, destinationTopicId, memberPolicy). Same two-cap
shape: caller’s this is the destination admin; sourceGroupAdmin
is the source. Topic members not in the destination are handled
per memberPolicy. The topic-as-namespace identity moves; the
topic’s history (text events, attachments) carries over.
Extract a topic into a standalone group.
GroupAdmin.extractTopicAsGroup(topicId, lifetime, displayName, creator :Self). Inverse of merge – but unlike the
single-extract-cap shape that would let any group admin mint a
top-level Group regardless of group-creation authority, this
method takes a creator :Self cap as a second argument.
chat-server applies the same policy it applies to
Self.startGroup (per principal class quota, ban-list checks,
etc.) to the calling principal before minting the new
GroupOwner. A guest or admin who is not allowed to create
groups cannot bypass the quota by extracting a topic. Caller
becomes Owner of the new group; topic members auto-migrate as
Members; their caps are parented to the extract operation.
Authority rules
All three cross-group operations share these invariants:
- Two-cap proof. Methods that move structure across groups
take the other authority as an argument. For
mergeIntoGroupAsTopic/moveTopicHerethat’s the other group’s admin role cap (thepartnerAdminarm ofOperationConsentin lineage queries). ForextractTopicAsGroupthere is no other-side group yet, so the second authority is the initiator’s ownSelfcap proving group-creation quota (theselfCreationarm ofOperationConsent); chat-server applies the same per-principal quota / ban-list checks it applies toSelf.startGroupbefore minting the newGroupOwner. chat-server rejects withincompatibleChatKindif the cross-group caps reference chats with incompatible kind/policy (e.g. you can’t merge an E2E DM into a non-E2E group). - Lineage continuity. The transformation operation is itself
a node in chat-server’s tree (
OperationNodeInfoarm ofLineageNodereturned bydescribeBranch); new caps minted as part of it recordparent = the operation(thetransformationarm ofBranchParentage). Both entire-graft revocation (revokeBranch(operationNodeRef)) and per-member revocation (revokeBranch(memberCapRef)) work, and either ref kind passes through the sameGroupCapRefenvelope. - No cold-call sneak path.
autoInvitelooks like it might be a way to drag people into a group they didn’t agree to, but it requires both the source-group owner (who has authority over those members because they’re already in the source group) AND the target-group admin (who has authority to admit) to consent in the same call. A single party can never drag people into a group on their own; the two-cap pattern is the consent.
Lifetime interaction with conferencing
A subtle thing worth flagging: ephemeralOnEmpty interacts
oddly with VoiceRooms. If a Group has a VoiceRoom and the last
text-chat member leaves but two people are still connected to
the voice room, the group should not auto-disband. Definition:
“empty” means “no member is present in any room of the group”
– text idle, voice idle, stage idle. Detail for the
implementation iteration.
A merged-into-topic source group’s lifetime policy does not
survive the merge. The topic now lives under the target group’s
lifetime; if the source was on ephemeralOnIdle and the target
is persistent, the topic becomes persistent. Worth surfacing
in the merge confirmation UX. Substrate behavior:
lifetimePolicy is a Group-level field; topics inherit.
Cap continuity at the holder (Open Question)
When a group merges into another as a topic, members hold caps that used to mean “send to the top of source group” and now mean “send to topic X under target group”. Three viable strategies; the substrate proposal does not lock one in:
- Transparent redirect. Old caps keep working; chat-server’s
dispatch routes calls to the new topic.
describe()reveals the new identity. Pros: zero client code change. Cons: leaks “this used to be a separate group” history; may surprise users. - Forwarding denial. Old caps go stale with a
chatMergeddenial that includes a forwarding hint (event id and a reference the client can fetch to obtain the new topic cap). Pros: clean break; auditable. Cons: every client across every member needs to handle the forwarded-redirect at the call site. - Holder-driven re-bind. chat-server delivers a presence event to every affected member carrying the new cap; the old cap stays usable for a grace window after the merge, then goes stale. Lets clients re-bind without disruption; the eventual stale flip ensures no permanent dual identity.
The third strategy reads cleanest to me, but it benefits from prototyping. Implementation iteration will pick one.
Capability Granting
The current Chat interface in schema/capos.capnp is open-by-default:
holding the system Chat cap lets a process join any channel by name and
send to any channel. That is the wrong model. This section defines an
ocap-disciplined replacement: every Chat capability is granted
explicitly by a holder that already has it, every derived cap has a
recorded parent, and revocation cascades through the derivation tree.
Cap flavours
The substrate defines four kinds of caps. The exact schema is part of the implementation iteration; the shape is what matters.
-
Chat service root cap. Held by chat-server itself, never handed to user code. The root authority from which every other chat cap ultimately derives. Manifest configuration tells chat-server which groups and channels to materialize at startup; chat-server uses its root cap to do so. The root cap is the lineage root; it is not “ambient authority handed out by the broker” – it is service authority held by the service that runs Chat.
-
Role caps. A role on a specific chat is a cap. Roles inherit upward; concrete role caps embed the unified
ChatEndpoint/ChatWriterbase interfaces.GroupOwner(group)extendsGroupAdminextendsGroupMemberextendsChatWriter. Full authority on the group: appoint admins, create/remove rooms (text topics + voice rooms + stage rooms), change group settings, kick members, issue invites, open public-join routes, disband.GroupAdmin(group)adds member/branch/room moderation and invite-policy management. Per-permission DSL (can-pin, can-invite, can-create-room, …) is future work; first slice ships a single Admin role.GroupMember(group)– read and write all rooms under the group’s default policy. Members may invite others if the group’s policy allows. Members access voice/stage rooms viavoiceRoom(id)/stageRoom(id)and ephemeral conferences viacallSurface().ChannelOwner(channel)extendsChannelAdminextendsChannelPublisherextendsChatWriter. Full broadcast authority. Per-kind subscribers –ChannelTextSubscriber(channel)extendsChatTextReaderonly,ChannelAudioSubscriber(channel)extendsChatTextReader + ChatAudioReader,ChannelVideoSubscriber(channel)extends all three readers – are read-only at the type level. Promotion to publisher goes throughChannelAdmin.makePublisher.DmPeer(dmId, direction)extends onlyChatTextWriter(NOT fullChatWriter). DM text is host-readable; DM media is NOT – audio/video flows only throughDmPeer.callSurface() -> E2ECallSurface, where chat-server forwards already-encrypted frames between peers. A→B peer cap gives A the right to push text to B; it is not symmetric.E2EDmPeeris the analogous cap for end-to-end-encrypted DMs (does not extendChatWriterbecause its payloads areCipherEnvelope, notChatOutboundEvent).CallParticipant/CallHost– ephemeral conference participation; held while a Call is live, parented to the joiner’s chat role cap. Voice/stage variants have their own concrete role caps (StageListener,StageSpeaker).StageListeneris parented to the joiner’sGroupMemberrole cap (joinAsListeneris a normal accessor on the member’s stage facet);StageSpeakeris the exception — see below.SpeakerToken/SpeakerRevoker– a stage-room admin’s grant of speak authority for a specific listener. HoldingSpeakerTokenlets that listener callStageRoom.joinAsSpeaker(token) -> StageSpeaker, and chat-server mints the resultingStageSpeakerwithparent = the SpeakerToken's lineage node. The admin holds the companionSpeakerRevoker(parented to the admin’sStageRoomAdmincap);revoker.revoke()epochs both the unredeemed token and any activeStageSpeakerredeemed from it, so pulling the floor back actually kills the live speaker cap rather than just blocking future redemptions.
-
Listener-side caps. Held by the receiver. Minted locally; never issued by anyone else. The receiver hands a listener cap to a chat role cap (Group, broadcast Channel, DM, voice/stage room) when subscribing; that role cap calls back per event. Dropping the listener (or cancelling the returned
Subscription) is the receiver’s instant revocation tool.TextListenerAudioSinkVideoSink
-
Discovery / join caps.
ChatDirectory(scope)– read-only access to the discoverable chats (Groups and broadcast Channels) chat-server’s configuration exposes for this scope. Bundled to sessions per chat-server config (e.g. operator-class sessions getChatDirectory(operator-scope)). Holding it lets the session callChatDirectory.search(query) -> ChatDirectoryPageand filter by chat-server-defined criteria. Not a global index – each scope is whatever chat-server’s config carves out.DiscoverableGroupJoin(group)– “you are allowed to join this group”. Returned byChatDirectory.search(query)entries that the scope’s policy says the caller may join, or bundled directly to a session by chat-server config. Possessing it is the authority; callingDiscoverableGroupJoin.join() -> GroupMembermints a fresh role cap. There is no separate “redeem” step; possession is authority, the method just produces the derived cap.DiscoverableChannelTextSubscribe(channel)/DiscoverableChannelAudioSubscribe(channel)/DiscoverableChannelVideoSubscribe(channel)– analogous for broadcast Channels. Each returns the matching per-kindChannelTextSubscriber/ChannelAudioSubscriber/ChannelVideoSubscribercap; the result type tells the caller exactly which media facets they hold.InviteToken– a one-shot or n-shot bearer token an admin or policy-permitted member produces viaGroupMember.invite(forSubject, lifetime) -> (token, revoker, inviteRef). The invitee callsSelf.acceptInvite(token) -> GroupMember. The token interface has NO revoke method; revocation lives on the issuer-held companionInviteRevokercap, parented to the issuer’s role cap in chat-server’s lineage tree. The issuer also keeps the non-secretinviteRef :GroupCapReffor the cap-cleanGroupAdmin.describeBranch/revokeBranchpath. (For paper / QR / out-of-band handoff where the recipient cannot receive a cap, the issuer usesGroupMember.inviteCode(lifetime) -> (code :Data, revoker, inviteRef)instead, and the recipient callsSelf.acceptInviteCode(code). The bytes are bearer secrets that name a distinct lineage node in chat-server’s tree – the issuance entry created byinviteCode. On redemption chat-server mints the resultingGroupMembercap withparent = the inviteCode lineage node, NOT directly withparent = the inviter's role cap. Revoking via the companionInviteRevokertherefore epochs only that code’s descendants. See How bearer caps cross principal boundaries below for the full redemption-parent contract, and treat the bytes with bearer-secret hygiene – do not log, prefer bounded lifetimes and rate-limited redemption.)SpeakerToken/SpeakerRevoker– analogous shape for stage-room speak grants. Bearer holdsSpeakerToken(no revoke method); admin holdsSpeakerRevokerminted viaStageRoomAdmin.promoteToSpeaker(listenerRef).Self.contact()– a cap a principal produces to advertise “you may DM me”. The method returns BOTH the bearerContactCap(handed to whoever should be able to DM the issuer) AND a non-secretContactCapRefthe issuer keeps forSelf.revokeContact(ref). A holder of the bearer cap callsSelf.openDm(contactCap) -> DmPeer(orSelf.openE2EDm(contactCap) -> E2EDmPeer). The contact-issuing principal sees the resulting DM via their ownSelfcap’s notification surface. Equivalent to a Telegram contact card or a published@handle; the substrate’s only guarantee is that you needed a contact cap (or its bytes form viaSelf.contactCode, which similarly returns both the bearer-secretcodeand a non-secretcodeIdrevocation handle) to initiate.
There is no IntroCap primitive. What I formerly called
“redeem an intro” is just calling a method on a DiscoverableGroupJoin / DiscoverableChannel*Subscribe,
InviteToken, or contact cap that returns a derived role cap.
How bearer caps cross principal boundaries
The substrate forbids raw bearer transfer of chat caps via
kernel-enforced transfer_policy. But a flow like “Alice creates
an InviteToken and gives it to Bob” inherently means a cap moves
from Alice’s process to Bob’s. The same applies to ContactCap
sharing.
These chat-class cap transfers go through chat-server itself,
never through raw IPC IPC_TRANSFER_CAP. Two paths:
-
Cap reference inside a chat event.
ChatOutboundEvent.datamay carry chat-server-recognized chat-class cap references (anInviteToken, aContactCap). When a holder sends such an event withChatTextWriter.send, chat-server inspects the payload, sees the cap reference, and on delivery to each recipient re-mints a fresh derived cap. The lineage parent for the re-minted recipient cap is the original issuance node, NOT the sender’s chat cap, so that the issuer-held revoker (e.g.ContactCapReffromSelf.contact,InviteRevokerfromGroupMember.invite) reaches every recipient copy and every downstream descendant when the issuer revokes. If chat-server instead parented under the sender’s chat cap, only the sender’s branch would be killed on revoke; recipient copies and theDmPeer/GroupMembercaps minted from them would survive, defeating the issuer-side revocation contract. The original bearer cap stays in the sender’s table; the recipient receives a fresh cap of the same kind, parented under the issuance node. Lineage is preserved; raw bearer transfer never happens. -
Out-of-band delivery + recipient redeem. Bytes can be exchanged through a non-chat path (paper handoff, QR code, manifest entry in a test fixture). Issuers produce the bytes through
Self.contactCode/GroupMember.inviteCode; recipients redeem them viaSelf.openDmFromCode(code),Self.openE2EDmFromCode(code), orSelf.acceptInviteCode(code).The bytes are bearer secrets – any holder who also has a
Selfcap can redeem them – so chat-server treats each issued code as a distinct lineage node in its tree, not as a transparent identifier collapsed onto the issuer’s cap. When the issuer mints a code viainviteCode/contactCode, the code’s lineage entry hasparent = the issuing role/Self capand the issuer holds the matchingInviteRevoker(forinviteCode) or revokes viaSelf.revokeContactCode(codeId)(forcontactCode). When a recipient redeems, chat-server mints the derived cap withparent = the code's lineage node, NOT directly with parent = the issuer’s cap. So:- Revoking a single
contactCodeepochs only that code’s descendants; other contact caps and codes the same issuer has handed out are unaffected. - Revoking an
InviteToken’s revoker (or its companioninviteCode) kills the redeemed Member cap and any sub-invitees that Member produced, without affecting other invites the same admin issued. - The issuer-held revoker /
revokeContactCodeis the only way to revoke that specific handoff. Bearer copies that have not yet redeemed simply fail closed once revoked.
Bearer-secret hygiene applies: codes have lifetimes, are bound to a single issuance entry, and chat-server may rate-limit redemption attempts per code to bound brute-force guessing.
- Revoking a single
The kernel’s transfer_policy rejection of raw IPC-cap-transfer
is what closes the loophole. chat-server’s typed delivery methods
(or the byte-form code paths above) are the only ways a chat-class
cap reaches a new principal; lineage is recorded at chat-server
side in either case.
Approval grants are NOT chat caps and are not re-minted through
chat lineage. approvalRef is a payload kind that lets a chat
event display an approval request, but the live ApprovalGrant
cap travels by ordinary capnp-rpc cap reference between the
approval service and its caller – the same way it would without
chat. chat-server only forwards the audit-safe descriptor for
display; if the recipient needs the actual ApprovalGrant cap,
it comes from AuthorityBroker.request / ApprovalClient, not
from a chat-server re-mint. Approvals stay separate (see the
“Approvals Stay Separate” section).
Per-principal ban list
Rotating a member’s branch (revokeBranch(memberCap)) kicks
their current chain. But if the principal still holds a
DiscoverableGroupJoin (or DiscoverableChannel*Subscribe) cap,
or has a session bundle hook that hands one out at login, they
can call .join() / .subscribe() and mint a fresh chain. For
real ban semantics, chat-server tracks a per-chat ban list:
-
Group ban.
GroupAdmin.banPrincipal(principalRef)adds the principal to the group’s ban list; chat-server checks it on every Group-side mint path that could attach a fresh role cap to that principal:- public-join redemption:
DiscoverableGroupJoin.join; - cap-form invite redemption from outside the group:
Self.acceptInvite(token); - cap-form invite redemption from inside an existing group
context:
GroupMember.acceptInvite(token)(same wire as the Self-form, but invokable when the invitee already holds a member cap in another group and chat-server forwarded the InviteToken through that group’s chat event); - byte-form invite redemption:
Self.acceptInviteCode(code); - admin-mint paths on the Group role hierarchy:
GroupOwner.makeAdmin, plus any other future role-promotion methods chat-server adds to GroupOwner / GroupAdmin (Channel-side methods likeChannelAdmin.makePublisherare NOT in this list – those belong to the Channel ban below); - every manifest-driven session bundle hook that attaches a
Group role cap at login (
GroupOwner/GroupAdmin/GroupMember); and - every transformation-driven auto-mint path
(
mergeIntoGroupAsTopic/moveTopicHerewithmemberPolicy=autoInvite, and the per-topic-member auto-migration step insideextractTopicAsGroup).
Without the transformation check, a source-owner plus target-admin pair could graft a banned principal back into a group via merge or move; without the login-bundle check, a banned operator who has the
lobbygroup attached by their session profile would receive a freshGroupMember(lobby)(orGroupAdmin(lobby)) cap on their next login and bypass the ban. Banned principals caught in a transformation are dropped from the autoInvite set with aprincipalBannedaudit event; the transformation itself still completes for non-banned members. - public-join redemption:
-
Channel ban.
ChannelAdmin.banPrincipal(principalRef)adds the principal to the broadcast Channel’s ban list; chat-server checks it when minting viaDiscoverableChannelTextSubscribe.subscribe/Audio/Video, onChannelAdmin.makePublisher, onChannelOwner.makeAdmin, and on any Channel role cap (ChannelOwner/ChannelAdmin/ChannelPublisher/Channel{Text,Audio,Video}Subscriber) attached by manifest-driven session bundles at login (same reason as the Group case). -
Self-creation ban via
Self.startGroup. A globally banned principal whose chat-server policy disallows new groups (e.g. manifest setsSelf.startGroupper principal class) cannot bypass by including a bannedContactCapininitialInvites; chat-server validates each contact against its issuer’s bans before minting auto-invites.
Banned principals get a typed principalBanned denial.
unbanPrincipal removes the entry. Banning is independent of
revokeBranch: revoke kicks the active chain; ban prevents new
chains; an admin typically does both as a single workflow (“kick
- ban“).
Where caps come from
The chain always terminates at chat-server’s own root cap. There is no broker-side ambient minting; the broker’s role is to hand out chat-server-issued caps that chat-server’s config has already authored for sessions matching certain profiles.
| Cap | Originating issuer | How a session first holds it |
|---|---|---|
Self | chat-server, once per session at login from the caller’s authenticated identity | parent is chat-server’s root, exactly one Self cap per (principal, session) tuple; chat-server creates it the first time the broker hands a session to chat-server. All ContactCap / contactCode / Self-driven group-creation chains terminate at this Self node, which terminates at chat-server’s root, satisfying the lineage invariant. The Self cap is never delivered cross-principal; its lifetime is the session’s lifetime. |
GroupOwner (manifest-bundled) | chat-server, when the manifest declares the group | bundled to the configured Owner principal’s session at login; parent is chat-server’s root, the manifest entry is its own chain |
GroupOwner (Self.startGroup) | chat-server, on Self.startGroup(config) | parent is the calling principal’s Self cap; minting is gated by chat-server’s per-principal-class group-creation quota |
GroupOwner (extractTopicAsGroup) | chat-server, on GroupAdmin.extractTopicAsGroup(..., creator :Self) | parent is the extract-operation lineage node (OperationNodeInfo with selfCreation consent); the extract op is itself a child of the source group’s root |
GroupAdmin (manifest-bundled) | chat-server, when the manifest bundles admin to a profile (e.g. the test fixture’s chat.groups.X.admins entry) | parent is chat-server’s root, the manifest entry is its own chain |
GroupAdmin (Owner-minted) | chat-server, on GroupOwner.makeAdmin(memberRef); delivered to the promoted principal via Self.subscribeIncoming.groupAdminGranted | parent is the promotion issuance lineage node (IssuanceNodeInfo with kind groupAdminGrant); the issuance node parents to the calling GroupOwner cap. Revoking via the issuer-held RolePromotionRevoker epochs the issuance node and the promoted GroupAdmin under it. |
GroupMember (manifest-bundled) | chat-server, when the manifest bundles membership to a profile | parent is chat-server’s root, the join is its own chain |
GroupMember (public-joined) | chat-server, on DiscoverableGroupJoin.join() | parent is the joiner’s own root within the group (each public join is its own distinct chain) |
GroupMember (invited, cap form, Self redemption) | chat-server, on Self.acceptInvite(token) | parent is the InviteToken issuance lineage node, which itself parents to the inviter’s role cap |
GroupMember (invited, cap form, in-context redemption) | chat-server, on GroupMember.acceptInvite(token) (the in-context redemption used when the invitee already holds a GroupMember cap in another group through which the inviter forwarded the InviteToken) | same parent semantics as the Self-form: the InviteToken issuance lineage node, which parents to the inviter’s role cap |
GroupMember (invited, code form) | chat-server, on Self.acceptInviteCode(code) | parent is the inviteCode lineage node, which itself parents to the inviter’s role cap |
GroupMember (transformation-grafted, merge/move autoInvite) | chat-server, on mergeIntoGroupAsTopic / moveTopicHere with memberPolicy=autoInvite | parent is the transformation operation node (OperationNodeInfo arm of LineageNode with partnerAdmin consent); revoking the op node epochs every grafted member |
GroupMember (transformation-grafted, extractTopicAsGroup) | chat-server, on GroupAdmin.extractTopicAsGroup(..., creator :Self) for each existing topic member auto-migrated into the new group | parent is the extract operation node (OperationNodeInfo arm of LineageNode with selfCreation consent); revoking the op node epochs every auto-migrated member of the extracted group |
ChannelOwner (manifest-bundled) | chat-server, when the manifest declares the channel | bundled to the configured Owner principal’s session at login; parent is chat-server’s root, the manifest entry is its own chain |
ChannelTextSubscriber (public) | chat-server, on DiscoverableChannelTextSubscribe.subscribe() | parent is the subscriber’s own root within the channel |
ChannelAudioSubscriber (public) | chat-server, on DiscoverableChannelAudioSubscribe.subscribe() | parent is the subscriber’s own root within the channel |
ChannelVideoSubscriber (public) | chat-server, on DiscoverableChannelVideoSubscribe.subscribe() | parent is the subscriber’s own root within the channel |
ChannelTextSubscriber / ChannelAudioSubscriber / ChannelVideoSubscriber (manifest-bundled) | chat-server, when the manifest bundles a per-kind subscriber to a profile | parent is chat-server’s root, the manifest entry is its own chain |
ChannelPublisher (Admin-minted) | chat-server, on ChannelAdmin.makePublisher(subjectRef); delivered to the promoted principal via Self.subscribeIncoming.channelPublisherGranted | parent is the promotion issuance lineage node (kind channelPublisherGrant); the issuance node parents to the calling ChannelAdmin cap. Revoking via RolePromotionRevoker epochs the issuance node and descendants. |
ChannelPublisher (manifest-bundled) | chat-server, when the manifest bundles publisher to a profile | parent is chat-server’s root, the manifest entry is its own chain |
ChannelAdmin (manifest-bundled) | chat-server, when the manifest bundles admin to a profile | parent is chat-server’s root, the manifest entry is its own chain |
ChannelAdmin (Owner-minted) | chat-server, on ChannelOwner.makeAdmin(...); delivered to the promoted principal via Self.subscribeIncoming.channelAdminGranted | parent is the promotion issuance lineage node (kind channelAdminGrant); the issuance node parents to the calling ChannelOwner cap. Revoking via RolePromotionRevoker epochs the issuance node and descendants. |
DmPeer (cap form) | chat-server, on Self.openDm(contactCap) | parent = the ContactCap lineage node |
DmPeer (code form) | chat-server, on Self.openDmFromCode(code) | parent = the contactCode lineage node |
E2EDmPeer (cap form) | chat-server, on Self.openE2EDm(contactCap) | parent = the ContactCap lineage node |
E2EDmPeer (code form) | chat-server, on Self.openE2EDmFromCode(code) | parent = the contactCode lineage node |
ChatDirectory(scope) | chat-server, configured per scope in the manifest | bundled to sessions matching the scope’s policy |
DiscoverableGroupJoin / DiscoverableChannel{Text,Audio,Video}Subscribe | chat-server, on ChatDirectory.search(query) for entries the scope policy allows | parent is the directory-scope’s policy entry |
InviteToken (cap form) | chat-server, on GroupMember.invite(...) | parent is the issuing role cap (admin or member depending on policy) |
inviteCode (code form, lineage node) | chat-server, on GroupMember.inviteCode(...) | parent is the issuing role cap |
ContactCap (cap form) | chat-server, on Self.contact(lifetime) | parent is the issuing principal’s Self cap |
contactCode (code form, lineage node) | chat-server, on Self.contactCode(lifetime) | parent is the issuing principal’s Self cap |
InviteRevoker / SpeakerRevoker | chat-server, returned alongside the matching token / promotion | parent is the issuing role cap |
SpeakerToken | chat-server, on StageRoomAdmin.promoteToSpeaker(listenerRef) | delivered to the bound listener via stage roster events; parent is the admin cap |
listener caps (TextListener, AudioSink, VideoSink) | minted locally by the receiver | not in any lineage chain; revocation is local drop |
Manifest is Chat service configuration, not kernel or broker configuration. It declares the initial groups/channels, who owns them, who appears in which discovery scope, and which sessions are auto-bundled with which caps. chat-server reads it at boot and acts on its own root cap. The kernel only manages cap epochs and dispatch.
The broker’s role is to bundle initial caps a session needs to
use what it already has – e.g. a manifest can configure that
“chat-server starts with operator-lobby already created and
GroupMember(operator-lobby) bundled to operator-class sessions”. The
broker hands those session bundles out at login; chat-server is the
issuer.
Granting flows
Operator joins the operator-lobby at boot (manifest bundle). The
manifest declares chat-server’s startup config: create
operator-lobby with chat-server’s own service principal as Owner;
bundle GroupMember(operator-lobby) to every session whose profile
is operator. At login, the broker hands the operator session a
chat-server-issued GroupMember(operator-lobby) cap. The cap’s
parent in chat-server’s lineage tree is “this session’s join entry”
– a fresh chain root specific to this session, not shared with
other operators. No approval step.
Operator joins a discoverable chat at runtime. Sessions hold a
ChatDirectory(operator-scope) cap. Operator calls
ChatDirectory.search(query) -> ChatDirectoryPage; chat-server
returns entries matching the scope’s policy. Each entry carries
a kind-specific discoverable cap depending on the chat’s kind:
DiscoverableGroupJoin for a Group, or one of
DiscoverableChannelTextSubscribe /
DiscoverableChannelAudioSubscribe /
DiscoverableChannelVideoSubscribe for a broadcast Channel.
Operator picks one and calls the matching method:
DiscoverableGroupJoin.join() -> GroupMember(group)for a Group entry.DiscoverableChannelTextSubscribe.subscribe() -> ChannelTextSubscriber(channel)(or the matching audio/video variant) for a broadcast Channel entry.
The new role cap’s parent in chat-server’s lineage is “this
session’s join event” – a fresh chain root for this join, not
shared with other joiners. Possession of the discoverable cap
is the policy gate; calling .join() / .subscribe() mints
the role cap. There is no separate “redeem” step.
An admin invites a specific person to a group. Admin holds
GroupAdmin(group) (which extends GroupMember). They call
GroupMember.invite(forSubject=PrincipalRef, lifetime=...) -> (token, revoker, inviteRef) (cap-form, used when the invitee
can receive a chat-server-mediated cap delivery – e.g. via an
existing DM) or GroupMember.inviteCode(lifetime=...) -> (code :Data, revoker, inviteRef) (byte-form, used when the
invitee can only receive bearer-secret bytes through paper
handoff, QR code, or non-chat channels). Both calls now also
return the issuance lineage node’s inviteRef :GroupCapRef,
which the issuer keeps alongside revoker for cap-clean
per-branch revocation later via GroupAdmin.describeBranch /
revokeBranch. The byte-form is the issuance entry described
under How bearer caps cross principal boundaries: a distinct
lineage node, not a transparent identifier collapsed onto the
inviter. chat-server records InviteToken.parent = the calling admin role cap (cap form), or inviteCode.parent = the calling admin role cap (byte form, naming the issuance
entry). The invitee calls Self.acceptInvite(token) -> GroupMember for the cap-form, or Self.acceptInviteCode(code) -> GroupMember for the byte-form; chat-server mints the
member cap with parent = the InviteToken/inviteCode lineage node. Lineage is Member -> InviteToken/inviteCode -> Admin -> ... -> chat-server root. The admin’s InviteRevoker
revokes that specific handoff (invalidates pre-redemption
bearer copies, epochs the redeemed member’s branch).
A member invites someone (if group policy allows). Same shape as
admin-invite, but the invite policy may restrict member-issued
invites (single-use, n-shot, or disabled). The invitee’s resulting
GroupMember cap is parented to the inviting member’s role cap,
not to the admin’s; this is the per-member chain that makes
spam-bot recovery work.
Spam-bot recovery (per-branch revoke). A member M used their
member cap’s invite authority to admit five spam bots. Owner or
admin obtains a GroupCapRef for M’s branch – without holding
M’s bearer cap, since transfer_policy forbids raw bearer
transfer. Two cap-clean obtain paths:
GroupAdmin.lookupByPrincipal(M.principal) -> List(GroupCapRef)if the admin is starting from M’sPrincipalRef. TheChatInboundEvent.senderfield is a disclosure-redacted display name (text), not aPrincipalRef, so the admin getsM.principalfrom one of the typed surfaces that actually carry aPrincipalRef: an audit-log entry, a user-search / identity-broker UI, or by inspecting a known lineage node viadescribeBranch– the returnedBranchInfo.rootis aLineageNode, and when its union arm iscapNode, thecapNode.principal :PrincipalRefis the unredacted owner (issuance / operation arms have no principal of their own; walk to acapNodedescendant). The redacted sender field is for display only.GroupAdmin.describeRoot() -> BranchInfo(GroupCapRef)for a full top-down walk when starting from “show me the whole group’s lineage tree” (recurse intoLineageNode.children; eachcapNodearm carries the unredactedprincipal :PrincipalReffor admins).
Then optionally GroupAdmin.describeBranch(node) -> BranchInfo(GroupCapRef) to render “this is who would be
revoked” UI before pulling the trigger, and
GroupAdmin.revokeBranch(node) to commit. chat-server rotates
the kernel-level cap epoch on M’s role cap and every descendant
of it – the five bots’ caps and any further sub-invitees.
Subsequent dispatch through any of them fails closed. Other
members of the same group, including operators who joined via
the same public DiscoverableGroupJoin(group) route, are
untouched because each public join produced its own distinct
chain rooted at that joiner’s join event.
Closing a public-join route without kicking existing members. Two parallel cases by chat kind:
- Group. Owner calls
GroupOwner.closePublicJoin(entry)with the entry handle minted bypublishDiscoverable. chat-server marks the public-join entry inactive and rotates the epoch on the sharedDiscoverableGroupJoincap class that every directory result handed out. SubsequentDiscoverableGroupJoin.join()calls fail closed; existingGroupMember(group)caps are unaffected because the discoverable cap is not in their lineage (the route is the policy that minted them, not their parent). To later re-open, owner publishes a freshDiscoverableGroupJoin– a new cap with a fresh epoch. - Channel (broadcast). Owner calls
ChannelOwner.closePublicJoin(entry). chat-server rotates the epoch on whicheverDiscoverableChannel{Text,Audio,Video}Subscribecap class was associated with the entry (one epoch rotation can cover all three kinds for a single Channel route or carve them separately; chat-server config decides). ExistingChannel{Text,Audio,Video}Subscribercaps are unaffected – the discoverable cap is again not in their lineage.
Two principals open a DM (contact-cap path). Alice wants to be reachable. She has two options depending on how the recipient will receive the contact:
-
- Cap form – `Self.contact(lifetime=…) -> (contact
- ContactCap, ref :ContactCapRef)
. The bearerContactCapis shared via chat-server-mediated cap delivery (e.g. attached to a chat event in an existing group Alice is in, where chat-server re-mints it for each recipient). TheContactCapRefis Alice's non-secret revocation handle; she keeps it locally (alongside whatever metadata her UI shows in a "contacts I've issued" list) and later callsSelf.revokeContact(ref)` if she wants to retract this contact. Use this form when the recipient already has a cap-bearing channel to Alice.
- Code form –
Self.contactCode(lifetime=...) -> (code :Data, codeId :Data). The bearer-secretcodebytes are shared out-of-band (pinned in Alice’s public-profile post, printed on a business card, encoded as a QR, sent over an unrelated channel); thecodeIdis the non-secret revocation handle Alice keeps and later passes toSelf.revokeContactCode(codeId). Use this form when the recipient cannot receive a cap (no shared chat yet, or out-of-band handoff).
Bob, holding Self for his own session, calls one of the
recipient methods: Self.openDm(contactCap) -> DmPeer for cap
form, or Self.openDmFromCode(code) -> DmPeer for code form.
chat-server mints Bob’s DmPeer(B->A) with parent = the ContactCap or contactCode lineage node, and delivers Alice’s
side DmPeer(A->B) to Alice via Self.subscribeIncoming –
specifically, the dmOpened :DmPeer arm of the
SelfIncomingEvent union, with source :IssuanceSource
carrying the contactRef :ContactCapRef Alice retained from
her earlier Self.contact(...) issuance (cap-form path) or
the codeId :Data from Self.contactCode(...) (code-form
path). Alice’s UI matches the event to its issuance entry
through that ref.
Either party drops their listener subscription to stop receiving
(instant); Alice may call Self.revokeContact(ref) (cap form)
or Self.revokeContactCode(codeId) (code form), passing the
issuer-side handle she retained from the earlier issuance call,
to revoke just that contact’s branch and any DM chains derived
from it, without affecting DMs Alice established via different
contact caps.
Sending to an agent the operator owns. Manifest configures: when
operator session starts an agent, chat-server creates a fresh
agent-prompt group with operator as Owner and the agent runner’s
session as a Member. Operator already holds GroupOwner(agent-prompt)
because chat-server made them Owner at group creation time. No
approval step. Tool consent inside the agent runner remains a
separate concern handled by ApprovalClient.
Sending to an agent the operator does not own. The agent’s
owner controls reachability. They publish a DiscoverableGroupJoin (or per-kind channel-subscribe)
in their scope’s directory, or hand out a contact cap to a specific
operator, or invite to a specific group. There is no protocol-level
way to write the agent without already holding such a cap.
Listener-side filter (soft mute). Subscribers may pass options
on subscribeText/Audio/Video that filter inbound events by
sender lineage, e.g. muteSenderBranch(parentCapId). Sender caps
may have been validly minted; filter is a soft mute, not a
revocation. For hard revocation, the owner must call
revokeBranch.
Worked examples
These ground the abstract granting flows in concrete scenarios that will appear in implementation iterations.
Public/system channel: making lobby reachable to all operators.
Two valid paths, both expressed as Chat-service configuration:
-
Manifest-bundled membership. The chat-server manifest declares the group and the auto-bundle policy:
chat: groups: lobby: owner: principal:chat-server # the service runs as Owner bundles: - profile: operator attach: GroupMember(lobby)At startup, chat-server creates
lobbyand prepares the attach-on-login behavior. When an operator session logs in, the broker invokes chat-server’s per-session bundle hook; chat-server mints a freshGroupMember(lobby)cap for that session withparent = chat-server root(specifically: a per-session chain root). No two operators share the same chain. To remove one operator the admin runs the deny-list-only ban semantic as a pair of calls:GroupAdmin.revokeBranch(theirMemberRef)to epoch the current chain (the operator’s active session fails closed on the next dispatch) ANDGroupAdmin.banPrincipal(theirPrincipal)to add them to the deny-list so the bundle hook does NOT mint a freshGroupMember(lobby)on their next login. Either step alone is meaningful but incomplete: revokeBranch alone leaves the bundle hook open, and banPrincipal alone leaves the current session running. Other operators’ chains are unaffected by either step. -
Discoverable join via
ChatDirectory. chat-server’s manifest declares the lobby visible in the operator scope:chat: groups: lobby: owner: principal:chat-server directories: operator-scope: bundle-to: { profile: operator } entries: - group: lobby # the entry references the # Group above; the manifest # key uses `group:` rather # than the reserved # `channel:` since lobby is # a Group not a broadcast # Channel. join-policy: any-holder # anyone holding the # DiscoverableGroupJoin(lobby) entry # may call .join()Operator sessions get
ChatDirectory(operator-scope)bundled at login. The operator callsChatDirectory.search(query), sees thelobbyentry with aDiscoverableGroupJoin(lobby)cap, and callsDiscoverableGroupJoin(lobby).join() -> GroupMember(lobby). Each public join is its own distinct chain: the new member’s parent is the per-session join event, not the shared per-kind discoverable cap cap. Kicking member M withGroupAdmin.revokeBranch(M)epochs M’s chain (and anyone M invited to the group) but leaves all other public-joined members intact. Because the public-join route is still open, M could re-join through it and mint a fresh chain unless the admin also callsGroupAdmin.banPrincipal(M.principal)– the deny-list-only ban primitive that blocks future mints for that principal. The full “kick + ban M” workflow is therefore the pairrevokeBranch(M)+banPrincipal(M.principal); either step alone is meaningful (kick without banning lets a contrite member re-join; banning a not-currently-active principal blocks future mints without epoching anything). To stop accepting new joins from anyone, the owner callsGroupOwner.closePublicJoin(entry)with theChatDirectoryEntryHandlereturned by the matchingpublishDiscoverablecall; chat-server epochs theDiscoverableGroupJoin(lobby)cap class. Existing members are unaffected.
The first path is right for “every operator should be in the lobby the moment they log in”; the second is right for “operators choose whether to join, and we want a single knob to stop accepting new joins without kicking existing members”. Both are configurations of the same Chat service, both produce per-member distinct chains, and neither requires a registry service outside Chat.
Cross-session messaging test (group case).
Iteration 4’s primary cross-session test exercises the default case: two sessions message each other through a shared group, which is how humans actually message each other in a Telegram-shaped system. The DM path is exercised separately because its cap-derivation chain is different.
Test fixture, in pseudo-CUE chat-server config:
chat:
groups:
test-lobby:
owner: principal:chat-server
# The DM negative-test case (case C below) needs an admin
# cap to call GroupAdmin.revokeBranch on a misbehaving
# invitee's chain. Manifest grants the console-tester
# profile a GroupAdmin cap on test-lobby so that test
# is implementable without changing the substrate; the
# default group-test path uses only the GroupMember
# subset of methods.
admins: [ principal:console-tester ]
bundles:
- profile: console-tester
attach: GroupAdmin(test-lobby) # extends GroupMember
- profile: ui-tester
attach: GroupMember(test-lobby)
sessions:
console:
profile: console-tester
ui:
profile: ui-tester
Test flow:
- chat-server creates
test-lobbyat boot and registers the per-session bundle behavior. At login, the broker invokes chat-server’s bundle hook for each session; chat-server mints a freshGroupAdmin(test-lobby)cap for the console session (which inherits allGroupMembermethods so the group-test path below works unchanged) and a freshGroupMember(test-lobby)cap for the UI session, each its own chain root in chat-server’s lineage tree. The admin cap is what enables Negative case C in the DM flow. - Console session opens its bundled member cap, mints a
TextListener, callsgroupMemberCap.subscribeText(listener). - UI session does the same through the trusted Rust backend.
- Console session calls
groupMemberCap.send(event{kind=text, text="hi from console"}). - UI session’s listener receives the inbound event; UI backend surfaces it as a view-model row in the browser’s chat panel.
- UI session sends a reply; console session’s listener receives.
- Test asserts both directions of the round-trip and asserts that
the redacted transcript contains
kind=textevents from both senders without leaking session-id hex or raw cap handles.
This proves: default capset distribution works; subscribe/send round-trip works; cross-session listener delivery works.
Cross-session messaging test (DM case).
Same fixture extended with a Self cap on each session (the cap
that lets a principal produce a contact cap and accept incoming
DMs). Both sessions are also members of test-lobby from the group
test, which is the substrate “out-of-band” channel through which
the contact cap travels.
-
Console session calls
console.contact()and binds the tuple result(contactCap, ref). The bearercontactCapis a chat-server-issued cap that says “any holder may open a DM to console”; theref :ContactCapRefis the issuer-side revocation handle Console retains (Negative case B uses it). Console session sends ONLY the bearercontactCapto the UI session through the existingtest-lobbygroup chat (the group’ssend()accepts cap references in events for exactly this purpose); it does NOT send theref. The contact cap’s parent in chat-server’s lineage is “console session’s contact-issuance event” – a fresh chain root. -
UI session receives the chat event carrying the contact cap, extracts it, and calls
ui.openDm(contactCap) -> DmPeer(UI->Console). chat-server mints both directions: UI’sDmPeer(UI->Console)withparent = contactCap, and Console’s ownDmPeer(Console->UI)delivered via Console’sSelfnotification surface, with the same parent. -
Both sides
subscribeText, exchange messages, assert round-trip. -
Negative case A: a third session that did not receive the contact cap cannot construct one (it has no
Self.contact()path bound to the console principal). The test does not even need a denial assertion – the third session has no cap to call. -
Negative case B: console calls
Self.revokeContact(ref), passing theContactCapRefit retained from the earlierSelf.contact(...)call. chat-server epochs the contact cap and the DmPeer chains derived from it. UI’s subsequentDmPeer.sendfails closed withstaleCap. The test asserts the typed denial. -
- Negative case C: console invites a hostile third party to
test-lobbyviaGroupMember.invite(forSubject=hostilePrincipal, lifetime=...), binding the result tuple `(token :InviteToken, revoker - InviteRevoker, inviteRef :GroupCapRef)
. console keeps bothrevoker(issuer-side revocation handle, parented to console's admin role cap) ANDinviteRef(the issuance lineage node, anIssuanceNodeInfowith kindinviteToken); both are non-secret and stored in the fixture's "outstanding invitations" record. console delivers onlytokento the hostile party through chat-server's normal cap-delivery path. The hostile party redeems withSelf.acceptInvite(token) -> GroupMember(test-lobby)and uses the cap badly. console (holdingGroupAdmin(test-lobby)` per the fixture) has two cap-clean revocation paths:
revoker.revoke()– the simplest path: console already holds the issuer-side handle and does not need any new ref. chat-server epochs the InviteToken’s lineage node and any descendants (the hostile member’sGroupMembercap and any sub-invitees they admitted).- General per-branch path: console obtains a
GroupCapReffor the hostile branch through one of the typed sources declared onGroupAdmin:- the
inviteRefreturned by the originalGroupMember.invite(...)tuple (if console issued the invite itself); GroupAdmin.lookupByPrincipal(hostilePrincipal)if console did NOT issue the invite (e.g. when revoking somebody else’s invitee or a public-join chain);- or
GroupAdmin.describeRoot()for a full top-down walk. ThenGroupAdmin.describeBranch(node)to inspect the subtree before pulling the trigger, andGroupAdmin.revokeBranch(node)to epoch it. Raw transfer of the hostile party’s bearerGroupMembercap is NOT how console gets the ref;transfer_policyforbids that, and chat-server’s lineage queries are the cap-clean substitute.
- the
The test asserts the third party is gone (
staleCapon the next dispatch through the revoked branch) and that UI’s DM with console is not affected (different lineage chain). - Negative case C: console invites a hostile third party to
This proves: contact-cap-driven DM works; DM peer caps are direction-bound (asymmetric); revoking a contact cap propagates to derived DMs without touching unrelated caps; per-branch revocation isolates spam without cascading to siblings; no cold-call path exists.
Cap lineage and transitive revocation
Each chat host maintains an internal cap-derivation tree:
- Every cap minted by a derive method has a recorded parent.
- A cap’s active descendants are reachable by tree walk.
revokeBranch(cap)rotates the kernel cap-epoch for the cap and all its active descendants. Subsequent dispatch through any of those caps fails closed.- The kernel does not need to know about lineage; it only sees
per-cap epochs (already an existing mechanism). Lineage tracking
is the chat host’s job. The kernel enforces the cap’s
transfer_policy, which forbids raw bearer transfer for chat caps – so the only way for a cap to reach a new principal is through a derive method, which records lineage.
Why service-side bookkeeping rather than kernel-tracked lineage.
capOS’s stated principle (docs/capability-model.md,
CLAUDE.md) is to “prefer userspace capability wrappers over
kernel-side policy checks.” Lineage has a domain-specific shape per
service (a chat group vs a file share vs a credential vault all want
different revocation semantics), and putting it in the kernel forces
every cap to carry lineage overhead even when its service does not
need it. The service-side approach lets each host implement the
semantics it actually needs, while leaning on existing kernel
mechanisms (cap epoch, transfer policy) for enforcement.
Revocation primitives
Three independent revocation paths, all observable as typed denials:
- Listener-side instant drop. Receiver
cancel()s theSubscriptioncap or drops the listener. No further pushes from anyone reach that listener. This is the receiver’s primary tool for “leave me alone right now”. - Branch revocation by lineage. Admin calls
GroupAdmin.revokeBranch(node :GroupCapRef)/ChannelAdmin.revokeBranch(node :ChannelCapRef), passing a typed lineage-node ref obtained fromdescribeRoot/lookupByPrincipal/ theinviteRefreturned by an earlierinvite/ the various*Reffields onSelfIncomingEvent– never a raw bearer cap (transfer_policyforbids cross-principal cap transfer; chat-server’s lineage queries are the cap-clean substitute). Issuer-held revoker caps cover the analogous bearer flows:Self.revokeContact(ref)/Self.revokeContactCode(codeId)for contact-driven DMs;InviteRevoker.revoke()for an outstanding invite;SpeakerRevoker.revoke()for stage-room speak grants;RolePromotionRevoker.revoke()for role promotions. In every case chat-server rotates the kernel epoch on the named branch. Used for “remove a misbehaving admin and everything they admitted”, “kill a contact cap that fell into spammer hands”, “shut down a topic and everyone who joined via it”. A separate operation,GroupOwner.closePublicJoin(entry)/ChannelOwner.closePublicJoin(entry), stops new joins through aDiscoverableGroupJoin/DiscoverableChannel*Subscriberoute without kicking existing members (the route is the policy that minted them, not their parent in the lineage tree). - Chat-wide invalidation. A
GroupOwner.disband/ChannelAdmin.closeChannelcall invalidates the whole chat (or the room is closed, or the agent shut down). Subsequent calls returnstaleChannel.
Revocation is not silent. All three paths surface as typed
staleCap / staleChannel denials at the next call site, with the
remote CapSet UI reflecting them as kind=presence chat events
(“you were removed from this group”, “this channel has closed”) or
on the next operator action.
Audit
Every derive and every revocation is auditable. The host’s lineage
tree is itself the audit substrate: for any cap, “who derived this,
when, from which parent, with what method” is a tree query. The
audit log records the caller’s session-scoped reference per
session-bound-invocation-context-proposal.md. Listener
subscribe/unsubscribe is auditable from the receiver’s session.
What this proposal does NOT decide
- The exact role-permission DSL for
GroupAdmin(Telegram allows per-admin granular permissions: can-pin, can-invite, can-edit; capOS’s first slice can ship a single Admin role and refine later). Schema must leave room. - Per-topic permission overrides within a group. First slice is group-wide policy; topics are sub-channels under the same membership.
- Group DMs (multi-recipient DMs). Likely modeled as a Group with Owner=initiator, Members=invited principals; no fan-out DmPeer. Details in a follow-up.
- The kernel feature for per-cap
transfer_policyto forbid raw bearer transfer specifically for chat-cap-classes. capOS’sCapInfo.transfer_policyalready exists as a string field; the exact policy values live in a kernel/auth follow-up. Until then, channel-host lineage tracking can still work but with a soft invariant: derive methods are the intended path; raw bearer transfer is not blocked at kernel level. The implementation iteration must close this gap before the substrate is treated as hardened. - The exact
ActionPlanandCapRequestschemas referenced fromApprovalClient. They are an approvals-side gap, not a chat-side one.
End-To-End Encrypted DMs
End-to-end-encrypted DMs are a distinct cap layer sitting on top of
the regular DM substrate, not a flag on DmPeer. Reasons to keep them
separate:
- The chat host carries ciphertext only and never sees plaintext. That is a strong invariant; making it a flag risks a code path where plaintext leaks under “encryption disabled” conditions.
- Key exchange, authenticated encryption (AEAD), forward-secrecy ratchets (e.g. Signal-style double ratchet), and out-of-band fingerprint verification are concerns the unencrypted DM does not have. They need their own cap surface so the policy can be reasoned about per-DM.
- Auditing differs: an unencrypted DM’s host can audit message contents per disclosure policy; an encrypted DM’s host audits metadata only (sender, recipient, timestamp, ciphertext size).
Cap shape
The E2E peer cap is routing-only. It carries opaque ciphertext
between two endpoints; it never has access to plaintext or to the
AEAD ratchet keys. The KeyContext lives strictly in the principal’s
own process (held client-side via
cryptography-and-key-management-proposal.md primitives), is never
serialized into a chat-server-minted cap, and never crosses to
chat-server in any method argument or return.
# E2E DM peer cap. Minted by chat-server, but holds NO key state.
# It is a pure routing endpoint: it accepts opaque ciphertext for
# delivery, and routes opaque ciphertext to a listener.
interface E2EDmPeer extends(ChatEndpoint) {
send @0 (envelope :CipherEnvelope) -> ();
subscribeCipher @1 (listener :CipherListener,
options :SubscribeOptions) -> (sub :Subscription);
# Outgoing media: still flow-controlled, but the bytes have
# already been encrypted client-side by the holder. The peer cap
# does not see the plaintext frame, nor does it accept a key
# context as an argument.
openCipherOut @2 (format :CipherStreamFormat) -> (track :CipherOut);
remoteFingerprint @3 () -> (info :PeerFingerprint);
callSurface @4 () -> (calls :E2ECallSurface);
closeDm @5 () -> ();
}
# Listener and outgoing-media caps for E2E. Both carry opaque
# bytes; decrypt/encrypt happens in the holder's own process.
interface CipherListener {
cipher @0 (envelope :CipherEnvelope) -> ();
}
interface CipherOut {
writeCipherFrame @0 (envelope :CipherEnvelope) -> stream;
close @1 ();
}
struct CipherEnvelope {
ciphertext @0 :Data; # AEAD output; opaque to chat-server
associatedData @1 :Data; # AEAD AAD (e.g. sequence number,
# ratchet header) -- routing
# metadata only, no plaintext
receivedAtMs @2 :UInt64;
}
# E2E call surface. Narrower than CallSurface: NO setRoutingMode,
# because chat-server cannot mix or transcode (it doesn't have the
# keys), so SFU-forward is the only viable mode. The constraint is
# enforced at the type level -- the method simply doesn't exist.
interface E2ECallSurface {
current @0 () -> (info :ActiveCallInfo);
subscribeState @1 (listener :CallStateListener,
options :SubscribeOptions) -> (sub :Subscription);
startCall @2 (config :E2ECallStartConfig) -> (host :E2ECallHost);
joinCall @3 () -> (participant :E2ECallParticipant);
# Roster delivery for E2E (DM) calls. Required for
# `e2eHostGranted :E2ECallHost` delivery on
# E2ECallHost.promoteHost.
subscribeRoster @4 (listener :CallRosterListener,
options :RosterSubscribeOptions)
-> (sub :Subscription);
}
# E2ECallParticipant mirrors CallParticipant but accepts only
# already-encrypted CipherOut tracks; the participant cap does
# not handle key state. Receive is via subscribeCipher: the
# listener gets one fan-out stream of CipherEnvelope frames
# covering all participants' audio and video tracks; the
# receiver's process discriminates kind/track via the envelope's
# associatedData / sequence-id metadata and decrypts locally.
# There is no plaintext-receive method on this cap.
interface E2ECallParticipant extends(ChatEndpoint) {
publishCipherAudio @0 (format :CipherStreamFormat) -> (track :CipherOut);
publishCipherVideo @1 (format :CipherStreamFormat,
purpose :VideoPurpose) -> (track :CipherOut);
unpublishAudio @2 () -> ();
unpublishVideo @3 (purpose :VideoPurpose) -> ();
raiseHand @4 (raised :Bool) -> ();
setMyMuteState @5 (muted :Bool) -> ();
leave @6 () -> ();
subscribeCipher @7 (listener :CipherListener,
options :SubscribeOptions)
-> (sub :Subscription);
}
# Note the deliberate absence of setRoutingMode: an E2ECallHost
# cannot select mesh/MCU because chat-server is keyless and can
# only forward.
interface E2ECallHost extends(E2ECallParticipant) {
mute @0 (participantRef :Data) -> ();
unmute @1 (participantRef :Data) -> ();
eject @2 (participantRef :Data) -> ();
# Same delivery pattern as `CallHost.promoteHost`: the new
# `E2ECallHost` cap is delivered to the bound participant via
# CallRosterDelta (`e2eHostGranted :E2ECallHost` arm), not
# returned to the caller.
promoteHost @3 (participantRef :Data) -> (revoker :RolePromotionRevoker);
end @4 () -> ();
}
Key exchange
E2E DM establishment piggybacks on the contact-cap path. The critical invariant: chat-server only ever sees ciphertext.
- Alice’s
Self.contact()produces a contact cap whoseContactInfoincludes Alice’s long-term identity public key (or a fingerprint resolvable through her published profile). Where the contact cap is shared is out-of-band relative to chat-server. - Bob, holding Alice’s contact cap, calls
Self.openE2EDm(contact). chat-server mintsE2EDmPeer(B->A)for Bob (a routing cap with NO key state) and delivers Alice’s sideE2EDmPeer(A->B)to Alice viaSelf.subscribeIncoming(e2eDmOpened :E2EDmPeerarm ofSelfIncomingEvent). - Bob and Alice run a key-exchange handshake (X3DH or similar)
in their own processes. The handshake ciphertexts travel
over the E2E DM channel itself; chat-server is an opaque
carrier. Bob’s
KeyContextis built in Bob’s process from his identityPrivateKeyand Alice’s identity public key; ditto for Alice. Neither key context is ever passed to a chat-server method or stored in a chat-server-minted cap. - After handshake, each side holds a
KeyContextlocally. To send: encrypt(plaintext, KeyContext) -> CipherEnvelope, thenpeer.send(envelope). To receive: peer’s listener deliversCipherEnvelope, the listener’s owning principal calls decrypt(envelope, KeyContext) -> plaintext locally. - Either party may rotate keys by performing a fresh ratchet
step in their own process and exchanging the new ratchet
header through normal
send()– no special method is required because key state never lived on the peer cap. - Out-of-band fingerprint verification compares
peer.remoteFingerprint()(a public-key digest, safe to expose; it is NOT the AEAD secret) with what each side knows from their contact cap.
Why this firewalls plaintext from the host
E2EDmPeer.send(CipherEnvelope)accepts ciphertext only. chat-server has no method to obtain the plaintext or the key context from the peer cap.subscribeCipherdeliversCipherEnvelopeto aCipherListener; decryption happens in the listener’s owning process.openCipherOutproduces aCipherOutthat accepts already- encrypted frames. chat-server forwards them without ever seeing plaintext.- The
KeyContextcap is held client-side, never serialized into a chat-server-minted cap, never passed as an argument to a chat-server method. (This is enforced by thecryptography-and-key-management-proposal.mdKeyContextcap’s transfer policy: not transferable to chat-server.) - E2E calls cannot mix/transcode because chat-server has no
keys. The
E2ECallSurface/E2ECallHostinterfaces simply do not havesetRoutingMode; the SFU-forward-only constraint is a type-level invariant rather than a runtime check.
What stays in vs out of scope here
In scope: end-to-end-encrypted DM voice/video calls. Both
plain DmPeer.callSurface() and E2EDmPeer.callSurface()
return E2ECallSurface. Direct calls between two principals are
end-to-end-encrypted at the media layer regardless of whether
the DM’s text is host-readable: chat-server forwards encrypted
RTP frames (via CipherOut-style tracks), and a DTLS-SRTP-style
key exchange runs between the peers at call start. The
SFU-forward-only constraint is enforced at the type level on
E2ECallSurface (no setRoutingMode).
Out of scope:
- E2E for the text of a regular
DmPeerstays plaintext-aware on chat-server. If you want host-blind text, useE2EDmPeer(which is a distinct cap layer with its ownCipherEnvelope- shaped send/subscribe). - Group E2E (multi-party MLS-style ratcheting). First slice is pairwise only. Group E2E is a future iteration once pairwise is proved.
- Cross-device synchronization (the “I want my E2E messages on a second device” problem). Out of scope.
- Server-side recording or transcoding for E2E media. The substrate is recording-blind everywhere; for E2E media, chat-server cannot mix or transcode anyway because it has no keys – this is a direct consequence, not a separate rule.
Backpressure And Quotas
Hot-path media (audio frames at 50 Hz, video frames at 30 Hz) does not fit on a synchronous request/response model.
- Outgoing audio/video uses
-> streamso the caller can pipeline frame writes without each one waiting for an ACK; the framework applies backpressure when the buffer fills. - Incoming audio/video listener caps publish a bounded ring; when the
consumer falls behind, the substrate drops oldest frames and reports
drop count via
AudioFrameMeta.dropsSinceLast(or equivalent) so the consumer can detect liveness gaps without reconstructing full frame history. - Per-chat quotas live in the chat cap itself (constructed by the hosting service). Per-session quotas live in the broker bundle. Two natural axes: max concurrent subscriptions per kind, max outgoing bandwidth per chat.
- Text history buffering is bounded by the trusted Rust backend’s
AppState; browser view models receive at most the last N events. The chat-cap holder may alsosubscribeTextwith asince(eventId)option to fetch a bounded backlog.
Privacy And Disclosure
Senders are surfaced through ChatInboundEvent.sender. Per
session-bound-invocation-context-proposal.md, the channel server sees the
caller’s opaque session-scoped reference plus freshness; it does not
see raw principal/profile/account fields by default. The chat-server-side
disclosure policy decides whether a sender’s display name, principal
class, or profile class is included in events visible to other
subscribers; default is “display name only”.
The remote CapSet UI’s redacted-transcript export rule applies here too: audio/video metadata (codec, timestamps, frame counts) may appear in transcripts; frame bodies do not.
Migration From The Existing Chat Schema
The current Chat interface (text, poll-based, single struct) stays
callable during the migration. Steps in approximate order:
- Add the listener-cap surface (
subscribeText,TextListener, the newChatInboundEventstruct) alongsidepoll. Keeppollworking. - Migrate the chat-server demo and the per-session chat worker to push
events through the listener cap. Mark
polldeprecated for capnp-rpc clients but keep it for DTO clients during the remote-session transport migration (Remote Session CapSet Clients anddocs/backlog/remote-session-capset-client.mdTask 1). - Add the audio surface (
subscribeAudio,AudioSink,openAudioOut,AudioOut) onceMemoryObject-backed media rings exist. The realtime voice proposal’sVoiceSessionbecomes the browser-side adapter that maps WebRTC tracks into Chat audio subscriptions. - Add the video surface analogously. Video is feasible only after audio is proved end-to-end and the gateway-side WebRTC adapter exists.
- Once all subscribers are listener-cap-driven, remove
pollfrom the substrate-level interface; service-specific shims may keep it.
Each step is a separate iteration with its own QEMU smoke and host-side proof. The first iteration on top of this proposal is the text-only listener-cap rebuild, which is also iteration 4 of the remote-session plan (real Chat panel + cross-session messaging test).
Open Questions
- Per-cap
transfer_policyenforcement at kernel level. TodayCapInfo.transfer_policyis a string field on every cap (values like"stable","session-proxy"); it is descriptive, not enforced. Cap transfer between processes happens via the SQEIPC_TRANSFER_CAPflag, which the kernel implements by copying the cap entry from sender’s CapTable into receiver’s. Today that copy succeeds regardless oftransfer_policy. The substrate’s lineage invariant relies on: the only path for a chat cap to reach a new principal is through chat-server’sinvite/acceptInvite/Self.openDm/etc. methods (which record lineage). But if a principal holdsGroupMember(lobby)and passes that cap as a payload in an SQE to any other service via rawIPC_TRANSFER_CAP, the kernel hands a copy to that service – bypassing chat-server entirely. The lineage tree silently grows a copy with no recorded parent, and chat-server cannot revoke it. The kernel enforcement gap to close: extend SQE cap-transfer dispatch to consulttransfer_policyand reject transfers whose policy class forbids cross-principal copy (chat-class caps would carry such a policy). Sharing then must go through chat-server’s typed methods, which is where lineage gets recorded. Until this gap is closed, the substrate’s lineage invariant is enforced only by convention; no implementation iteration should treat the substrate as hardened without it. - Cross-channel reference of contact caps. This proposal has
contact caps travel “through some channel the principals already
share” – e.g. a contact cap is delivered via a group chat the
giver and recipient both belong to. Chat events therefore need a
way to carry cap references inline (the
datafield onChatOutboundEventplus a typed payload kind, or a separate cap-attachment field on the event). The first iteration may use the existing capnp cap-passing on the outbound event; details belong with iteration 1 schema refinement. - Multi-modal AI agents. When the agent runtime is a Chat
peer, it receives audio frames and emits audio frames. The
agent runner bridges
RealtimeModelSessionto the relevant per-kind chat facets – typicallyGroupMemberfor an agent-prompt group, or aDmPeer/E2EDmPeerif the agent is a DM peer. Should the bridge live in the agent runner (clean) or be a generic adapter cap (RealtimeChatBridge)? The realtime-voice proposal already has the agent runner doing the bridging; this proposal preserves that. - Cross-session media sharing. A chat may have subscribers from
multiple sessions. Does each subscription have its own session-scoped
reference (yes, per
session-bound-invocation-context-proposal.md), and does the chat cap retain owner-session metadata for moderation / kick? Likely yes; details in a follow-up. - Approval queue cap shape. Whether the queue lives on
AuthorityBroker, on a newApprovalQueuecap, or on aNotificationscap that carries approvals as one of its event kinds. Out of scope here; tracked in the approvals follow-up note above. - Voice barge-in semantics with WebRTC. Existing
realtime-voice-agent-shell-proposal.mddefines barge-in withinRealtimeModelSession; mapping that onto the Chat substrate (interrupt the outgoing audio track when apresencetyping event or a fresh inbound audio frame arrives) needs design before the voice iteration.
Relationship To Existing Proposals
realtime-voice-agent-shell-proposal.md—VoiceSessionbecomes the browser-side adapter into the Chat audio surface.RealtimeModelSessionstays unchanged (agent runtime ↔ provider). The agent runner bridges the two when the agent is part of a chat.llm-and-agent-proposal.md— “operator sends a prompt to a running agent” is a Chat text event over a channel the operator already holds (e.g.GroupOwnerof an agent-prompt group the operator created, or a contact cap the agent’s owner shared). “Agent emits a partial response” is a Chat text event withinReplyTo. “Agent requests a tool with consent required” emits anapprovalRefevent referencing anApprovalGrantfrom the existingApprovalClientsurface;ApprovalClientis not used to grant cross-principal write authority – that is always invite- or contact-cap-driven.user-identity-and-policy-proposal.md— the principal model (PrincipalKindincludingservice) is the basis for service principals owning system channels and for chat-server’s bundle and directory-scope predicates that test principal kind/profile.- Remote Session CapSet Clients
— the remote CapSet UI’s “real Chat panel” target (iteration 4 of
the plan) consumes the text-only slice of this substrate first;
audio/video panels are follow-up iterations on the same backend
boundary. The trusted Rust backend in that proposal is also where
the WebRTC peer endpoint and the
/api/chat/webrtc/*signalling endpoint described under “WebRTC Mapping” terminate; chat-server itself never holds a WebRTC handle, a TCP listener, or a TLS context. - Networking — the
/api/chat/webrtc/*signalling endpoint, the redacted-transcript HTTP path, and any future native (non-WebRTC) Chat transport all run on userspace networking caps (NetworkManager/TcpListener/TcpSocket/UdpSocket) handed to the trusted Rust backend, not on chat-server itself. Phase C userspace decomposition of the smoltcp stack is the gating dependency: until that lands, the kernel-resident TCP listener and accepted-socket state described in the networking proposal still front any TCP-shaped Chat transport, including the WebRTC signalling endpoint. - Certificates and TLS
— the browser ↔ backend signalling channel and any future native
Chat-over-TLS transport build their TLS context from the
Certificate/TrustStore/TlsServerContext/TlsClientContextcaps defined there, composed on top of aPrivateKeycap from Cryptography and Key Management. Chat carries reference handles or audit-safe descriptors only; certificate material and TLS keys never reach chat-server. shell-proposal.md—ApprovalClient/ApprovalGrantstay as defined; this proposal references them viaapprovalRef.session-bound-invocation-context-proposal.md— subscription identity is the session-scoped reference; Chat servers honour disclosure scopes.interactive-command-surface-proposal.md— typed command palettes remain a separate concern; a chat may surface a command-palette proposal as a structured message, but the command surface itself is not Chat.browser-capability-proposal.md— if a future browser tab sits inside a Chat-served pane (screen-share scenario), the browser cap rules still apply; Chat carries reference handles, not browser authority.
References
- WebRTC API specifications:
RTCPeerConnection,RTCDataChannel, audio and video tracks, SDP, ICE candidates, DTLS/SRTP. See https://webrtc.org/. - Cap’n Proto streaming RPC (
-> streammethod annotation) and listener-cap patterns: https://capnproto.org/news/2020-04-23-capnproto-0.8.html (introduces flow control), and the capnp Rust crate at v0.25 used in this repository. - Existing capOS proposals as cross-referenced above.
Proposal: Realtime Voice Agent Shell
How capOS should support web-shell and native-shell voice interaction when modern multimodal models can consume realtime audio and emit both audio streams and structured tool calls.
Problem
The existing language-model proposal defines a text-oriented agent runner: messages, streamed text, structured tool calls, and per-tool permission policy. That model still works, but it is incomplete for modern voice agents. Current provider APIs can run stateful realtime sessions where the model directly listens to audio, speaks audio, performs VAD/barge-in handling, and emits function calls in the same interaction.
If capOS models voice as only “ASR into text shell, then TTS the answer,” it will miss the better latency and interaction model of native realtime audio. If capOS lets provider-native sessions execute tools directly, it breaks the capability model. The design needs a middle path.
Goals
- Support native realtime audio model sessions alongside chained ASR/text/TTS pipelines.
- Preserve the existing agent-shell security rule: the model never holds session caps or tool caps.
- Let
WebShellGatewayhost terminal and voice transport without becoming an authority sink. - Keep microphone/speaker media out of
TerminalSessiontext APIs. - Minimize and guarantee media stack latency for admitted capOS-controlled realtime islands, preferring enforceable bounds over optimistic nominal latency.
- Support provider adapters for OpenAI Realtime, Gemini Live API, Vertex AI Live API, local ASR/TTS, and future local realtime multimodal models.
- Carry timestamps, deadlines, transcripts, interruptions, and tool-call ids as first-class session data.
- Make direct browser-to-provider media an optional optimization guarded by broker-minted ephemeral credentials.
- Allow a browser agent to be the web-shell UI and orchestrate the realtime provider loop, while keeping capOS tool execution gateway-enforced.
Non-Goals
- Implementing provider SDKs in the kernel.
- Giving a browser any capOS capability handle.
- Treating voice recognition, wake words, or VAD as authorization.
- Making a realtime model’s free-form speech or text executable.
- Guaranteeing full-path realtime behavior for browser, network, or remote provider segments. Native local media can enter guaranteed realtime islands only after scheduling contexts and device isolation mature.
Architecture
flowchart LR
Browser[Browser UI] -->|terminal frames| Gateway[WebShellGateway]
Browser -->|mic/playback frames| Gateway
Gateway --> Terminal[TerminalSession]
Gateway --> Voice[VoiceSession]
Shell[capos-shell agent mode] --> Terminal
Shell --> Voice
Shell --> Runner[Agent Runner]
Runner --> RT[RealtimeModelSession]
Runner --> Broker[AuthorityBroker]
Runner --> Audit[AuditLog]
Runner --> Tools[Session tool caps]
RT --> Provider[Realtime provider adapter]
Provider --> Remote[OpenAI / Gemini / Vertex]
Provider --> Local[Local model backend]
Principal split:
WebShellGatewayauthenticates browser sessions, owns browser transport, creates terminal and voice session objects, and tears down resources.capos-shellin agent mode owns the session bundle and acts as the trusted runner for capOS-side agent sessions.- A browser agent UI may own the web conversation and provider session loop,
but only as an untrusted client of
WebShellGateway’s tool proxy. RealtimeModelSessionis a model I/O object. It carries audio, text, transcripts, tool calls, and tool results. It has no authority over capOS tools.- Provider adapters hold narrow provider credentials or model-runtime caps.
- The browser holds no capOS session caps, no tool caps, no provider long-lived API keys, and no bearer tokens other than short-lived provider-scoped tokens when a direct-media optimization is explicitly enabled.
Interfaces
The exact schema belongs to the implementation milestone. The shape should be:
interface RealtimeModel {
info @0 () -> (info :RealtimeModelInfo);
open @1 (config :RealtimeSessionConfig)
-> (session :RealtimeModelSession);
}
interface RealtimeModelSession {
send @0 (event :RealtimeInputEvent) -> ();
next @1 () -> (event :RealtimeOutputEvent, done :Bool);
sendToolResult @2 (result :RealtimeToolResult) -> ();
cancel @3 (reason :CancelReason) -> ();
close @4 () -> ();
}
RealtimeInputEvent should cover:
- audio frame reference;
- text input;
- image/video frame reference;
- push-to-talk start/end;
- playback-position feedback;
- tool result;
- cancel, truncate, close.
RealtimeOutputEvent should cover:
- audio frame reference;
- text delta;
- partial and final transcript;
- tool call delta and complete tool call;
- interruption/barge-in;
- session warning/error;
- provider usage/cost metadata;
- close/go-away/reconnect notice.
Audio frames should not be copied through Cap’n Proto payloads in the hot path.
Use MemoryObject-backed media rings or provider-owned stream handles. Cap’n
Proto remains the control plane.
Tool Calls
Realtime tool calls use the same policy as text agent calls.
sequenceDiagram
participant Model as RealtimeModelSession
participant Runner as Agent Runner
participant Broker as AuthorityBroker
participant Tool as Typed Tool Cap
participant Audit as AuditLog
Model->>Runner: tool_call(name, args, provider_call_id)
Runner->>Runner: validate ToolDescriptor
Runner->>Broker: authorize tool call
Broker-->>Runner: auto / consent / stepUp / forbidden
Runner->>Tool: invoke if allowed
Tool-->>Runner: typed result
Runner->>Audit: record decision and outcome
Runner->>Model: tool result
The runner owns the mapping from provider call ids to capOS audit/tool-call
ids in capOS-side mode. In browser-agent UI mode, WebShellGateway’s tool
proxy owns that mapping. Provider ids are useful correlation metadata, but
they are not authority.
Tool execution must be time-boxed. If a tool blocks too long, the runner or gateway tool proxy sends a typed timeout result back to the realtime model and continues or ends the turn according to policy.
Voice Session
VoiceSession is the shell-facing media session object created by
WebShellGateway or a native terminal host.
interface VoiceSession {
describe @0 () -> (info :VoiceSessionInfo);
openCapture @1 (format :AudioFormat) -> (stream :AudioInputStream);
openPlayback @2 (format :AudioFormat) -> (stream :AudioOutputStream);
event @3 () -> (event :VoiceSessionEvent);
close @4 () -> ();
}
For web shell, VoiceSession is backed by browser media APIs. For native capOS
it can be backed by an audio device service. Either way, it is separate from
TerminalSession:
- terminal input/output remains text and presentation;
- voice capture/playback is timestamped binary media;
- transcripts can be rendered into the terminal, but they are not terminal input until the runner accepts them as a user turn.
Media Graph
The local media graph is a userspace service/library layer, not a kernel feature. Its latency goal is the lowest guaranteed-stable operating point for the selected device, graph, and policy: a fixed quantum with admitted CPU, memory, device, and wakeup budgets, not the smallest buffer value that can be configured.
flowchart LR
Capture[Capture source] --> Convert[format converter / resampler]
Convert --> Gate[VAD or push-to-talk gate]
Gate --> Input[realtime provider adapter or local ASR]
Input --> Runner[agent runner]
Runner --> Output[realtime provider adapter or local TTS]
Output --> Playback[playback sink]
For browser voice, the graph may partly live in browser JavaScript and partly
in capOS services. For native hardware, the graph eventually uses audio driver
services that hold DeviceMmio, DMAPool, and Interrupt capabilities.
Graph control operations are ordinary endpoint calls:
- create node;
- connect port;
- set format;
- allocate buffer pool;
- start/stop stream;
- set deadline and latency policy.
Graph data uses MemoryObject pools and notification/futex wakeups. Audio
frames carry:
sequence
capture_time_ns
playback_time_ns
deadline_ns
format
offset
length
flags
The realtime data path should not perform allocation, blocking IPC, logging, permission checks, provider credential work, or graph mutation. Those remain control-plane operations. Any bridge that crosses process, clock, network, provider, or browser boundaries must declare its extra latency so the graph can report the full stack rather than burying delay in queues. A non-guaranteed bridge must not backpressure a guaranteed island; it must drop, silence, bypass, stop, or renegotiate.
WebShellGateway Modes
Gateway-Mediated Provider Session
flowchart LR
Browser[Browser] <--> Gateway[WebShellGateway]
Gateway <--> Adapter[ProviderAdapter]
Adapter <--> Provider[Provider API]
Properties:
- provider long-lived credentials remain server-side;
- tool-call events remain server-side unless explicitly proxied to a browser agent UI under broker policy;
- gateway can record/drop/rate-limit media;
- easier audit and teardown;
- higher latency because audio crosses the gateway.
This is the baseline mode.
Direct Browser Provider Media
flowchart LR
Browser[Browser] <--> Provider[Provider API]
Browser <--> Gateway[WebShellGateway control/audit path]
Properties:
- lower media latency;
- browser receives provider-specific ephemeral credential;
- gateway may not see every media frame or provider control event;
- allowed only when broker policy says direct media is acceptable;
- provider tool declarations are disabled unless either a trusted server-side
control channel handles tool calls and results, or the session is explicitly
in browser-agent UI mode and every tool call is routed through
WebShellGateway’s server-side tool proxy.
Direct mode requires:
- provider token scoped to model/config/session;
- short expiration;
- no capOS capability material in the token;
- provider tools disabled, provider-supported server-side receipt of tool
calls plus server-side submission of tool results, or browser-agent UI mode
where JavaScript receives provider tool calls but can only send structured
ToolRequestvalues toWebShellGateway; - trusted revocation or session close path; if the provider exposes only a browser-held connection, the kill switch is best-effort and must not be described as authoritative;
- audit that records direct-media mode, token issuance metadata, disabled tool status, and any uninspected media/control scope;
- fallback to gateway-mediated mode.
Browser Agent UI Direct Provider Session
This mode is distinct from merely moving media off the gateway. The browser agent is the UI: it owns the visible conversation, calls the realtime provider with an ephemeral credential, receives provider tool-call events, and feeds tool results back to the provider. It still does not receive capOS caps.
flowchart LR
BrowserAgent[Browser Agent UI] <--> Provider[Provider API]
BrowserAgent -->|ToolRequest| Gateway[WebShellGateway ToolProxy]
Gateway --> Broker[AuthorityBroker]
Gateway --> Tools[Session tool caps]
Gateway --> Audit[AuditLog]
Gateway -. "ToolResult" .-> BrowserAgent
Rules:
- the browser credential is scoped to provider, model/config, session, conversation, media mode, and short expiration;
- the gateway publishes a signed or MACed tool descriptor snapshot for the current turn;
- browser tool requests must carry the descriptor snapshot id, provider call id, conversation id, turn id, and typed arguments;
- gateway rejects stale snapshots, replay, unknown tools, schema mismatches, missing consent, missing step-up, and requests after session teardown;
- gateway performs all real capOS capability invocations server-side and records that the request was browser-agent-proposed;
- broker policy may deny browser-agent UI mode when prompt, transcript, media, or tool-result confidentiality requires capOS-side provider mediation.
This is lower latency and can use provider-native browser APIs, but it gives up gateway inspection of some media/control frames. Audit must record that fact instead of implying full gateway mediation.
Realtime Provider Adapter
A provider adapter is a normal service process. It should expose
RealtimeModel, not provider-specific credentials.
OpenAI adapter:
- uses WebRTC for browser direct mode or WebSocket for server-side mode;
- maps provider function-call events either to server-side capOS
RealtimeToolCallvalues or to browser-agentToolRequestforwarding; - maps
function_call_outputtoRealtimeToolResult; - handles response cancellation and output-audio truncation.
Gemini developer adapter:
- uses Live API WebSocket;
- supports ephemeral-token direct mode when broker policy allows;
- maps
FunctionResponsetoRealtimeToolResult; - models synchronous and non-blocking function-call behavior explicitly.
Vertex adapter:
- uses cloud auth and Vertex AI Live API;
- exposes deployment metadata such as project/location/model id;
- respects enterprise logging, quota, and provisioned-throughput policy;
- should not leak Google credentials to browser or shell.
Local adapter:
- may start as ASR plus text model plus TTS;
- can later become native realtime audio if a local model supports it;
- keeps all media on-device and is the correct anonymous/guest fallback.
Scheduling And Deadlines
Web shell and remote-provider voice need bounded soft realtime. Native local voice can use guaranteed realtime islands once scheduling contexts exist:
- Capture frames older than their deadline should be dropped.
- Playback frames that miss the output deadline should be skipped or replaced with silence.
- Barge-in should cancel model output promptly.
- Tool calls should not block capture/playback loops.
- The terminal path must remain responsive under model or provider stalls.
Future scheduling contexts should represent:
voice-capture budget/period
provider-adapter budget/period
agent-runner interactive priority
playback budget/period
SQE-level deadlines are useful metadata for stale request handling, but they do not create CPU budget. A provider adapter may reject or drop stale media frames using deadlines before the scheduler grows true budget enforcement. Native media graph scheduling should eventually map graph quantum to scheduling period and per-node CPU budget. Web shell and remote providers cannot provide a capOS guarantee across the full path, so their jitter must be measured and surfaced separately from the local guaranteed island latency.
The general realtime scheduling model is tracked in
Tickless and Realtime Scheduling:
SQE.deadline_ns is request freshness metadata for stale frame/tool handling,
while SchedulingContext carries CPU-time authority and RealtimeIsland
admits the local media graph. Voice paths must not treat deadline metadata as a
budget reservation.
Consent And Voice Confirmation
Voice can participate in consent UX, but it is not sufficient for strong authorization.
Rules:
- Read-only tools may run automatically if broker policy allows.
- Mutating tools need explicit consent; spoken “yes” can satisfy only low-risk consent when the user is already authenticated and the prompt context is active.
- Destructive tools require
stepUp; WebAuthn/passkey is the likely web-shell path. - Wake words, speaker identity estimates, VAD, and ASR confidence are never authentication factors.
- The spoken confirmation transcript and confidence are audit data.
Security Invariants
- Browser never receives capOS caps.
- Model services never receive session caps.
- Provider adapters never receive broad process-spawn or terminal authority.
- Free-form model text and speech are never parsed as commands.
- Tool calls are structured values and must match advertised descriptors.
- Provider credentials are caps or service-private secrets, never transcript text or terminal output.
- Browser-held provider credentials are short-lived, provider-scoped, and contain no capOS capability material.
- Voice transcripts are untrusted user input until the runner or gateway accepts them.
- Prompt-injection rules from the text agent apply unchanged to transcripts, web results, tool results, and model-generated speech.
- On logout, tab close, timeout, shell exit, or failed auth, the gateway closes terminal, voice, pending tool consent, and server-side model streams. For browser-held provider sessions, gateway teardown authoritatively ends capOS tool execution and rejects future tool requests; provider session revocation is authoritative only when the provider exposes a server-side close API, otherwise it is best-effort and must be audited as such.
Interaction Examples
Low-Risk Read
user speaks: "what services are running?"
model emits tool_call(systemStatus.list, {})
runner policy: auto
runner executes status cap
runner sends tool result
model speaks summary and emits text transcript
Mutating Action
user speaks: "restart the network stack"
model emits tool_call(service.restart, {"name":"net-stack"})
runner policy: consent
gateway renders and speaks confirmation prompt
user says: "yes"
runner executes restart
runner audits transcript, consent, tool args, result
model speaks outcome
Barge-In
model speaking long answer
user starts speaking
VoiceSession emits bargeIn
runner cancels provider output
provider adapter truncates unplayed audio if supported
new user audio starts a new turn
Implementation Sequence
- Document and freeze
RealtimeModelSessionandVoiceSessionschemas. - Add a fake local provider adapter using text-only model responses and synthetic audio events so the shell/gateway state machine can be tested without provider credentials.
- Extend
WebShellGatewayprotocol with a voice side channel and lifecycle events, still with no direct provider media. - Implement chained local ASR/text/TTS adapter or browser-ASR demo shim for the first visible voice shell proof.
- Add provider adapter for one remote realtime API behind broker-issued model caps and server-side credentials.
- Add direct browser provider media only after ephemeral-token minting, teardown, and audit are proven in gateway-mediated mode.
- Add browser-agent UI mode after the WebShellGateway tool proxy can bind descriptor snapshots, enforce consent/step-up server-side, reject replay, and audit browser-agent-proposed tool requests.
- Add media-ring deadlines and underrun/drop telemetry.
- Later, bind media and provider loops to scheduling contexts once scheduler policy exists.
Open Questions
- Does
VoiceSessionbelong to the terminal host family or the media graph service family? - Should provider adapters expose raw provider events for diagnostics behind a privileged debug cap?
- Should a model be allowed to continue speaking while a non-blocking tool is pending, or should capOS pause speech at every tool-call boundary by default?
- How should cross-provider tool-call deltas be normalized when providers emit partial arguments differently?
- Which mode is acceptable for operator web shell by default: gateway-mediated, direct provider media, browser-agent UI, or broker policy dependent?
- Should model-output audio be stored in audit, summarized, or only referenced by transcript and provider event ids?
- How should media graph buffer quotas interact with session quotas and future resource donation?
Relationship To Existing Proposals
- Language Models and Agent Runtime: this proposal
is the realtime multimodal sibling of the text
LanguageModel/AgentSessioninterfaces defined there.RealtimeModelSessionplugs into the same agent runner, reuses the sameToolDescriptor/AuthorityBroker/AuditLogboundary, and follows the same browser-agent UI versus gateway-enforced tool execution split. The per-tool permission modes (auto/consent/stepUp/forbidden) defined for the text agent apply unchanged here; voice does not introduce a new authority layer. - Native Shell and POSIX Shell:
capos-shellin agent mode is the trusted runner referenced throughout this proposal. It holds the session caps, exposes typedToolDescriptorvalues toRealtimeModelSession, executes admitted tool calls, and stays the authority surface for capOS-side voice agents. Browser-agent UI mode does not replace it; it proxies throughWebShellGatewayback into the same shell-owned authority. - Chat As Multimedia Substrate:
the operator-facing voice surface (operator talks to a running agent;
agent speaks back) is a
Chatchannel with audio subscriptions.VoiceSessionbecomes the browser-side adapter that maps WebRTC audio tracks into Chat audio subscriptions;RealtimeModelSessionstays as defined here for the agent-runtime ↔ provider link; the agent runner bridges the two. Chat is the operator-visible transport; this proposal defines the model-side session that consumes/produces media through it. - Multimedia Pipeline Latency: gives the local media graph its guaranteed-stable latency goal, realtime-island admission model, PipeWire/JACK grounding, and telemetry requirements.
- Boot to Shell: WebShellGateway remains the web entry point and session authority boundary.
- Interactive Command Surfaces: voice transcripts can invoke command sessions only through typed command descriptors, not free-form shell text.
- Browser/WASM: direct browser media and browser-agent UI resemble the existing host-backed capability pattern, but real capOS tool execution must remain gateway-mediated.
- GPU Capability: local realtime models may later need GPU/NPU sessions, but the interface should not expose accelerator details to agent-shell.
- Formal MAC/MIC: remote realtime provider use must be denied when session confidentiality labels forbid off-device media.
References
- Realtime multimodal agent APIs research
- Multimedia pipeline latency research
- OpenAI, Voice agents
- OpenAI, Realtime conversations
- Google AI for Developers, Gemini Live API overview
- Google AI for Developers, Tool use with Live API
- Google Cloud Vertex AI, Gemini Live API overview
Proposal: Aurelian Frontier
Design for the Aurelian Frontier game: a capability-native, persistent-world
RPG set on the imperial frontier of an original late-imperial fantasy setting.
The current shell-spawned adventure-client artifact remains the deterministic
proof slice for the capability system; this proposal describes the game it
grows into. Both purposes coexist: the QEMU smoke transcript stays stable, and
the design supports a long-running campaign with authoritative shared world
state, multiplayer parties, durable profiles, and audited public history.
Current State
The existing artifact proves useful plumbing:
capos-shelllaunchesadventure-clientas an ordinary child process.- The shell grants explicit
StdIO,Adventure, andChatendpoint clients. adventure-serverowns per-player room and inventory state keyed by the endpoint caller-session scoped reference and epoch; normal shell launch syntax omits legacy receiver selectors, and remaining explicit selector fixtures are not the adventure player identity model.chat-servercarries room-local events and simple NPC process output.- Focused
make run-adventuretranscript coverage proves launch, movement, item pickup, inventory, chat, and process exit; the residentadventure-scenario-testprocess covers complex custody logic through directAdventurecap calls.
As a game, the playable surface is still narrow: one mission, one party of
named actors, one tactical encounter. The Aurelian frontier map has replaced
the four-room cellar prototype. The first mission recovers eagle-standard
from signal_tower, uses Maro route evidence, Livia ward delegation, survivor
evacuation, Iunia witness-certified custody, and gate sealing, and keeps the
read-only site graph, mission text, aliases, objectives, metadata, and proof
path in CUE-authored content with checked-in generated Rust output and
freshness verification. Inventory and status now distinguish physical items,
writs, relic custody, marks, evidence, generated calendar/regional/construction
metadata, and disabled-by-default optional fake-agent NPC budget metadata.
Commit 4045576 at 2026-04-30 08:56 UTC added generated calendar event
metadata for an active lantern-vigil festival and a later road-muster
military event, surfaced as status metadata only; actor movement, event-driven shop mutation,
witness blocking, route mutation, debrief branching, quests, gifts, and
affection remain future work. Direct NPC game authority remains future work.
Commit 64933131 at 2026-04-30 13:09 UTC added the first bounded seasonal
shop-stock mutation: post-debrief quartermaster field-rations buys spend
audited Aurelian standing, record service-owned per-expedition seasonal stock
usage, add the ration to inventory, and stay bounded by pure active-stock, standing-gate,
remaining-stock, and depletion checks.
Room chat history is service persistence rather than player-owned adventure
state. The agent NPC foundation is deterministic quota/refusal metadata and
pure fake-model logic only. Commit
c6d887 at 2026-04-30 08:22 UTC extended that fake-agent surface to
personal routines, nonbinding shop negotiation flavor, and festival reactions
as dialogue/proposed-action data with no authority mutation. Live LLM calls,
hosted-agent services, durable NPC memory, autonomous NPC actions, trade
commits, festival rewards, and quest mutation are not implemented. Commit
6605ee6a at 2026-04-30 13:39 UTC added the bounded regional market delivery
proof: fresh committed field-ration receipts deliver the committed quantity
into player expedition inventory, while commit replay and errors do not
duplicate delivery; NPC stores, outpost stock, currency, durable ledgers,
profile balances, and crash recovery remain future work. Commit b1c98eb1 at
2026-04-30 14:15 UTC bounded ordinary inventory admission for room takes,
seasonal harvests, quartermaster field-ration purchases, and regional market
delivery; regional delivery now fails closed when the full committed quantity
cannot fit and stays replayable after ordinary items are dropped. Commit
f06aa732 at 2026-04-30 14:51 UTC kept that capacity replay proof on
authored/generated resources, set the current ordinary inventory capacity to
six slots, kept transfer on the same capacity helper, and proved held regional
delivery plus later full replay delivery through the real scenario process.
Commit fd432147 at 2026-04-30 15:14 UTC added the bounded player-local
currency side of that regional market proof: fresh committed field-ration buys
spend two Aurelian chits exactly once, insufficient balances are denied before
transaction mutation, inventory shows the player-local chit balance, and held
delivery replay does not spend again. NPC stores, outpost stock, durable
currency ledgers, profile balances, fees, expiry advancement, and crash
recovery remain future work.
Commit 7a9a4af5 at 2026-04-30 15:53 UTC added the bounded seller-outpost
side of that regional market proof: fresh committed field-ration buys
decrement ash_farm stock from six to two exactly once, insufficient seller
stock is denied before transaction, currency, delivery, or stock mutation,
status shows the stock line, and committed replay plus held delivery replay do
not decrement again. NPC stores, broader outpost inventories, durable stock
ledgers, durable currency ledgers, profile balances, fees, expiry advancement,
and crash recovery remain future work.
Commit 00b18598 at 2026-04-30 16:23 UTC added bounded service-owned
regional market fee accrual for the same proof: fresh committed field-ration
buys accrue the generated buy and sell order fees into a regional-market pool
exactly once, status shows the fee line, release/no-cross and non-ration facts
do not accrue fees, and committed replay plus held delivery replay do not
accrue again.
Commit bdcc23ed at 2026-04-30 16:57 UTC added bounded service-owned
seller-outpost proceeds for the same proof: fresh committed field-ration buys
credit ash_farm two proceeds chits exactly once, status shows the seller
proceeds line, release/no-cross, stale, mismatched, and non-ration facts do not
credit proceeds, and committed replay plus held delivery replay do not credit
again. NPC stores, broader outpost inventories, durable stock and currency
ledgers, durable seller-proceeds ledgers, profile balances, durable fee
ledgers, expiry advancement, and crash recovery remain future work.
Commit 29c065a9 at 2026-04-30 17:41 UTC added bounded regional market
order expiry for live matching and reserve. The fixed-smoke day keeps the
field-ration proof active, while the real scenario process proves a day-73
expired field-ration reserve releases without status, inventory, currency,
outpost stock, fee, seller-proceeds, or delivery mutation. Durable calendar
advancement, durable order books, profile ledgers, durable fee ledgers, and
crash recovery remain future work.
Commit 205fd6a0 at 2026-04-30 18:40 UTC added bounded service-owned
regional-market fee withdrawal for the same proof. adventure-content owns
the deterministic withdrawal resolver from current fee pool, applied
withdrawal ids, and treasury balance;
adventure-server owns the live fee pool, applied withdrawal ids, and treasury
balance in PlayerState; and the real scenario process proves sell withdraw-fees to regional-market moves the two accrued fee chits once,
replays without withdrawing twice, and leaves inventory, currency, outpost
stock, seller proceeds, and delivery state untouched.
Commit a547db3d at 2026-04-30 19:43 UTC added the bounded regional-market
receipt snapshot/restore proof:
adventure-content reconstructs RegionalMarketTransactionState from ordered
receipt facts with capacity, sequence, malformed reserved fact, missing
reservation, mismatched terminal fact, and overlapping open-reservation
rejections; adventure-server exposes
buy receipt-snapshot from regional-market to clone live receipts,
restore a separate state, replay the old field-ration commit, and prove replay
success without live market, inventory, fee, treasury,
seller-proceeds, stock, or delivery-id mutation. This is not durable restart
loading or a general persistence layer.
Commit 4b44b32 at 2026-04-30 20:07 UTC added the bounded regional-market
settlement snapshot-view proof. adventure-content checks the applied
delivery, currency debit, outpost stock decrement, fee accrual, fee
withdrawal, and seller proceeds ids plus the settlement balances, rejects
over-capacity id snapshots, and replays the committed field-ration fact and
fee withdrawal as already applied. adventure-server exposes
buy settlement-snapshot from regional-market, and the scenario proof verifies
the success text without live status or inventory mutation. This is still a
bounded crash-recovery primitive, not durable restart loading or broad economy
persistence.
The bounded construction-job receipt snapshot branch is scoped to pure Rust
construction receipt snapshot semantics plus a size-constrained QEMU
no-mutation probe. Pure adventure-content tests reconstruct a separate
ConstructionJobState from ordered construction facts, reject over-capacity,
out-of-order, malformed reservation, missing-terminal, mismatched-terminal,
overlapping-open, and non-closed snapshot shapes, and preserve standalone
released facts from failed material reservations as closed replayable
outcomes. After the old completed field-repair job, the field engineer’s
repair receipt-snapshot command only checks status/inventory stability and
confirms live construction state and material stock are unchanged. The runtime
command is not a proof that receipts replay into the live construction service,
and this is not durable restart loading or a general construction persistence
layer.
Commit
f149119 at 2026-04-29 09:09 UTC landed the
pure targeted-combat foundation in adventure-content: deterministic combat
zones, damage kinds, attack and mob profiles, bounded zone damage, fatigue,
interruption, recognition, and alert propagation. The Aurelian adventure server
now consumes generated mob combat profiles, and the client, scenario proof, and
QEMU smoke exercise explicit target-zone attack, skill, and cast commands.
Commit f4a7fdb at 2026-04-29 18:07 UTC landed the first bounded
authority-combat verb: challenge-authority and the text alias
challenge authority <target> let an accepted ward-writ attack the
ward-wraith’s hostile ward authority instead of hp, with real scenario and
shell-smoke coverage for wrong-target, missing-authority, success, and alias
paths. Durable alert groups, broader authority-combat verbs beyond that first
ward-wraith slice, and broad weapon handling remain open.
Subsequent phases keep the same capability architecture and grow the playable surface: more missions, more sites, durable profiles, persistent shared world state, party multiplayer, lawful PvP, an authoritative ledger of public history, and a richer presentation client. The deterministic transcript stays the proof gate; the game is what those transcripts increasingly exercise.
Setting
Working title: The Ninth Gate of Aurelian.
The game is set in an original late-imperial fantasy frontier: a Roman-like empire holds a fortified border against a hostile magical domain beyond a chain of unstable gates. Its frontier forces are mixed cohorts of shield soldiers, oath-bound magical warriors, field wizards, scouts, engineers, priests, and contracted hunters. Their authority is formal, audited, and revocable: orders, route rights, gate keys, supply access, spell licenses, and relic custody are all concrete grants.
The setting should emphasize duty, command culture, dangerous expeditionary work, practical battlefield magic, and rivalry between imperial command, temple witnesses, guild contractors, and licensed war mages. Those are broad genre ingredients. The capOS version must keep its empire, factions, locations, artifacts, NPCs, magic vocabulary, and plot original to capOS.
Core Fantasy
The player is a junior imperial operator assigned to a frontier gate-fort after a failed expedition. They are not a chosen-one archmage. They are useful because they can hold and delegate unusual authority safely: command passes, ward keys, evidence seals, squad orders, relic handles, and restricted gate routes.
The fantasy is not “collect three objects in four rooms” and it is not permission management with fantasy labels. The target fantasy is:
I gain rare authority, use it to enter forbidden places, command trusted
agents, bind dangerous relics, expose corrupt actors, and reshape the
frontier.
That means:
- receive a narrow mission grant,
- choose writs, companions, relics, and supplies,
- inspect a dangerous frontier site,
- discover conflicts between military, temple, guild, rebel, and wizard authorities,
- fight, negotiate, delegate, expose, or revoke instead of only attacking,
- route capabilities to allies who can act on them,
- survive magical incidents with limited tools,
- return with proof, prisoners, recovered relics, or a sealed gate.
That maps cleanly to capOS. Every interesting magical or political permission can be represented as a capability, and the player experience can expose the same idea through in-world language.
The game shape is a compact expedition RPG first:
accept mission
choose writs / companions / relics
enter dangerous site
discover authority conflicts
fight / negotiate / delegate / revoke
extract with loot, survivors, evidence, or consequences
upgrade rank, base, companions, and future authority
Every major RPG system should answer one question: how does this change what the player is lawfully, socially, or supernaturally able to do? If a feature only adds generic RPG numbers, cut it or make it authority-native.
Design Goals
- Make the first ten minutes of any session engaging, and leave room for a long-running campaign across many sessions, profiles, and parties.
- Keep all game verbs backed by typed service calls.
- Use NPCs as processes with explicit caps, not scripted text pasted into the client.
- Make authority visible: the player should understand what they are allowed to do and why.
- Make revocation and delegation part of play without turning the UI into an OS lecture.
- Make the next useful command discoverable from room text, status text, NPC advice, or command completions instead of requiring source reading.
- Keep exact object and actor ids player-facing and stable, with aliases only as convenience paths to the canonical ids.
- Preserve deterministic QEMU transcript coverage for proof slices, even as the wider game grows seeded variation, persistent world state, and multiplayer.
- Demonstrate that the capability model is usable from more than one application language. Rust and Lua game code should both operate through typed caps; neither language should receive ambient authority.
- Treat the persistent shared world as a real product surface: profiles, ledgers, expedition checkpoints, faction history, market state, and contributor evidence all live in capability-bounded services with authoritative server ownership.
Non-Goals
- A parser-combinator text adventure.
- Random combat outcomes inside the deterministic QEMU smoke proof. Variance in normal play is fine; transcript-critical paths must remain reproducible under a fixed mission seed.
- Copyright-compatible retelling of any source novel.
- Kernel-side game state. The kernel enforces capability authority; mission, profile, ledger, and world state belong in userspace services.
- A user-owned save blob as authority over public world facts. User-owned Drive/Firebase capsules may back up private profile and explicit expedition state, but ledger records, multiplayer outcomes, market receipts, and contributor rewards remain server-authoritative.
Intuitiveness Layer
The current artifact exposes the right command categories but makes the player do too much vocabulary discovery. The next phase should treat every room view, inspection result, and failure message as part of the control surface.
Principles:
- Always print canonical ids for objects, actors, mobs, writs, and exits.
- Accept forgiving aliases such as
livia,Livia, andmagisteronly after resolving them to one canonical id in the response text. - When a command nearly matches a known id, return a suggestion:
No broker offers ward here. Did you mean request ward-writ? - When an actor cannot execute an order, list one or two valid tasks if the
player has enough information:
Livia cannot execute guard. With ward-writ delegated, try order livia to dispel-sigil. lookshould show the current objective, visible interactables, present actors, hostile mobs, and one short “lead” line when the player is stuck.statusshould separate survival state from mission state so players can scan it quickly:
Status: hp 12/14, guard 5, fatigue 0
Mission: expose the tower sigil, defeat ward-wraith, seal gate
Held: ward-writ accepted
Delegated: ward-writ -> livia
Lead: Livia needs line of sight; stand in the signal tower and order dispel-sigil.
Failure text should preserve causality. A refusal should say whether the missing piece is location, knowledge, authority, inventory, rank, cooldown, or target state. That makes the game feel fair and also teaches the capability model without naming kernel concepts.
Denial should usually reward the player with a lead. A blocked action can reveal a missing witness, unknown jurisdiction, forged seal, rival grant, corrupt actor, unsafe state, rank gate, or alternate route:
The gate refuses your route grant. It names the tower road, not the aqueduct.
Mira notices an old witness mark beside the lock.
Secrets and incomplete jurisdiction knowledge are core RPG fuel. The player should often discover that they are blocked because they do not yet understand who has authority here, not because they lack a generic key.
The current text parser can implement the first half of this through canonical
ids, aliases, and suggestions. The later CommandSession path should expose
the same hints as dynamic completions rather than duplicating a parser.
Scripting And NPC Brains
The kernel capability model should enforce authority. The adventure service, Lua scripts, NPC processes, and any later agent runners are ordinary userspace clients of that model. They should hold only the caps their role needs and exercise world mutation through typed service calls.
Rust should remain the default for core service code, bounded simulation, and proof-critical state transitions. Once Lua Scripting exists, Lua is a good fit for deterministic scenario glue: mission beats, dialogue state machines, quest-board text, debrief variants, and scripted NPC reactions. Lua scripts should receive narrow host APIs and object caps; they should not receive raw cap IDs, broad spawn authority, or a way to bypass Rust service validation.
This is useful for the game because it proves the capability model is not a Rust-only convention. A transcript can show a Rust service and a Lua-scripted NPC both using typed authority correctly, including one denied ungranted path.
For NPC behavior that does not need deterministic transcript output, the
Language Models and Agent Runtime design is the
better fit. LLM-backed NPCs can provide tavern chatter, optional hints, flavor
summaries, or reactive dialogue, but their output should be treated as data.
They must not decide mission-critical authority, relic custody, combat damage,
or policy denials, and they should not sit on the main QEMU proof path unless
served by a deterministic stub. The model/embedder/agent-runner capabilities,
per-tool permission modes (auto/consent/stepUp/forbidden), and budget plumbing
described in
Language Models and Agent Runtime are the upstream surface
the Phase 11d fake-agent budget metadata foreshadows; the live-agent path must
attach those typed caps to an AdventureNpc facet rather than handing ambient
LLM authority to a chat process.
Long-lived or agent-controlled NPCs should also inherit the hosted-agent
harness constraints from
capOS-Hosted Agent Swarms. An NPC agent is a
task-like process with a workspace/memory scope, advertised tools, audit, and
budget, not an ambient identity. The adventure-specific budget should include
per-NPC, per-session, and per-game-day token quotas, tool-call quotas,
cooldowns, and model profiles. When quota, fatigue, sleep schedule, or policy
blocks an answer, the NPC should refuse in-world, for example:
I'm tired. Going to sleep. That refusal is part of gameplay state and audit,
not a transport failure. Any memory/reflection output from the agent remains
low-authority data until compiled into deterministic content or service-owned
state through reviewed rules.
Player Loop
The 30-second loop is one meaningful command and one consequence:
- Explore, inspect, fight, negotiate, unlock, delegate, or extract.
- Receive at least one result: loot, map knowledge, danger, faction change, clue, shortcut, companion reaction, or new authority state.
- Re-evaluate the site with better or worse information.
The 10-minute loop is a writ-backed expedition:
- Briefing: an officer grants a mission writ and one restricted gate route.
- Preparation: the player chooses route, companion, relic loadout, and helper
authorities such as
ward-writ,scout-order,medic-token, orrelic-seal. - Expedition: the player enters two or three connected frontier locations.
- Encounter: a surprise reveals an authority conflict, enemy trick, secret jurisdiction fact, companion risk, or faction demand.
- Consequence: NPCs respond, a route opens or closes, and logs/evidence update.
- Extraction: the player returns with survivors, relics, evidence, scars, or consequences.
The multi-session loop is frontier reshaping:
- Increase rank and unlock new jurisdictions.
- Upgrade the base, temple, archive, court, and other authority modules.
- Build faction reputation and expose larger conspiracies.
- Gain companions, alter their trust and doctrine, and decide who can safely hold delegated power.
- Unlock future missions by expanding legal reach, not just damage and health.
Each loop should contain one deliberate choice, one reversible mistake, and one visible consequence. For example, the player can spend coin to buy a safe route, persuade the scout and keep the coin, or skip the route and risk an ambush. The transcript remains deterministic, but the player sees that the world is not a single locked command sequence.
The first implementation can cover one mission:
Mission: recover the missing eagle standard from the ruined signal tower.
Complication:
- the tower gate is unstable,
- a wounded legionary is trapped behind a ward,
- a guild scout wants payment before sharing a safe route,
- a temple witness refuses to certify the relic unless the player has not used
a forbidden oath rite.
Good outcomes:
- standard recovered,
- survivor evacuated,
- gate sealed,
- scout paid or persuaded,
- temple witness records clean custody.
Narrative Model
The game should read like a compact expedition report unfolding through play, not like a list of room fixtures. The server owns mission state; NPC processes own voice, advice, rumors, and local reactions.
Use mission beats:
briefed: Varro states the objective and offers two optional authorities.crossed-gate: the gate opens, chat switches from fort traffic to field traffic, and old fort chatter becomes history.first-contact: the first hostile sign teachesinspectand intent text.complication: a survivor, relic, or blocked route forces a tradeoff.turning-point: the player delegates, spends, steals, persuades, or fights.extraction: the player returns with relics, witnesses, prisoners, or sealed-route proof.debrief: rank marks, faction opinion, and audit records update.
Narrative should be stateful and short. A room description can have variants before and after key facts are known, but it should not become walls of prose:
Ashen Road
Signal ash drifts across the old paving stones. Maro has marked a narrow
ditch-route east.
Exits: west tower-east ditch-east
Actors: maro
Lead: ask maro about route, or inspect ash-tracks.
NPCs should surface stakes. Livia cares about unstable wards, Varro cares about orders and casualties, Iunia cares about custody and forbidden rites, and Maro cares about payment, favors, and survival. Their objections create interesting command choices rather than static lore.
World Model
Replace the current static room list with a small graph of Site records:
site_id
title
description
region
threat_level
exits
visible_items
actors
active_wards
required_route_cap
Keep the first map small:
fort_aurelian: command room, quartermaster, temple annex.gate_yard: portal control, squad muster, unstable gate.ashen_road: contested approach with scout and ambush traces.signal_tower: relic objective, wounded soldier, ward puzzle.under_vault: optional dangerous route with an oath-echo hook.
The wider game should grow this into multiple settlements and outposts rather
than a single hub with more rooms. fort_aurelian stays the first proof
settlement. Later content can add a civilian city, a temple-administered site,
a guild waystation, and resource-producing outposts such as mines, farms,
timber camps, shrines, salvage yards, gate-yards, and repair yards. Routes
between them carry distance, hazard, faction control, seasonal closure,
cargo-limit, and authority metadata. Outposts produce bounded resources and
consume supplies through service-owned state; user save capsules can back up
private profile data but cannot invent public production or market facts.
Items should become capabilities or evidence, not inert nouns:
eagle_standard: relic custody cap; proves mission objective.ward_writ: authority to request ward changes, but logs every use.scout_marker: grants access to hidden route hints.oath_echo: one-use inherited rite; powerful but politically risky.temple_seal: certifies clean custody if conditions are met.
Actors
NPCs should be separate processes where possible:
- Centurion Varro: mission issuer, grants route and squad authority.
- Magister Livia: battlefield wizard, can identify ward failures.
- Acolyte Iunia: temple witness, audits relic custody and forbidden magic.
- Maro the Guild Scout: knows routes, trades in favors, can withhold help.
- Wounded Legionary: rescue objective and source of battlefield facts.
- Gate Echo: hostile magical presence that can corrupt routes or chat.
Each actor should own only the caps that fit its role. For example, the scout does not get relic custody, and the temple witness does not get squad command.
Mechanics
Authority As Inventory
The current inventory is just strings. Split the player-facing inventory view into:
items: physical objects visible in rooms,writs: mission and faction permissions,relics: dangerous objects with custody rules,marks: progression/rank state,evidence: facts or signed observations.
Player-facing commands can still say inventory, but output should show why
each entry matters:
Writs:
gate-route: tower approach, expires after return
ward-writ: request imperial ward changes, audited
Relics:
none
Evidence:
broken sigil: tower ward failed from inside
This view must not imply that all entries are picked up with take. Physical
objects use take and drop. Authorities use explicit grant, delegation, or
custody verbs:
request <writ>asks an actor or broker for a grant.accept <writ>receives a grant already offered by an actor.delegate <writ> to <actor>grants a scoped child or NPC authority.revoke <writ>withdraws authority when the holder allows it.
Failure text should distinguish object state from authority state:
You can see the southern ward, but your mission writ names only the tower gate.
Centurion Varro has not offered squad command authority.
The relic can be carried only after a temple witness seals custody.
Writs As Loot
Writs are RPG loot, not boring quest permissions. A writ is gear, skill tree, access key, and social status at the same time:
| RPG concept | Aurelian version |
|---|---|
| Weapon | combat writ, relic mandate, dueling license |
| Armor | ward writ, sanctuary bond, witness shield |
| Skill | delegation pattern, seal-breaking rite, custody transfer |
| Key | route grant, gate mark, archive token |
| Reputation | rank seal, faction trust, lawful standing |
| Curse | corrupted grant, forged writ, hostile obligation |
| Legendary item | ancient relic with dangerous authority |
A good writ should feel like “this changes what I can do,” not “this allows the next quest step.” Writs can carry bounded affixes and drawbacks:
Route Grant of Urgency
- Allows passage through old aqueduct
- Expires after 40 turns
- Cannot be delegated
- +1 faction trust if survivor extracted
Custody Writ of Burden
- Allows carrying sealed relic
- Reduces combat initiative
- Requires witness before transfer
- Breaking custody causes temple penalty
The modifier set must remain deterministic under the mission seed. A writ must make issuer, scope, expiry, allowed verbs, delegation rules, drawbacks, and revocation conditions inspectable.
Authority Archetypes
Classes are authority archetypes, not generic stat packages:
| Archetype | Legal, social, and supernatural power |
|---|---|
| Warden | protects people, escorts survivors, creates safe routes, specializes in wards and evacuation |
| Marshal | enforces law, arrests hostile agents, handles bounties, duels, raids, and frontier justice |
| Archivist | finds hidden evidence, preserves witness chains, decodes old grants, detects forged authority |
| Custodian | handles relics, dangerous artifacts, sealed rooms, containment failures, and temple politics |
| Factor | controls logistics, markets, supply lines, caravan permissions, construction, and regional influence |
| Heretic/Renegade | uses forbidden authority faster while risking corruption, exile, unreliable witnesses, and hostile audits |
Two archetypes may share combat numbers, but they should solve blocked situations with different verbs. A Warden invokes sanctuary, a Factor proves a supply right, an Archivist exposes the old seal, and a Heretic breaks the seal at a cost.
Delegation Buildcraft
Delegation is buildcraft. Companions are fallible agents, not portable stat bonuses:
| Trait | Gameplay effect |
|---|---|
| Loyalty | obeys the spirit versus the letter of a writ |
| Ambition | may exploit broad authority for personal goals |
| Competence | handles dangerous grants safely |
| Reputation | affects faction trust when holding delegated power |
| Fear | may abandon delegated duty under pressure |
| Doctrine | interprets ambiguous orders through law, temple rule, guild practice, or renegade code |
The player’s build is partly deciding which powers to keep, which to delegate, and whom to trust. Carrying a relic personally may block entry to a polluted shrine. Delegating custody may free the player to act, but the companion can be bribed, frightened, corrupted, or forced to testify later. The service owns the deterministic result and prints the cause.
Item Use
Add explicit verbs:
use <thing>give <thing> to <actor>ask <actor> about <topic>inspect <thing>seal <site|gate|relic>request <writ>accept <writ>delegate <writ> to <actor>revoke <writ>order <actor> to <task>
The service validates authority and state. Invalid actions should return specific text, not just the unchanged room.
Progression
Use a small rank model:
tiro: recruit/operator, tutorial grants only.signifer: can carry relic custody.centurion: can issue squad orders.legate: future high-authority profile.
This is not a stats grind. Rank changes which capabilities the broker may grant in later missions. Magic progression can mirror this with circles:
- first circle: detect wards,
- second circle: reinforce shields,
- third circle: stabilize a minor gate.
Progression should unlock reach, not only power:
Can issue temporary ward writs
Can enter disputed shrines
Can appoint one field deputy
Can challenge forged grants
Can hold two relics in custody
Can revoke delegated authority remotely
Can negotiate with hostile jurisdictions
Can operate without local witness once per mission
Base modules should also create new verbs, not passive bonuses:
| Module | Function |
|---|---|
| Archive | stores evidence, unlocks old maps, verifies claims, exposes forged records |
| Temple vault | stores relics, enables custody upgrades, binds dangerous artifacts |
| Barracks | trains deputies and companions, improves command delegation |
| Court | resolves disputes, converts evidence into rank, revokes corrupt grants |
| Market hall | trades supplies and regional favors, supports escrow for ordinary goods |
| Signal tower | extends remote revocation and delegation range, reveals route hazards |
| Sanctuary | protects rescued NPCs and creates story consequences |
Combat
Combat should exist, but it should stay tactical and bounded. The first interesting version does not need a full roguelike engine, but it does need enemies, danger, skills, spells, and readable outcomes.
Combat is turn-based at the command level. A fight starts when the player enters a hostile site, triggers a ward, fails a negotiation, or chooses to engage. Each turn the player picks one action, allied NPCs act if present, then hostile mobs act. The transcript should stay deterministic for smoke coverage. Combat should attack authority as well as HP. The distinctive tactical question is whether the player can keep lawful control while under pressure.
Enemy types should interact with authority directly:
| Enemy | Threat |
|---|---|
| Forger | creates fake writs and causes false accusations |
| Null-priest | disables local grants or sanctuary bonds |
| Bandit captain | steals custody tokens or route proofs |
| Corrupt magistrate | revokes or contests authority mid-mission |
| Wraith | ignores physical defenses but obeys old seals |
| Spy | learns route grants and ambushes exits |
| Oathbreaker | turns delegated powers against the player |
Good tactical verbs include:
inspect seal
challenge authority
bind relic
revoke grant
delegate ward to Mira
force witness
seal exit
expose forgery
claim custody
invoke sanctuary
Later combat can borrow a narrow set of Evil Islands-style tactical mechanics without inheriting its real-time randomness or punitive retreat traps. The grounding is recorded in Game Mechanics Prior Art. The useful ideas are visible preparation, careful fight selection, body-zone targeting, damage-type and armor interaction, stealth openings, and cast-time risk. Aurelian should translate those into deterministic command outcomes: scouting reveals threat and intent, inspected enemies expose vulnerable zones, weapons and spells target bounded zones, and failed positioning has explicit costs.
Player actions:
- shield a wounded soldier with
ward-writ, - call a legionary NPC if holding
squad-order, - use
oath_echoto bypass a lock at political cost, - attack with a weapon skill,
- cast a prepared spell,
- guard, retreat, seal a route, or order an ally.
This gives the player real decisions without turning combat into repeated
attack commands.
Targeted attacks should stay small and readable:
attack ward-wraith head with spear
attack imp-scout legs
cast ember-dart at ghoul hands
Zone effects are deterministic and bounded:
head: harder to land, can increase critical or disruption outcomes;hands: can reduce attack cadence, weapon use, or casting stability;legs: can slow pursuit, block retreat prevention, or weaken charge intent;core: the default reliable target, lower risk and lower swing.
Damage and mitigation should consider weapon type, spell type, zone armor, ward state, and inspected knowledge. A spear against a lightly armored weak point, a mace against armored limbs, or a ward spell against a revealed sigil should feel different in transcript text and outcome. The service still owns the exact result; clients do not roll hidden dice.
Mobs
Mobs should be small state machines owned by the adventure service or by separate actor processes once that split is useful. Initial mob types:
imp-scout: weak, fast, tries to flee and report.ash-ghoul: slow melee enemy, punishes unguarded players.ward-wraith: ignores ordinary weapons until a ward is inspected or broken.gate-hound: blocks retreat unless stunned or distracted.echo-centurion: elite magical-warrior enemy used as a mission boss.
Each mob has:
name
threat_level
hp
armor
zone_armor
ward
attack
morale
traits
intent
intent is visible when the player has scout or wizard support:
The gate-hound lowers its head. Intent: lunge at the weakest target.
The ward-wraith gathers blue fire. Intent: break shield next turn.
This makes fights more interesting than hidden dice rolls.
Unknown enemies should not expose full mechanical truth immediately. A scout, wizard, height advantage, prior codex evidence, or an inspection action can upgrade the view from rough threat to exact armor/ward/intent/counter data. That keeps stealth and observation relevant without forcing real-time mouse precision.
Basic Stats
Use a small stat block:
vigor physical endurance and wound tolerance
discipline morale, command, resistance to fear
edge weapon accuracy and quick action
ward magical defense and shield capacity
focus spell control and ritual stability
Derived values:
hp 8 + vigor * 2
guard discipline + ward
initiative edge + focus
load vigor + discipline
The player should see compact status:
Status: hp 12/14, guard 5, fatigue 1
Ranks: warrior 2 stars, wizard 1 circle
Prepared: shield-bind, ember-dart
Writs: tower route, ward-writ
Stats are not the main reward system. They exist so combat and spell choices are legible.
Leveling And Reputation
Progression should reward mission outcomes rather than enemy grinding. A successful debrief can grant:
rank marks: unlocks brokered authorities such as relic custody or squad order;warrior stars: unlocks martial and command skills;wizard circles: unlocks prepared spells and restricted ritual caps;faction standing: changes prices, testimony, help, and PvP legal status;codex entries: records inspected wards, mobs, relics, and route hazards.
Progression inputs should be auditable mission facts:
Recovered eagle-standard: +1 imperial standing
Evacuated wounded-legionary: +1 cohort standing
Used oath-echo: +1 breach power, -1 temple standing
Sealed gate with witness present: unlock relic-custody eligibility
Rank is therefore a policy input, not just a number. A player may be strong
enough to win a fight but still unable to receive temple-seal authority
after abusing forbidden magic.
Stars And Circles
Use separate progression tracks for martial and magical competence.
Warrior stars:
| Stars | Meaning | Unlocks |
|---|---|---|
| 0 | civilian or raw recruit | flee, guard, basic strike |
| 1 | trained legionary | shield wall, steady aim |
| 2 | proven frontier fighter | counter, command ally |
| 3 | veteran signifer | rally, hold line, relic carry |
| 4 | centurion-grade | issue squad order, tactical stance |
| 5 | heroic champion | break elite guard, inspire cohort |
Wizard circles:
| Circle | Meaning | Unlocks |
|---|---|---|
| 0 | no formal spell license | use charged relics only |
| 1 | apprentice field magic | ember dart, detect ward |
| 2 | battlefield adept | shield-bind, mend wound |
| 3 | gate specialist | stabilize gate, dispel minor ward |
| 4 | war mage | dome shield, bind hostile spirit |
| 5 | archmage authority | rewrite route, seal major breach |
Stars and circles are player-facing rank labels, not copied external lore.
Implementation may rename them if a stronger capOS-specific vocabulary emerges.
Both are capability policy inputs. A 3-star warrior can receive relic custody
that a recruit cannot. A 2-circle wizard can receive a ward-writ but not a
gate-rewrite cap. The player-facing fiction is rank and training; the
system-facing implementation is brokered authority.
Skills And Spells
Skills are martial or command actions:
strike: basic weapon attack.guard: reduce incoming damage and protect one ally.shield-wall: requires 1 star and an allied legionary.counter: requires 2 stars; punish a missed melee attack.rally: requires 3 stars; restore morale and clear fear.order: requires appropriate writ; make an allied NPC act now.
Spells are prepared actions with fatigue or reagent costs:
ember-dart: 1 circle; reliable ranged damage.detect-ward: 1 circle; reveals ward traits and mob intent.shield-bind: 2 circles; temporary guard bonus.mend-wound: 2 circles; stabilize or heal a wounded target.stabilize-gate: 3 circles; stops gate hazards or opens safe retreat.dome-shield: 4 circles; protects the whole party for one turn.
Forbidden or risky techniques:
oath-echo: one-use inherited rite; strong effect, audit cost.demon-brand: hostile shortcut; should exist as a temptation, not an ordinary optimal play path.
Prepared spells should be visible in status. The first mission can grant only
ember-dart, detect-ward, and shield-bind.
Fight Commands
Add combat commands:
attack <mob>skill <name> [target]cast <spell> [target]guard [ally]order <ally> to <action>retreatstatus
Example:
[combat:signal_tower]> cast detect-ward wraith
The ward-wraith is bound to the broken tower sigil.
Intent: break shield next turn.
[combat:signal_tower]> order livia to dispel-sigil
Magister Livia spends the ward-writ grant. The sigil cracks.
[combat:signal_tower]> attack wraith
Your gladius bites through the fading ward. 5 damage.
Loot And Equipment
Loot should be sparse, inspectable, and tied to authority. The game should not become a pile of random nouns.
Item categories:
supplies: torches, bandages, reagents, gate-stabilizer parts.equipment: gladius, shield, bow, focus ring, warded cloak.relics: eagle standard, sealed tablets, oath-bound cores.evidence: ash traces, broken sigil sketches, witness statements.trade goods: coin, salvage bronze, guild favors, ration chits.
Every loot entry should have at least one of these uses:
- opens a route,
- changes a combat choice,
- helps an actor,
- sells for a predictable value,
- acts as evidence in debrief,
- is dangerous custody with audit consequences.
Equipment can remain simple:
weapon: affects attack damage or skill unlocks.shield: affects guard and ally protection.focus: affects spell fatigue and ward inspection.cloak: affects stealth, ambush, or faction recognition.load: caps carried gear before fatigue penalties.
Later equipment should support blueprint/artifact construction without turning the game into unbounded loot rolling. A construction job names a blueprint, materials, location/facility class, rank/star/circle gates, cost, expected duration, and output bounds. The service reserves materials and currency, validates the job, records it, and completes or releases it through the same transaction discipline used by markets. Item properties are derived from the base blueprint, material choices, crafter skill/rank, facility quality, and paid cost. Enchantment is a constrained post-process: object type, enchanter circle, lawful authority, and remaining enchantment slots determine valid results. Artifact-scale outputs include witness-sealed relic cases, warded cloaks, focus rings, route compasses, golem cores, and gate-stabilizer parts. This construction direction is grounded in Game Mechanics Prior Art, especially the Evil Islands and EVE Online notes.
Relics are not ordinary loot. They require custody authority, may be move-only, and should be visible in audit output. Dropping or trading a relic without the right witness should be a meaningful failure.
Buying, Selling, And Logistics
The shopkeeper should become a small economy service rather than ambient chat. The first version can be deterministic and local:
buy <item> from <actor>sell <item> to <actor>quote <item> from <actor>trade <item> to <actor> for <item|favor>repair <item> at <actor>
Markets should have roles:
- quartermaster: sells supplies for ration chits, requires imperial standing;
- guild scout: sells route hints and contraband for coin or favors;
- temple annex: certifies relic custody and sells lawful wards;
- field engineer: repairs gate parts, golems, and damaged equipment.
Prices should be legible and bounded:
Maro offers ditch-route for 1 coin or scout-favor.
Quartermaster refuses focus-ring: requires wizard circle 1.
Iunia will certify eagle-standard only if oath-echo was not used.
Buying and selling maps naturally to capabilities. A shop can only sell what its actor is authorized to transfer; the player can only receive items or writs permitted by rank, faction standing, and mission state. Trade failures should name the blocked authority, not pretend the item is missing.
The regional market target is closer to a brokered order book than a single shop inventory. Market services should define market-eligible item classes, regional buy orders, sell orders, price/time priority, immediate matching when prices cross, expiry, fees, and ordered ledger receipts. Items that are not market-eligible still move through explicit custody, barter, witness, or quest flows. If several services own profile inventory, expedition cargo, outpost stock, or cloud-backed records, the market coordinator needs reserve/escrow, commit/release, stale-version rejection, idempotency keys, cancellation, retry, and crash-recovery behavior before any player-visible two-party exchange is treated as implemented. This market direction is grounded in Game Mechanics Prior Art, especially the EVE Online notes.
Randomization
Randomness should make repeated play feel alive without making QEMU coverage fragile. Use seeded mission variation, not hidden unbounded dice. The legal model remains deterministic and auditable: under the same seed and discovered facts, authority grants, denials, revocations, custody outcomes, and faction consequences must replay exactly.
The mission seed can choose:
- one of several mob placements,
- one optional route hazard,
- one mission complication,
- one faction demand,
- one shop inventory variant,
- one companion behavior pressure,
- one relic side effect,
- one enemy authority trick,
- one optional objective,
- one loot or writ modifier,
- one NPC rumor or personality line,
- one loot cache location,
- one debrief complication,
- a calendar state: season, day, weather/hazard class, seasonal resource table, festival/event hook, and routine variant.
The seed should be visible through debug or transcript mode, and smoke tests should pass a fixed seed through the manifest or mission setup:
Mission seed: 0x0000_aure_0009
Variant: ash-ghoul at ashen_road, focus-ring at under_vault
Combat randomness should be constrained by intent text. If an attack can miss, the player should see why through guard, morale, terrain, fatigue, or a mob trait. Critical swings should be optional flavor unless the seed is fixed.
Calendar variation should be similarly explicit. Four 28-day seasons are a reasonable initial model. Seasonal crops, forage, fish, shops, route hazards, and outpost production have bounded availability tables. Ordinary seasonal crops and fragile goods expire or degrade at season change unless the content declares them as multi-season. Festivals and military events can alter actor routines, witness availability, shop stock, quests, gifts, and affection-style standing records, but those effects must be ledger/profile facts rather than client-local counters. This calendar direction is grounded in Game Mechanics Prior Art, especially the Stardew Valley notes.
The implemented foundation keeps this deterministic for the smoke seed: generated mission content carries a fixed season, day, weather, hazard class, bounded seasonal resource records for the proof categories, and fixed-smoke festival/military-event metadata. Status output prints that calendar state and the active event metadata. Production per-run seed selection, gameplay effects from events, actor routine changes, and gameplay consumption/expiry rules remain future work.
Seeded generation should produce explicit world artifacts, not invisible
ambient randomness. The stable base game remains authored content: factions,
major sites, named relics, law, core routes, capability interfaces, and proof
missions. A production world can then select deterministic overlays from a
WorldlineSeed:
- local room/map variants under authored region constraints;
- optional hazards, mob placement, loot caches, and rumor/debrief variants;
- seasonal resource tables, route closures, outpost production, and shop stock;
- festival or military-event schedules;
- bounded NPC routine variants and non-critical chatter hooks;
- regional market starting books, subject to service-owned order-book rules.
Every generated artifact should carry enough provenance to be replayed or rejected: content release id, worldline id, seed epoch, generator version, scope label, and bounded output size. Services should persist selected artifacts once admitted so later patches do not silently rewrite active worlds. Smoke runs keep using fixed authored selections until the generator itself has pure tests and a fixed-seed QEMU proof.
Chat And Room Memory
Room chat should become diegetic:
- room channels are command, scout, temple, and expedition channels,
- NPC process messages are radio/runner/magic-slate traffic,
- history replay is labeled as “recent room record” if intentionally shown,
- private messages require a separate cap or actor relation.
That turns the current chat persistence quirk into a feature.
Multiplayer
Multiplayer should be a first-class capability demonstration, not just several
clients sharing one room. The current local foundation keeps per-player state
keyed by live endpoint caller-session metadata and assigns service-local player
labels such as player-1 for party commands. Those labels are not caller-chosen
badges, global principal ids, or portable identity. The adventure service can
add explicit shared expedition objects for parties, duels, trades, and
contested sites when the single-service state model stops being sufficient.
Co-op mechanics:
party create <name>creates a shared expedition with a leader cap.party invite <player>sends a join offer; accepting grants a party member cap with scoped verbs.party delegate <writ> to <player>gives another player a narrow authority, such as route access, relic carry, or squad order.assist <player> with <task>contributes a skill, spell, item, or witness action to another player’s command.sync-turnor mission turn barriers let QEMU scripts prove deterministic multi-client combat without racing terminal input.- Split roles make co-op matter: scout reveals intents, wizard handles wards, warrior protects allies, witness certifies relic custody, quartermaster carries supplies.
Co-op failure should be interesting but bounded. A player can waste a shared turn, drop supplies, or revoke a delegated route, but cannot mutate another player’s private inventory without an accepted trade, custody transfer, or party rule.
Accepted trade and custody transfer must be service-mediated state transitions,
not independent edits to two player save blobs. If one service owns both
inventories, it should perform a single version-checked mutation and emit one
ordered receipt. If ownership is split across profile, expedition, market, or
cloud-backed stores, the Trade/Market/Expedition coordinator needs an
escrow or saga protocol: reserve the item and consideration with idempotency
keys, commit or release both sides, record an append-only ledger receipt, and
make stale offers, cancellation, retry, and crash recovery explicit. User-owned
Drive/Firebase save capsules cannot authorize these transfers; they may only
back up the resulting private state after the authoritative receipt exists.
PvP mechanics should be opt-in and lawful in the fiction:
duel challenge <player>creates a temporary arena with agreed rules.duel accept <player>grants a duel-combat cap scoped to the arena.spar <player>allows nonlethal training damage and skill practice.contest <site>lets factions compete over a route, relic, or witness record when the mission explicitly allows it.bounty mark <player>can exist only as a future policy-backed authority, never as ambient attack permission.
PvP must not mean “any client can attack any badge.” Harmful verbs require an arena, duel, faction-war, or bounty capability. The service should reject unauthorized attacks with a policy explanation:
No lawful conflict grants target marcus. Challenge a duel or enter the contested yard.
Useful rich PvP/co-op surfaces:
- shared threat tables where guarding an ally changes mob target intent;
- formation commands that require two or more players to hold compatible ranks;
- witness challenges where one party audits another party’s relic custody;
- contraband markets where guild standing helps one player but hurts temple reputation;
- route races where two parties can choose negotiation, sabotage, or legal contest depending on granted authorities;
- post-mission debriefs that record contribution, friendly fire, revocations, trades, and witness disputes.
Architecture:
Expeditionservice owns shared party/site/combat state.Adventurekeeps private player profile and inventory state.Chatcarries room, party, duel, and faction channels.TradeorMarketservice owns two-party item/currency exchange, including reserve/escrow, commit/release, stale-offer cleanup, and replay-safe receipts.Auditrecords contested custody, PvP consent, and debrief evidence.
Near-term multiplayer should use live caller-session keys now and move to
broker-granted service facets or service-created player objects once those are
available. Manifest-issued receiver selectors are not a temporary Aurelian
identity bridge; user-facing shell syntax must not choose or relabel another
service identity. A useful QEMU proof uses two player objects or two distinct
live caller sessions, one shared party, one delegated ward-writ, and one
deterministic assist:
player1: party create tower
player1: party invite player2
player2: party accept tower
player1: delegate ward-writ to player2
player2: assist player1 with detect-ward
player1: attack ward-wraith
That proof is more valuable than adding network transport first because it exercises authority, shared state, and deterministic turn ordering locally.
Keep multiplayer scoped and desirable. Near-term multiplayer is cooperative expedition pressure: shared sites, dangerous relic custody requiring multiple witnesses, player deputies, faction-controlled regions, contested bounties, asynchronous rescue contracts, and public proof of heroic or criminal actions. Open ambient PvP is not the target; harmful verbs require explicit duel, contest, bounty, or faction-war authority.
MMO-scale open economies, broad player construction seasons, LLM-driven mission-critical NPCs, cross-instance federation, and worldline travel are deferred until the compact expedition loop works as a local and cooperative RPG.
Commit 335a9ee at 2026-04-28 22:22 UTC landed the first bounded Phase 12
foundation: the existing Adventure service now owns local party records for
party create, party invite, party accept, party leave, party delegate,
and assist. Party membership, pending invites, delegated ward-writ, and
assist records are deterministic service state keyed by service-local player
labels derived from caller-session keys, with transitions routed through the
unit-tested adventure-content party state. The initial
assist <player> with detect-ward path requires party membership and a
matching delegated ward-writ; it does not transfer items, currency, or
private inventory authority. A real one-client cap assertion covers the typed
party surface, while the two-client proof remains open until the manifest and
launcher/session APIs can run two real Adventure clients with distinct live
caller-session keys without faking them inside one process.
Commit ac49375 at 2026-04-29 06:43 UTC landed the next bounded Phase 12
transfer foundation and keeps transfer state inside the existing Adventure
service. The new typed
Adventure.transfer(item, player) path supports transfer <item> to <player>
for physical items only, derives both service-local player labels from live
caller-session keys, requires shared party membership, refuses relic custody
such as eagle-standard, and mutates source/target inventories atomically
through unit-tested adventure-content transfer logic. The scenario process
asserts one-client refusal paths without synthesizing a second session. Currency
escrow, market-scale two-party exchange, and successful two-client QEMU transfer
proof remain open.
Parallel Universes And Cross-Instance Worlds
Parallel universes fit Aurelian better as sovereign worldlines than as one
shared mutable map. Each capOS instance can host one or more worldline services
that use the same content release but different WorldlineSeed values, calendar
epochs, market starts, event schedules, and generated regional overlays. Players
should experience those worlds as alternate Aurelian frontiers, while the
authority model treats each worldline as its own shard with its own ledger,
market, expedition, and profile policy.
This is feasible, but only after the local authority model is solid. Raw capability slots, endpoint generations, session ids, and local player labels cannot be portable authority across kernels. Cross-instance play needs a federation gateway that presents narrow local facade caps backed by remote protocol messages:
WorldlineDirectory: lists known remote worlds by content release, worldline id, endpoint, policy, and current ledger head.WorldlineVisit: grants read/observe/chat/travel-preview authority for a remote site without importing inventory or mutation rights.WorldlineExpedition: creates a bounded cross-world expedition object with explicit participants, allowed verbs, timeout, and home-world settlement rules.WorldlineTransfer: coordinates item, currency, custody, or profile-state movement through reserve/escrow, commit/release, replay-safe receipts, and content-version checks.WorldlineAudit: verifies remote receipts, ledger-head continuity, content hashes, generated-artifact provenance, and policy compatibility.
The game should support several integration levels, each with a different authority cost:
- Echo view: a player can inspect another worldline’s public map state, rumors, market summaries, and public history. This is read-only and can land first.
- Envoy visit: the local world creates a temporary projected character in a remote worldline. The projection may chat, observe, or perform explicitly granted low-risk actions, but cannot spend home inventory directly.
- Expedition bridge: two worlds run a shared mission instance with a fixed seed and a coordinator receipt. Contributions are recorded in both ledgers only after commit.
- Trade or custody transfer: ordinary goods, relics, currencies, or reputation effects move only through a transfer coordinator. Partial failure releases reservations; retries are idempotent.
- Worldline migration: a profile moves or copies into another worldline under policy. Public achievements and custody claims are imported as verifiable receipts, not trusted client save blobs.
Parallel universes make seeded generation more important. If every worldline is
the same authored graph with the same resources and routines, federation is
mostly remote chat. Meaningful worldline differences should come from bounded
seeded overlays: seasonal economies, market starts, event schedules, route
hazards, outpost outputs, NPC routines, optional dungeons, and regional map
variants. The generator must not mint authority by accident. It can choose that
guild_iron_mine is closed in one worldline and open in another, but the
resulting travel, market, custody, and reward effects still flow through the
same service-owned capability and ledger paths.
Cross-world compatibility rules:
- A worldline advertises a content release id and generator version. Peers may reject incompatible worlds or fall back to echo-only mode.
- Generated artifacts are referenced by stable ids and provenance hashes, not by trusting remote prose.
- Remote markets expose authenticated order/receipt views; local UI hints are not authority.
- Cross-world transfers name both ledger heads and both worldline ids in the receipt, so replay into a different universe fails closed.
- Faction standing, rank, and contributor rewards should import as witnessed claims with local policy gates. A remote honor does not automatically grant a local writ.
- Clock/calendar drift is part of the design: worlds may have different seasons, festivals, or wars. Shared expeditions must pin an event epoch or name which world’s calendar controls the mission.
- Failure modes are ordinary gameplay states: remote world unavailable, receipt stale, policy mismatch, content hash unknown, or escrow timeout. Each should have a deterministic denial path.
Near-term proof should not attempt full network-transparent play. A useful first slice can run two worldline services on one capOS instance with different fixed seeds and prove echo view plus a denied transfer:
worldline list
worldline inspect aurelian-mirror
worldline echo aurelian-mirror fort_aurelian
worldline transfer eagle-standard to aurelian-mirror
The expected result is that public state can be observed with content/seed metadata, while relic transfer is rejected until custody escrow, remote policy, and dual-ledger receipts exist.
Capability Mapping
The setting should teach capability ideas through play:
| Game concept | capOS concept |
|---|---|
| mission writ | restricted launcher or mission bundle |
| gate route | endpoint/router cap with revocation |
| ward writ | typed authority to request ward-state mutation |
| relic custody | move-only cap with audit trail |
| temple witness | audit/log service with policy checks |
| rank mark | session/profile metadata influencing broker grants |
| oath echo | sealed inherited state or one-use privileged cap |
| hostile magic | untrusted service/domain with strict schema boundary |
The player does not need to see cap IDs. The game text should make authority concrete: “You cannot open the southern ward; your writ names only the tower gate.”
Service Architecture
Target process split:
flowchart TD
Shell[capos-shell] --> Client[adventure command client]
Client --> Adventure[Adventure service]
Client --> Chat[Chat service]
Adventure --> Mission[Mission state service]
Adventure --> Audit[Audit or witness service]
NPC1[centurion process] --> Chat
NPC2[scout process] --> Chat
NPC3[temple witness process] --> Chat
NPC1 --> Adventure
NPC2 --> Adventure
NPC3 --> Adventure
Near-term implementation can keep one adventure-server process and separate
NPC processes that only hold console and chat, matching the current
system-adventure.cue shape. NPCs that affect world state should initially do
so indirectly through player-visible offers and chat events. Direct NPC calls
into Adventure require session-bound service facets such as AdventureNpc,
or an equivalent broker-granted authority that cannot mutate unrelated mission
state. Receiver-selector compatibility grants are not NPC mutation authority.
Later, mission state, actor AI, and audit/witness behavior can split into separate services.
Player, World, And Game Persistence
Persistence should be explicit service state, not kernel process checkpoint/restore. The adventure game needs several kinds of state with different durability rules:
- Session state: foreground client state, prompt mode, transient command context, and chat cursors. This is per client and may disappear when the client exits.
- Expedition state: current site, active mobs, hp/fatigue, temporary effects,
party membership, pending invites, turn ordering, and in-progress objective
state. It is resumable only when the player explicitly resumes an expedition.
Ordinary
run adventure-clientshould start from the current profile but not silently continue a half-finished mission. - Profile state: player id, display handle, rank marks, warrior stars, wizard circles, faction standing, cosmetics, contributor badges, title choices, and settings. This is durable player data and must survive client exit; once a durable store exists it should survive reboot.
- Ledger state: append-only mission facts, debrief records, relic custody, forbidden-rite use, witness certifications, market/trade receipts, reward mints, and revocations. This is the audit source for profile mutations and should be harder to rewrite than profile summary fields.
- World/public state: server-authoritative shared world data: persistent room history, public faction standing and consequences, quest-board state, market stock and prices, contested-site outcomes, ledger-derived public history, and shared campaign events. This is owned by game services, not by any one client, and is separate from private profile inventory. Smoke transcripts exercise a bounded slice of this state; production game-world instances grow it under capacity policy and shard boundaries rather than a fixed cap.
- Content state: mission definitions, generated content blobs, aliases, dialogue, map graph, and validation metadata. This is versioned read-only content selected by content hash or release id, not player-mutable state.
- User-owned backup state: encrypted save capsules stored through a browser session in the user’s own Google Drive app data folder or Firebase-backed user document space. This is private backup/sync state, not an authoritative source for shared world facts, rewards, or multiplayer outcomes.
Internal service split:
flowchart TD
Client[Adventure client] --> Adventure[Adventure service]
Adventure --> Content[AdventureContentCatalog]
Adventure --> Profile[AdventureProfileService]
Adventure --> Expedition[AdventureExpeditionService]
Adventure --> Ledger[AdventureLedger]
Profile --> Save[AdventureSaveStore]
Expedition --> Save
Ledger --> Save
Save --> Store[Store or CloudGameStore]
Save --> Vault[UserOwnedSaveVault]
AdventureContentCatalogexposes validated read-only mission content by content hash or release id and reports the generator/schema version used to build it.AdventureProfileServiceowns durable per-player profile summaries. The current pre-ledger substrate may expose direct bounded summary mutation for host-testable save/load behavior, but final reward, title, rank, faction, cosmetic, badge, and similar profile application must be ledger-backed onceAdventureLedgerexists.AdventureExpeditionServiceowns active mission/world instances. It can keep short-lived expeditions in memory, but explicit resume requires a checkpoint written throughAdventureSaveStore.AdventureLedgeris append-only from ordinary game clients. Correction and revocation require separate witness/admin authority and must leave a record rather than rewriting history.AdventureSaveStoreserializes bounded Cap’n Proto save records to whichever backing service it was granted. It hides whether the backing is RAM, local diskStore/Namespace, or a cloud bridge.CloudGameStoreis an optional bridge service, not a replacement for capOS storage semantics. It exposes the same save/load/append operations as the local store adapter and should be granted only to the profile, expedition, and ledger services that need it.UserOwnedSaveVaultis a browser-mediated backup target. The browser receives an encrypted, signed save capsule and writes it using user-granted Drive or Firebase authority. Encryption keys follow the storage domain: local capOS storage uses local capOS-host key material, while GCP-backed game-world data uses Cloud KMS envelope encryption with a per-world or per-shard KEK wrapping service-owned DEKs. capOS and the adventure service do not receive the user’s OAuth access token, Firebase refresh token, Drive file IDs beyond opaque handles, or provider credentials.
Recommended rollout:
- Volatile baseline: keep current in-memory state keyed by the live endpoint caller-session scoped ref plus epoch, but define the profile, expedition, ledger, and content records as bounded structs and add host tests for encode/decode and migration rules. Normal shell launch/grant commands now omit legacy badge and receiver-selector syntax; explicit selectors are low-level compatibility or hostile-path fixtures, not the state identity model.
- Local store baseline: use RAM-backed then disk-backed
Store/Namespacecaps to prove profile save/load, explicit expedition checkpoint/resume, and ledger append/replay. This is the offline and QEMU proof path. - GCP-backed bridge: run a narrow
CloudGameStorebridge outside capOS or as a capOS service once networking is available. A practical GCP deployment uses Cloud Run for the bridge endpoint, Firestore Native mode for mutable profile/index documents and transactional updates, Cloud Storage with object versioning/lifecycle policy for immutable snapshots and evidence blobs, and Secret Manager for bridge-side service credentials. capOS clients still see only theCloudGameStorecapability. - User-owned browser vault: for private player data, a web terminal or
browser companion can store encrypted save capsules in Google Drive
appDataFolderor a Firebase user document. This is useful before capOS has durable local disk or direct provider SDK support. It must be treated as user-controlled transport for game-world encrypted data: the user can delete, withhold, duplicate, or roll back blobs, but cannot decrypt or forge accepted state without the relevant local capOS key or game-world KMS authority. On restore, the game verifies signatures, schema/content hashes, profile id, monotonic capsule version, previous capsule hash, and policy bounds before accepting any state; decrypted ledger records still validate their own previous-record hash chains. - Hybrid sync: local
Storeremains the source for QEMU/offline proof paths, whileCloudGameStorereplicates selected profile/ledger objects. The sync boundary must be explicit: profile summaries may be overwritten through a checked version, ledger records append, and expedition checkpoints resolve conflicts by rejecting stale writes rather than merging combat state.
Minimum save record set:
AdventureProfile {
profile_id
display_handle
version
ranks
warrior_stars
wizard_circles
faction_standing
cosmetics
contributor_badges
settings
updated_at
}
AdventureExpeditionCheckpoint {
expedition_id
profile_id
content_hash
checkpoint_version
site_id
objective_state
player_state
party_state
mob_state
pending_events
saved_at
}
AdventureLedgerRecord {
record_id
profile_id
expedition_id
content_hash
kind
previous_record_hash
payload
witness
created_at
revoked_by
}
User-owned save capsules wrap those records rather than replacing them:
UserSaveCapsule {
schema_version
capsule_version
profile_id
device_id
content_hash
migration_policy
record_kind
record_version
previous_capsule_hash
plaintext_hash
ciphertext
aead_algorithm
signature_algorithm
signer_public_key_id
signature
created_at
}
Capsule encryption follows the same storage-domain rule as the backing store.
When state is stored locally on a capOS host, the encryption key is local
capOS-host key material and local backup/restore needs an explicit local key
recovery story. When state is stored in GCP services, Cloud KMS is the
key-encrypting-key service: it wraps or unwraps a capsule DEK, while the
game-world service decrypts and validates capsule plaintext internally using
the unwrapped DEK as service authority, modeled as a SymmetricKey capability.
The browser may transport ciphertext, wrapped DEKs, and opaque Drive/Firebase
handles, but it should not receive a plaintext DEK, SymmetricKey cap,
KeySource cap, KMS decrypt/unwrap grant, or provider-independent plaintext
authority unless a later explicit user-managed key export design adds that
mode. For GCP-backed worlds, access to unwrap and use the DEK is game-world
service authority mediated by KMS/IAM, not ownership of the Drive or Firebase
blob alone.
For the GCP path, each game-world instance or shard gets its own Cloud KMS key
ring and symmetric CryptoKey KEK. Runtime grants are scoped to the CryptoKey
where possible: encrypt-only writers use roles/cloudkms.cryptoKeyEncrypter to
wrap new DEKs, restore/migration readers use roles/cloudkms.cryptoKeyDecrypter
to unwrap existing DEKs, and only the small game-world service that must do both
uses roles/cloudkms.cryptoKeyEncrypterDecrypter. Rotation affects future DEK
wrapping but does not re-encrypt existing capsules or retire old key versions.
Re-encryption or rewrapping is a managed service operation: decrypt and validate
the capsule inside the game-world service, then write a new capsule with a new
DEK or a DEK rewrapped by the current primary KEK version. Old versions stay
enabled until no accepted wrapped DEK depends on them. Retiring a world removes
decrypt IAM first, may disable key versions to make protected capsules
inaccessible, and only schedules destruction after audit/recovery decisions
because completed key version destruction is irreversible.
Every persisted record needs a schema version, content hash or release id, size limit, and migration rule. Save/load must fail closed when the content hash is unknown, the record exceeds bounds, a capsule or ledger hash chain does not match, or the caller lacks the profile/expedition authority.
Do not use user-owned Drive/Firebase blobs as authority for public state:
- contributor rewards still require
AdventureLedgerwitness records; - multiplayer outcomes and market trades still require service-side validation;
- public room history and shared world events should be stored by the game service or cloud bridge, not accepted from a user’s private backup;
- rollback of a private backup may restore local profile cosmetics or an explicit expedition checkpoint, but it must not erase append-only public ledger facts.
Interface Sequencing
Do not add new gameplay verbs only as ad hoc client text. Every verb that changes world state needs a typed route before it is accepted as implemented.
For Phase 1 verbs, update these surfaces together:
schema/capos.capnp: add methods or typed command records forinspect,use,give,ask,order,seal,status, and explicit authority verbs includingrequest,accept,delegate, andrevoke.tools/generated/and canonical generated bindings through the existing generated-code workflow.demos/capos-chat: add request encoders, result decoders, and DTOs for the new adventure methods.demos/adventure-server: validate state, authority, bounds, and failure text in server handlers.demos/adventure-client: keep parsing thin; convert user text to typed calls rather than duplicating game rules.tools/qemu-shell-smoke.sh: assert one success path and one failure path for each new state-changing method.
The future CommandSession interface can replace the text adapter, but it is
not a reason to add stringly world mutation in the interim.
Resource Bounds And Determinism
Two distinct bound regimes apply: the deterministic QEMU smoke proof, which must stay small and reproducible, and the production game-world instance, which is bounded by service capacity, shard policy, and quotas rather than a single fixed cap. The numeric limits below are the smoke-instance defaults; production tuning belongs to the game-world deployment runbook and grows with profile/ledger/expedition substrate, multiplayer authority, and persistent world state.
Smoke-instance and per-shard rules:
- Per-instance
MAX_PLAYERSstays explicit; every per-player map entry is removed onleaveor process teardown. - Cap per-player inventory entries, writs, relics, marks, evidence records, active effects, and remembered chat cursors.
- Cap per-site mobs, items, active wards, actors, and pending events.
- Cap combat transcript lines per turn and reject oversized action text before semantic parsing.
- Smoke transcripts use fixed encounter scripts. Production play may seed variation from mission state, but transcript-critical paths must still force a stable seed.
- Keep multiplayer parties, duels, trades, pending invites, and contested-site records bounded per mission/shard with explicit overflow behavior, not silent drop.
- Chat history must either be cursor-based per client or printed under an explicit “recent room record” header with a bounded line count for live views; persistent chat history may grow under retention policy in ledger/world-state services.
- The current
StdIOadapter may accept 256-byte command lines, but typed ids inside service calls remain 64-byte ASCII ids unless a reviewed schema/runtime change raises that limit. - Keep free-form text fields separate from ids:
saytext and future rest-of-line command text may use the command-line limit, while object ids, actor ids, mob ids, writ ids, directions, spell names, and skill names use the id limit. - Generated mission content must define explicit bounds for titles, descriptions, lead text, aliases, dialogue, and debrief lines, and branches that check in generated content need a freshness check so generated Rust blobs cannot drift from source mission data.
Smoke-instance suggested limits (production shards may raise these under capacity policy, but transcript-critical paths must run inside these bounds):
players: 64
ordinary inventory entries per player: 6
writs per player: 8
evidence records per player: 16
active effects per player: 8
mobs per site: 8
party members: 4
pending trades per player: 4
pending invites per player: 4
chat history per room: 16 lines
command-line bytes: 256
typed id bytes: 64
room/site title bytes: 80
description bytes: 320
lead/failure-hint bytes: 160
actor dialogue/debrief line bytes: 320
Command Surface
The current StdIO parser can grow the first mission quickly, but the target
should be the structured command session described in
Interactive Command Surfaces.
Initial text commands:
lookgo <direction>inspect <thing>take <thing>use <thing>give <thing> to <actor>ask <actor> about <topic>request <writ>accept <writ>delegate <writ> to <actor>revoke <writ>order <actor> to <task>seal <target>inventorystatussay <text>quote <item> from <actor>buy <item> from <actor>sell <item> to <actor>trade <item> to <actor> for <item|favor>repair <item> at <actor>party <create|invite|accept|leave|delegate>assist <player> with <task>duel <challenge|accept|yield>spar <player>contest <site>quit
Dynamic completions should come from room state:
- exits for
go, - visible items and held writs for
inspectanduse, - present actors for
askandgive, - quoted shop inventory for
buyandsell, - party members and pending invites for
partyandassist, - mission targets for
seal.
Rich Browser Client
A later browser client should be a real game presentation layer: pixel-art
locations, animated characters, inventory and authority panels, combat affordance
buttons, event feeds, and chat surfaces. It should not be a terminal emulator
with decorative art around StdIO.
The presentation model should be a 2D tilemap, not prose-only room cards.
World data sent to the browser can include maps, tilesets, tile layers, object
layers, collision/interaction zones, spawn points, actor paths, region/outpost
markers, and event triggers. Tiled JSON is a plausible authoring/export format
if the content validator rejects oversized maps, missing tiles, unknown layer
types, invalid object references, and presentation data that tries to carry
authority. PixiJS plus @pixi/tilemap is a reasonable first rendering
candidate because it targets WebGL 2D tile rendering with a canvas fallback.
That renderer choice must stay client-side; the game service remains the owner
of authoritative location, collision, interaction, market, custody, and combat
state.
That client can bypass adventure-client. The text client remains valuable for
QEMU proofs, scripted transcripts, and compatibility, but the browser UI should
talk to the adventure and chat services through WebShellGateway-held session
authority:
Browser pixel-art UI
-> WebShellGateway / web shell capability-call proxy
-> session-scoped AdventurePlayer and ChatParticipant caps
-> adventure-server and chat-server
The browser does not hold capOS capabilities directly. It holds opaque
web-session handles and sends typed UI actions such as movement, target
selection, inventory use, delegation, order, spell, skill, and chat requests.
The gateway maps those requests onto the real session-scoped capabilities and
returns structured view state or event records for rendering. Raw capOS
CapIds, badge selectors, game-world keys, provider credentials, broad network
authority, and shell spawn authority must not cross into browser JavaScript.
The trusted-host transport pattern that the gateway must satisfy already exists
for the operator remote-session UI: see
Remote Session CapSet Clients
for the redaction, view-model, and policy-preflight rules that keep capOS
handles, redacted transcript bytes, and provider credentials inside a trusted
Rust backend while browser JavaScript receives only typed view models, call
results, and denial diagnostics. The adventure browser client should reuse that
same authority boundary: a trusted backend owns the session-scoped
AdventurePlayer and ChatParticipant caps, applies the same redaction and
denial discipline, and ships only view models and event records to the
pixel-art renderer.
For the purpose-built adventure UI, a narrow AdventurePlayer /
ChatParticipant surface is a better primary ABI than generic terminal text:
AdventurePlayer.look()
AdventurePlayer.go(direction)
AdventurePlayer.status()
AdventurePlayer.inventory()
AdventurePlayer.useItem(item, target)
AdventurePlayer.order(actor, task)
AdventurePlayer.cast(spell, target)
AdventurePlayer.skill(skill, target)
AdventurePlayer.delegate(writ, actor)
AdventurePlayer.pollEvents(cursor, maxEvents)
ChatParticipant.say(text)
ChatParticipant.history(cursor, maxLines)
CommandSession can still exist for terminal-like front ends, command palettes,
automation, and compatibility adapters. It is not required for a custom
pixel-art client whose UI already knows it is presenting the adventure game.
The non-negotiable boundary is that browser presentation never becomes
authority. Every action still flows through typed game capabilities, and the
server rejects invalid location, stale state, missing authority, bad custody,
combat restrictions, and oversized input.
This belongs well after the current game-depth phases. It depends on WebShellGateway authentication/origin policy and teardown, session-bound adventure/chat identity, persistent profile/checkpoint semantics, and a stable core game loop. Asset manifests for sprites, portraits, tiles, VFX, UI sounds, and animation ids should be explicit data. Asset presence or selection must not grant game authority, and missing assets should fail as presentation errors rather than mutating game state.
The browser harness should verify more than successful loading. It should drive one deterministic mission through UI actions and check tilemap layer order, actor placement, viewport/camera bounds, collision affordances, event-feed updates, logout/tab-close teardown, and rejection of browser-side attempts to mutate authoritative state without the typed gateway call.
QEMU Proof Path
Keep a deterministic smoke path similar to make run-adventure, but make it
prove game mechanics:
setup/login
run adventure-client
status
ask centurion about mission
request ward-writ
go gate
use ward-writ
go tower
inspect standard
recover eagle-standard
ask legionary about ward
give scout-marker to scout
go under-vault
seal gate
inventory
quit
exit
Assertions should check:
- launch grants remain explicit,
- no password leaks into logs,
- invalid action returns a specific failure,
- authority grant/delegation uses explicit verbs rather than
take, - item use changes world state,
- NPC process reacts to at least one player action,
- mission completion records an audit/witness line,
- replayed chat is either suppressed or labeled as history,
- at least one canonical-id suggestion for a near-miss command,
- one shop quote or rejected trade explains the authority or price gate,
- a fixed mission seed prints stable variant and calendar metadata once randomization lands,
- a two-client co-op proof can delegate one writ or assist one action without leaking private inventory authority.
Current implemented proof coverage is intentionally narrower than the eventual
target game, but it now follows the Aurelian mission path. make run-adventure keeps the shell-driven adventure-client transcript focused on
representative interactive behavior: typed inspect, status, attack,
skill, cast, give, ask, order, request, accept, and delegate
calls; room-view mission, lead, actor, mob, writ, item, and canonical exit
context; categorized Items, Writs, Relics, Marks, and Evidence
output; a rejected invalid inspect input; canonical-id suggestions for
near-miss ward and wraith inputs; Maro route evidence on ashen_road; a
separate NPC process reaction; a failed attack against a warded mob; delegated
order livia to dispel-sigil exposing a ward; a resolved Livia actor alias
with an improved task hint; repeated detect-ward idempotence on an already
exposed ward; ember-dart spell damage; a 2-star warrior strike; and
eagle-standard recovery.
The complex custody path is covered by adventure-scenario-test, a real capOS
userspace process with only Console and Adventure caps. It calls
AdventureClient methods under QEMU and asserts initial categories,
under_vault denial before temple-seal, pre-recovery Iunia denial,
ward-writ route authority setup, ward-wraith defeat, relic recovery,
non-droppable relic behavior, missing-location custody denial, missing
ward-writ authority denial, unsafe-route witness refusal, survivor evacuation,
gate sealing, witness-certified temple-seal custody, final evidence tokens,
and under_vault access after custody.
The test strategy should stay split by risk. Pure deterministic game logic
should live in ordinary Rust unit tests where possible: calendar rollover,
seasonal availability, market matching, escrow state machines, blueprint
validation, artifact property derivation, enchantment limits, route
constraints, and agent quota accounting. Cross-service gameplay scenarios
should use a real Rust userspace test client process that calls game caps under
QEMU, as adventure-scenario-test already does for custody. The shell-driven
adventure-client transcript remains the basic command/client proof for
parser behavior, rendering, representative typed calls, and smoke-path
integration; it should not become the only coverage for complex market,
construction, economy, or agent-NPC state machines.
Implementation Plan
Phase 1: Player-Visible Mission Substrate
- Implemented so far:
- typed
inspect,use,status,attack,skill,cast, andguardmethods across schema, generated bindings, client wrappers, server handlers, terminal parser, and QEMU transcript assertions, - typed
give,ask,order,seal,request,accept,delegate, andrevokemethods across the same schema/client/server/parser/proof path, - explicit result text for failed and successful
go,take, anddropactions, - compact player combat stats in
status: hp, guard, fatigue, warrior stars, wizard circles, prepared spells, and active mobs, - one deterministic ward-wraith encounter with a warded-mob failure path, spell reveal, spell damage, martial skill damage, guard effect, and mob defeat,
- one explicit objective and completion condition tied to ward-wraith defeat,
- minimal per-player authority state for
ward-writrequest, acceptance, delegation, revocation, and gate sealing, - bounded per-player evidence/effect storage surfaced in
status, - replayed room chat labeled as history for later room joins,
- bounded object-id validation for typed object inputs,
- server-side canonical id normalization for common casing and title aliases, plus bounded near-miss suggestions for the current mission ids,
- typed
AdventureRoomViewmission, lead, actor, mob, writ, item, and canonical exit context rendered bylook, - structured status and inventory output split into survival, location, mission, physical items, writs, relic custody, marks, evidence, effects, and lead,
- idempotent repeated spell behavior in the interactive transcript,
- a dedicated
adventure-scenario-testuserspace process that calls theAdventurecap directly to prove relic custody denial, witness refusal, temple-seal certification, categorized evidence, andunder_vaultaccess.
- typed
- Current playable slice: the Aurelian gate-fort mission now comes from
demos/adventure-content/content/prototype.cue, with checked-in generated Rust output consumed by the server and verified bymake generated-code-check. State-changing behavior remains in Rust handlers, andmake run-adventureproves the interactiveeagle-standardrecovery and replay-history path, whileadventure-scenario-testproves survivor, gate-seal, and temple custody outcomes through realAdventurecap calls. - Typed relic recovery:
recover eagle-standardis the dedicated custody verb, withtakeanddropreserved for physical items. - Local party foundation:
Adventureowns the first deterministic party state for service-local player labels, pending invites, scopedward-writdelegation, anddetect-wardassist records. PvP consent, transfer escrow, and the two-client QEMU proof remain future work. - Physical-item transfer foundation:
Adventure.transferperforms same-party local item mutation for ordinary inventory items and leaves currency escrow, cross-service trade, relic custody transfer, and successful two-client proof as future work.
Phase 2: Imperial Frontier Mission
- Replace current four-room content with the Aurelian gate-fort mission. Complete.
- Preserve objective/lead text in
lookandstatus, plus canonical-id suggestions for common near-miss commands. - Add typed inventory categories: items, writs, relics, evidence, marks. Complete for player-facing status and inventory output.
- Add at least three actor processes with distinct chat/personality behavior;
keep them chat-only unless explicit scoped
Adventuregrants and tests land in the same slice. - Add one route requiring a capability-style permission.
- Add one objective with two acceptable outcomes.
- Add one narrative debrief that records rank, standing, evidence, and audit consequences.
Phase 3: Persistent Profile And Ledger Substrate
- Define
AdventureProfile,AdventureExpeditionCheckpoint, andAdventureLedgerRecordstructs with schema versions, content hashes, size limits, and host migration tests. - Add
AdventureProfileService,AdventureExpeditionService,AdventureLedger, andAdventureSaveStoreinterfaces before persisting profile or world state in ad hoc server maps. - Prove a local baseline first: profile save/load, ledger append/replay, and
explicit expedition checkpoint/resume through RAM-backed or disk-backed
Store/Namespace. - Keep ordinary client launch fresh by default; require an explicit resume command or profile option before loading an active expedition checkpoint.
- Add one rejected stale-checkpoint write and one rejected wrong-profile load to QEMU or host-level proof coverage.
Phase 4: User-Owned Browser Save Vault
- Define
UserSaveCapsuleand browser transport semantics for private encrypted profile, settings, and explicit expedition checkpoint backups. - Use Google Drive
appDataFolderor Firebase user documents as opaque capsule transports only; browser-held OAuth/Firebase tokens must not enter capOS game services. - Add tamper, wrong-profile, stale-version, replay, unknown-content, and oversized-capsule rejection tests before real provider adapters.
- Keep public world state, multiplayer outcomes, reward witnesses, and market receipts out of user-owned blobs.
Phase 5: Compact Authority-RPG Loop
The next implementation phase should build on the pure targeted-combat
foundation from commit f149119, not reopen broad calendar, market,
construction, agent-NPC, federation, or worldline systems. The goal is one
excellent expedition loop where authority is RPG power: choose writs and
companions, enter a dangerous site, discover authority conflict, fight or
negotiate under pressure, delegate or revoke power, extract, and gain reach for
future missions.
- Generate combat profiles from CUE for current mobs and validate malformed
zones, damage kinds, alert groups, recognition thresholds, and stealth
references through
make generated-code-check. - Integrate generated combat profiles into
adventure-serverso inspected attacks use deterministic zone damage, fatigue, interruption, recognition, and alert helpers. Clients must not submit computed damage. - Extend parser/proof coverage only as needed for unambiguous authority-RPG
commands:
attack <mob> [zone] [with gladius],cast <spell> at <mob> [zone], and the firstchallenge authority <target>authority-combat alias. - Add one authority-attacking enemy behavior to the existing expedition: a forged route/custody claim, stolen custody token, seal conflict, corrupt revocation, or old-law wraith claim that can be inspected, exposed, bound, or revoked.
- Treat writs as loot. Add one fixed-seed or authored writ modifier with a meaningful drawback; print issuer, scope, expiry, delegation rules, revocation conditions, modifier, and drawback in inspect/status output, and enforce the drawback in service logic.
- Add one delegation-buildcraft proof using an existing companion. A trait such
as loyalty, competence, fear, reputation, or doctrine should change how a
delegated
ward-writor custody authority behaves and explain the cause. - Add one reach-based debrief unlock, such as Archive evidence verification, Temple vault custody upgrade, Signal tower remote revocation, or appointing one field deputy. This should unlock a future verb or jurisdiction, not generic damage or health.
- Keep denial rewarding: at least one new authority denial should reveal a lead about hidden jurisdiction, forged authority, missing witness, rival claim, or alternate route.
- Prove the slice with pure Rust tests for deterministic rules and one
adventure-scenario-testpath covering inspected targeted attack, authority threat/lead, writ drawback, delegation consequence, and reach unlock. The shell transcript should remain representative parser and smoke coverage.
Broad systems remain deliberately demoted until this loop is strong: calendar/season gameplay, regional market order books, construction jobs, artifact/enchantment production, optional agent NPCs, MMO/open economy work, federation, and worldlines should not be treated as next local sequencing truth for implementation agents.
Phase 6: Structured Command Session
- Move from app-owned
StdIOparsing toCommandSession. - Expose dynamic command metadata and completions.
- Keep a text adapter for QEMU scripts.
Phase 7: Multiplayer Authority Proof
- Do not start this phase until Adventure and chat authority use session-bound caller identity, or future broker-granted service facets, rather than player receiver-selector identity. The first bounded slices key local player labels from live caller-session metadata.
- Add local multi-client party state keyed by service-created player objects, with explicit invite, accept, leave, and delegation commands.
- Add one deterministic co-op combat or ward puzzle where one player assists another without receiving unrelated inventory authority.
- Add one opt-in duel or sparring proof with scoped harmful authority and a rejected unauthorized attack path.
- Add bounded trade offers for ordinary loot and reject relic transfers unless custody authority permits them. The proof must show the transfer coordinator cannot duplicate, lose, or partially transfer an item when offers go stale, cancellation races with acceptance, or a retry repeats the same request.
- Extend QEMU scripting to drive two clients or two command sessions through a stable multiplayer transcript.
Future Follow-Up: Golems, Gates, And Infrastructure
Golems should be imperial magotechnical infrastructure before they are enemies. They fit the setting as labor frames, cargo haulers, bridge-builders, sentries, field repair units, siege engines, and rare battlefield assets. Model each golem as a body, a bound core, and an energy source: the body defines role, the core defines identity and obedience, and the energy source defines endurance.
Initial golem types:
cargo-golem: moves sealed supplies or heavy relics only when granted a matching route authority.ward-golem: guards a shrine, vault, or gate and recognizes proof tokens rather than passwords.siege-golem: breaks barriers but requires multiple grants, such as engineer approval plus energy access.field-repair-golem: restores damaged ward anchors when supplied with materials and repair authority.corrupted-golem: obeys malformed or stale authority, making revocation and audit behavior visible in play.
A golem should not become ordinary inventory. The player receives scoped
command authority over it: inspect, wake, route, repair, bind, delegate, audit,
or revoke. A useful rule of thumb: order cargo-golem north-gate succeeds only
when the player holds both cargo-command and north-gate-route.
Gates and portals should be imperial route infrastructure: roads made executable. They move authority, troops, messengers, and supplies across the frontier. Standing at a gate is not enough; use requires a physical anchor plus a valid writ, seal, route token, ward key, or alignment state.
Gate components:
gate-anchor: fixed legal endpoint.route-writ: temporary authority to open one path.ward-key: faction or office authorization.stabilizer: consumable or repairable part that bounds usage.gate-log: inspectable audit trail for deterministic investigation evidence.
Gate constraints should create missions rather than decoration:
- gates open only between known anchors,
- heavy constructs require freight routes rather than personal routes,
- damaged gates can misroute, refuse cargo, or leak hostile entities,
- emergency gates can open one-way and revoke the route after use.
Follow-up mission candidates:
- Gate repair: recover a stabilizer, prove engineer authority, order a repair golem, and open a bounded evacuation route.
- Golem command: delegate a narrow task to a ward golem after presenting the correct seal.
- Logistics: move medicine, grain, or signal crystals through gate routes with cargo-size and route-authority limits.
- Investigation: inspect gate logs, compare seals, and identify which faction abused or forged route authority.
- Siege: choose between spending rare siege-golem command, negotiating gate access, or repairing an old military road.
- Containment: seal a corrupted portal while wizard-circle spells stabilize the breach and warrior-star formations defend the site.
Typed verb candidates for these later slices include bind, route,
repair, charge, open-gate, seal-gate, attune, stabilize,
trace-route, and audit-gate. They should remain scoped and revocable, and
status output should expose active seals, bound routes, charged spells,
wounded formation members, unstable wards, and delegated golem tasks.
Open Questions
- Should actor NPCs call the adventure service directly, or should they only communicate through chat in the first interesting version?
- How much randomized event timing can be allowed in production play before QEMU transcript coverage becomes brittle, given the smoke path runs under a fixed seed?
- Should shops and trades live inside
Adventureat first, or split into aMarketservice once two-party trade exists? - Should parties be mission-local objects, profile-level groups, or future session broker grants?
- What is the minimum PvP consent record that is useful for the game without overbuilding policy after the profile/ledger substrate exists?
- How are game-world shards/instances scoped: one shared world, one per campaign, one per party, or one per deployment? This determines whether faction standing, ledger history, and market state are global or per-shard.
- Where does the boundary sit between server-authoritative public history (visible to all players in a shard) and per-profile audit records that remain private even within the same shard?
- After the echo-only worldline proof, should the next federation slice be envoy visits, expedition bridges, market/custody transfer, or profile migration?
- What retention/archival policy applies to ledger records, debrief evidence, and chat history once the game runs long enough to accumulate them?
Follow-Up Proposal
After the Aurelian adventure design is implemented well enough to support stable profiles, ranks, evidence, debriefs, cosmetic items, and deterministic proof coverage, the game can grow a separate contributor-facing layer.
That follow-up is tracked in Contributor Quest Mechanics. It describes maintainer-witnessed “outer-world quests” for real capOS development work, such as fixing a full GitHub issue URL or improving QEMU proofs, and limits rewards to badges, temporary states, decorative items, and bounded game-only perks. It must not grant repository authority, OS authority, or any ability to mutate another player’s profile.
Design Grounding
Grounding files for this proposal:
CLAUDE.mdREADME.mddocs/proposals/index.mddocs/proposals/interactive-command-surface-proposal.mddocs/proposals/session-bound-invocation-context-proposal.mddocs/proposals/shell-proposal.mddocs/proposals/boot-to-shell-proposal.mddocs/backlog/runtime-network-shell.mddocs/proposals/service-object-capabilities-proposal.mddocs/backlog/stage-6-capability-semantics.mddocs/security/trust-boundaries.mddocs/proposals/cryptography-and-key-management-proposal.mddocs/proposals/volume-encryption-proposal.mddocs/proposals/storage-and-naming-proposal.mddocs/proposals/cloud-deployment-proposal.mddocs/proposals/contributor-quest-mechanics-proposal.mddocs/proposals/llm-and-agent-proposal.mddocs/proposals/hosted-agent-swarm-proposal.mddocs/proposals/remote-session-capset-client-proposal.mddocs/proposals/capos-repo-harness-engineering-proposal.mddocs/research/game-mechanics-prior-art.mddocs/research/plan9-inferno.mddocs/research/hosted-agent-harnesses.mddocs/backlog/aurelian-frontier.mddocs/backlog/hardware-boot-storage.mdschema/capos.capnpsystem-adventure.cuetools/qemu-shell-smoke.shdemos/adventure-client/src/main.rsdemos/adventure-server/src/main.rsdemos/adventure-npc-wanderer/src/main.rsdemos/adventure-npc-shopkeeper/src/main.rsdemos/capos-chat/src/lib.rs
Proposal: Contributor Quest Mechanics
How capOS can later use the adventure game as a playful interface for real open-source development work without confusing game rewards with repository authority.
Purpose
The current adventure proposal makes capability ideas playable inside capOS. This follow-up uses the same fiction and authority vocabulary to encourage real-world contributions to capOS itself.
An in-game officer, quartermaster, guild broker, temple witness, or academy scribe can issue “outer-world quests” such as:
Outer-world quest: fix https://github.com/<org>/<repo>/issues/123
Proof: merged PR linked to the issue and passing required checks
Reward: bug-hunter seal, review-lantern cloak clasp, +1 cohort standing
The goal is to make useful project work more visible and more fun:
- fix bugs,
- reproduce failures,
- write tests,
- improve docs,
- review security-sensitive changes,
- reduce flaky QEMU harnesses,
- triage issues,
- mentor new contributors,
- write design notes,
- run release checklists.
This must not become a substitute for maintainership, code review, or security policy. Real maintainers still decide what merges. The game records recognized contributions after normal project workflow has already accepted them.
Related Proposals
This proposal sits between the playful adventure substrate and the longer-term agentic-development substrate that already records how capOS sessions, reviews, and subagent runs are tracked. Read alongside:
docs/proposals/aurelian-frontier-proposal.mdis the base game that supplies profiles, ranks, evidence, debriefs, decorations, and deterministic proof coverage. Contributor quest mechanics layer on top of those structures and must not run before the base game is stable.docs/proposals/llm-and-agent-proposal.mddefines the language-model, embedder, and agent-runner capability surface. When an in-game NPC, quest issuer, or scribe is later voiced or reasoned over by a language model, the authority surface used is the typed LLM/agent capabilities from that proposal, with per-tool consent/stepUp/forbidden gating. Game-side rewards must never widen what an agent can do at the OS or repository level.docs/proposals/agentic-development-experiment-proposal.mdcovers the longitudinal study of agentic coding sessions, subagent dispatch, review agents, and session-recap tooling. Contributor quest evidence is a human-facing, reward-shaped overlay; the agentic-development experiment is the engineering-evidence overlay. They share the same underlying real-world contribution stream (merged PRs, closed issues, accepted proposals, review outcomes), but they intentionally keep separate ledgers: game-side rewards are cosmetic and reputational; agentic-development records are scientific observation and tooling artifacts. Neither overlay grants authority to the other.
Design Goals
- Make contribution paths discoverable for people who arrive through the demo.
- Reward real project progress without giving game systems repository power.
- Keep rewards mostly cosmetic, narrative, reputational, or convenience-level.
- Use full issue and PR URLs, commit hashes, and review records as evidence.
- Let maintainers mint or revoke recognition through explicit authority.
- Avoid incentives that make people spam issues, rush reviews, or optimize for game points over project quality.
- Preserve privacy: public contributions can be celebrated; private identity links require explicit consent.
Non-Goals
- No automatic merge, close, label, or assignment authority from the game.
- No token handling inside the game client.
- No paid bounty, token, cryptocurrency, or transferable reward system.
- No leaderboard that pressures security reviewers or maintainers into rushed public rankings.
- No reward that grants kernel, shell, broker, or repository authority.
- No reward that lets a player mutate another player’s profile or inventory.
Core Loop
The player sees an in-game quest board after the ordinary Aurelian campaign has enough profile state to matter.
- A trusted quest issuer publishes a bounded list of outer-world quests.
- The player claims or follows one quest, such as a GitHub issue.
- The player does the work outside the game through normal GitHub and review workflow.
- A maintainer or verifier records the accepted proof: merged PR, linked issue closure, accepted docs patch, reproduced bug log, or completed review.
- The game mints an in-world mark, badge, decorative item, state, title, or bounded perk.
- Debrief text explains what real contribution was recognized and why.
The game should phrase this as imperial frontier logistics rather than as a raw task tracker:
Quest Board: Outer Works
Issue 123: repair the failing run-adventure transcript.
Need: one merged fix and one reviewer witness.
Reward: Lanternwright badge, smoke-runner sash, cohort standing +1.
Reward Types
Rewards should be valuable enough to feel visible, but not strong enough to turn project contribution into a grind.
Badges
Badges are durable profile marks:
bug-hunter: fixed a confirmed defect.smoke-runner: repaired or improved a QEMU proof path.doc-cartographer: improved docs, backlog, roadmap, or proposal clarity.review-witness: completed a substantive review accepted by a maintainer.security-sentinel: closed or helped validate a security finding.first-boot-guide: helped a new contributor get a local boot/test flow.release-quartermaster: completed release or dependency audit work.
Badges are evidence-backed, not self-declared. A badge record includes the proof URL or commit hash, issuer, timestamp, and short reason.
States
States are temporary profile or expedition conditions:
on outer patrol: player has claimed or followed an issue.awaiting witness: contribution submitted, waiting for review/verification.maintainer witnessed: accepted proof exists, reward can be minted.needs reproduction: player is gathering logs or QEMU transcript evidence.blocked by design: task needs a proposal or maintainer decision first.
States should expire or be explicitly cleared. Stale states must not block the player from ordinary game progress.
Decorative Items
Decorative items are visible in the game world without granting project power:
- review lantern,
- smoke-runner sash,
- warded keyboard,
- cartographer map case,
- broken-panic trophy,
- issue-forged signet,
- release quartermaster ledger,
- first-boot camp banner.
Decorative items can appear in status, profile views, housing, tavern
dialogue, party banners, or debrief records.
Perks
Perks must stay bounded and low-risk:
- title choices in chat or debrief text,
- cosmetic room decorations,
- additional flavor dialogue from NPCs,
- small convenience options inside adventure missions,
- access to optional lore logs or museum rooms,
- non-combat party banner effects.
Avoid perks that create an optimal gameplay path only available to frequent contributors. The point is recognition and orientation, not a second economy.
Quest Types
First Contribution
Real-world task:
- make a first accepted contribution to capOS,
- pass the relevant checks,
- respond to review without needing maintainer rescue,
- leave the touched docs, tests, or code in a maintainable state.
Game framing:
- receive a recruit’s field mark,
- cross the first gate under supervision,
- return with a witnessed service record.
Reward examples:
first-gatebadge,- recruit sash,
- academy standing,
- optional mentor thank-you record.
This quest should reward the first accepted contribution once, not every small patch. It exists to make the contributor path visible and to recognize the friction of getting a local toolchain, QEMU proof, review loop, and project style working for the first time.
Bug Hunts
Real-world task:
- fix a confirmed GitHub issue,
- add regression coverage,
- preserve or improve relevant QEMU transcript assertions.
Game framing:
- track a breach,
- seal a faulty gate,
- prove the fix with a witness log.
Reward examples:
bug-hunterbadge,- broken-panic trophy,
- cohort standing.
Smoke Runner Work
Real-world task:
- improve
make run-*smoke stability, - add missing transcript assertions,
- reduce brittle sleeps,
- preserve password/log redaction.
Game framing:
- run the frontier signal route,
- repair a proof beacon,
- return with a clean gate log.
Reward examples:
- smoke-runner sash,
- transcript lantern,
- scout standing.
Documentation Cartography
Real-world task:
- improve
docs/tasks/README.md, backlog files, roadmap, proposal status, runnable demo docs, or research grounding, - remove stale status claims,
- clarify next-step sequencing.
Game framing:
- update the imperial route map,
- reconcile witness records,
- mark safe roads for new operators.
Reward examples:
doc-cartographerbadge,- map-case decoration,
- academy scribe title.
This should include small docs corrections, but the higher reward tier should require work that materially improves future contributors’ ability to navigate the project: clearer milestone state, sharper backlog decomposition, better runbook steps, or removal of misleading/stale status text.
Accepted Design Proposal
Real-world task:
- submit a design proposal that maintainers accept,
- ground it in existing project docs and relevant research when needed,
- update proposal indexes and any affected roadmap/backlog status,
- respond to review by narrowing unsafe scope or documenting tradeoffs.
Game framing:
- present a plan before the imperial council,
- have temple witnesses validate the authority chain,
- receive a sealed charter for future field work.
Reward examples:
charter-writerbadge,- council seal,
- strategy-table decoration,
- design-witness title.
“Accepted” means the proposal is merged or explicitly recorded as accepted in the repository. Drafts, brainstorms, and abandoned proposals can still receive ordinary participation flavor, but they should not mint the accepted-design badge.
Security Witnessing
Real-world task:
- review trust-boundary changes,
- close a review-finding task record,
- add proof coverage for hostile input,
- update security docs when a boundary changes.
Game framing:
- testify before the temple annex,
- certify relic custody,
- expose a forged writ.
Reward examples:
security-sentinelbadge,- temple witness seal,
- lawful-custody title.
Mentorship And Onboarding
Real-world task:
- help a contributor get local builds, tests, QEMU, or docs working,
- improve setup notes based on observed friction,
- pair on a first small patch.
Game framing:
- guide a new recruit through the gate yard,
- issue safe training writs,
- staff the frontier academy.
Reward examples:
first-boot-guidebadge,- recruit banner,
- academy standing.
Evidence Model
Each recognized contribution should produce a bounded ContributionEvidence
record:
quest_id
kind
full_issue_url
full_pr_url
commit_hash
issuer
subject_profile
summary
accepted_at
reward_ids
revoked
The evidence record is not a legal identity document. It is a project-visible game record that says a maintainer or authorized verifier recognized a public contribution.
Use full URLs for GitHub issues and PRs. Do not rely on shorthand issue numbers without repository identity.
Capability Mapping
| Game concept | capOS concept |
|---|---|
| quest board | read-only issue feed or maintainer-published mission list |
| quest claim | optional local profile state, not GitHub assignment |
| proof URL | evidence record input |
| maintainer witness | authority to certify accepted contribution evidence |
| badge mint | scoped profile mutation capability |
| reward revocation | audit-backed correction capability |
| decorative item | non-authority profile state |
| contributor standing | broker input only for game/social features |
The game must never turn a decorative badge into OS or repository authority. If a future broker uses contributor standing, it must be for game/social features unless a separate security design explicitly says otherwise.
Service Architecture
Keep this out of the core adventure service until the base game is stable.
Target split:
flowchart TD
GitHub[GitHub / Forge] --> Importer[Quest Importer]
Maintainer[Maintainer Session] --> Witness[Contributor Witness]
Importer --> Board[Quest Board]
Witness --> Rewards[Reward Mint]
Rewards --> Profile[AdventureProfileService]
Rewards --> Ledger[AdventureLedger]
Client[Adventure Client] --> Board
Client --> Profile
Initial implementation can be manual:
- a checked-in or manifest-provided quest list,
- maintainer-issued proof records,
- no network calls from capOS to GitHub,
- no tokens in the demo VM.
Reward records use the adventure persistence split:
- quest definitions and fixture issue lists are content/catalog data;
- claims and temporary states are profile state;
- accepted proof, issuer, timestamp, reward mint, and reward revocation are append-only ledger records;
- badge, title, decoration, and cosmetic summary fields are applied to
AdventureProfileServiceonly after a matching ledger record exists.
Later implementation can add a ForgeConnector service with narrow read-only
authority:
- list selected issues by repository and label,
- fetch merged PR metadata,
- verify commit hashes or check statuses,
- never mutate GitHub state.
Any mutating forge integration, such as labels or comments, requires a separate security proposal and must not be hidden inside the game service.
Abuse And Incentive Controls
The system should discourage low-quality contribution farming.
- Rewards are maintainer-witnessed, not automatically minted from activity.
- Repeated trivial fixes should not produce unbounded badges.
- Security review rewards should avoid public speed rankings.
- Issue claiming inside the game must not block real contributors on GitHub.
- Reward descriptions should name quality criteria: tests, docs, review, and accepted project value.
- Maintainers can revoke or amend mistaken rewards with an audit note.
- Public profiles should allow hiding or unlinking personal identity details.
Privacy And Identity
Players may want to keep game profiles and GitHub identities separate.
Rules:
- Linking a game profile to a GitHub account must be explicit.
- Public GitHub evidence can be recorded as a URL without exposing private tokens or session state.
- Private email addresses, tokens, and local machine paths must never appear in reward records.
- The game can show “verified public contribution” without requiring the player to reveal more than the accepted public artifact.
- If OIDC or passkeys later connect identity, use the user/session/policy proposals rather than adding identity shortcuts to the game.
Command Surface
Candidate commands after the base adventure command surface exists:
quests
quest inspect <quest-id>
quest follow <quest-id>
quest evidence <quest-id>
badges
badge inspect <badge-id>
decorate <slot> with <item-id>
title set <title-id>
profile share <public|private>
Maintainership commands must be separate and authority-gated:
quest publish <quest-id>
quest witness <quest-id> for <profile> with <proof-url>
reward mint <reward-id> for <profile>
reward revoke <reward-id> for <profile>
These are typed service calls, not shell-special strings.
Implementation Phases
Phase A: Manual Recognition
- Add this proposal to the proposal index and docs summary.
- Define bounded quest, evidence, badge, state, decoration, and perk data records.
- Store accepted proof and reward mint/revocation as append-only
AdventureLedgerrecords, then derive visible profile badges from those records. - Add a small checked-in sample quest list using full GitHub issue URLs.
- Add manual witness records in test content.
- Show badges/decorations in profile/status output.
- Add QEMU proof that a witnessed quest mints a cosmetic badge and that an unwitnessed claim does not.
Phase B: Game Integration
- Add an in-game quest board location after the Aurelian campaign.
- Add NPC dialogue that points contributors toward real project workflows: reproduce, test, document, review, and submit.
- Add debrief text that ties accepted contribution evidence to in-world recognition.
- Keep all rewards non-authority unless a separate reviewed design grants narrow game-only authority.
Phase C: Forge Read Model
- Add a read-only forge import path for selected repositories and labels.
- Verify merged PRs, issue links, check statuses, and commit hashes without storing tokens in game state.
- Add host tests for malformed URLs, cross-repository ambiguity, oversized metadata, and stale proof records.
- Add QEMU smoke using fixed fixture data rather than live network calls.
Phase D: Community Events
- Model release weeks, bug hunts, docs sprints, and review days as bounded seasonal quest boards.
- Add group recognition for cohorts without ranking individuals by raw activity count.
- Add opt-in public profile export for contributor showcases.
Open Questions
- Who can witness rewards before durable maintainer identity exists inside capOS?
- How should reward revocation be displayed without creating public shaming mechanics?
- Can issue labels be imported read-only without making GitHub availability a boot or smoke-test dependency?
- What is the smallest useful “perk” that feels meaningful while remaining non-authoritative?
- Should a local-only demo use fictional fixture issue URLs, real capOS issue URLs, or both?
Design Grounding
Grounding files for this proposal:
docs/proposals/aurelian-frontier-proposal.mddocs/backlog/aurelian-frontier.mddocs/proposals/llm-and-agent-proposal.mddocs/proposals/agentic-development-experiment-proposal.mddocs/proposals/storage-and-naming-proposal.mddocs/proposals/interactive-command-surface-proposal.mddocs/proposals/user-identity-and-policy-proposal.mddocs/proposals/oidc-and-oauth2-proposal.mddocs/proposals/security-and-verification-proposal.mddocs/security/trust-boundaries.mddocs/tasks/README.md
No docs/research/ report is directly applicable at this stage. This proposal
is community workflow and game-design planning layered on existing project
proposal documents, not a new OS/runtime architecture claim.
Proposal: Public Release and Maintainer Boundaries
How capOS can become publicly visible without accidentally promising general support, production security, broad feature review, or an always-on community moderation role.
This proposal is the maintainer-load and release-governance layer over the project’s existing security review and verification tracks. The owning trackers for the risk and process surfaces this document references are:
docs/design-risks-register.mdfor the consolidated index of long-horizon design risks and open architectural questions. The hygiene and split gates below do not redefine those risks; they only constrain what may be claimed publicly while they remain open;docs/proposals/security-and-verification-proposal.mdfor the security review vocabulary, trust-boundary checklist, and verification tracks (host tests, model checks, fuzzing, QEMU smokes, review). The “evidence vs. assurance” wording inSECURITY.mdbelow is the public-facing framing of that proposal’s conclusion that current checks are engineering evidence, not an independent audit;docs/proposals/repository-composition-proposal.mdfor the core/sibling scope rule the adventure split and history-rewrite gates below enforce.
Purpose
Open-sourcing capOS should be a controlled publication step, not a commitment to operate a public support queue. The project can benefit from public inspection, reproducible demos, and selected outside contributions while still rejecting vague bug reports, low-effort questions, unsupported feature requests, and large drive-by changes.
The first public release should optimize for:
- accurate public claims,
- narrow maintainer commitments,
- security honesty,
- reproducible build and QEMU evidence,
- curated contribution paths.
It should not optimize for community growth, support volume, or broad roadmap input.
Release Position
capOS should be described publicly as:
- experimental research software,
- x86_64/QEMU-first,
- capability-system focused,
- not production-ready,
- not independently security-audited,
- not suitable for hostile multi-user deployment,
- not a supported general-purpose OS.
Top-level public wording should be direct:
capOS is experimental research software.
It has not undergone an independent security audit. Treat the current code as
unsafe for production use, unsafe for hostile multi-user deployment, and
unsuitable for protecting real secrets or availability-sensitive workloads.
Security boundary documents in this repository describe design intent and
current proof coverage, not certification or a guarantee of correctness. QEMU
demos, host tests, model checks, and review notes are engineering evidence.
This statement belongs in README.md and SECURITY.md before the repository
is made public.
Non-Goals
- No public support promise for local build environments.
- No support guarantee for platforms beyond the documented QEMU path.
- No expectation that maintainers answer questions already covered by docs.
- No roadmap voting or feature-request queue.
- No public chat server at launch.
- No production-security claim.
- No response-time SLA for ordinary issues or pull requests.
- No acceptance of large unplanned subsystem PRs.
Maintainer Load Model
The initial public repository should run in source-visible mode:
- The code, docs, and reproducible demo paths are public.
- Issues are enabled only through strict templates.
- Pull requests are allowed but scoped by contribution rules.
- Discussions and chat are disabled at launch.
- Maintainers publish a small set of curated tasks they are willing to review.
Public visibility does not imply public support. Maintainers may close issues without extended discussion when they are:
- usage questions answered by
README.md,docs/, or command output; - vague bug reports without exact commands, commit hash, host/QEMU versions, or relevant log excerpts;
- broad feature requests outside
docs/roadmap.md; - requests to support unrelated platforms, package managers, or deployment environments;
- debates about project direction without a concrete proposal;
- large implementation PRs that did not start from an accepted design.
Closed issues should usually receive one short reason and, when possible, a link to the relevant document. Maintainers should not spend review time rewriting low-quality reports into actionable work.
Issue Intake
The public repository should not allow blank issues at launch. Use issue templates for:
- reproducible bug report,
- QEMU transcript failure,
- documentation problem,
- design proposal,
- private security report pointer.
Bug reports should require:
- capOS commit hash,
- host OS and architecture,
- QEMU version when QEMU is involved,
- exact command run,
- expected result,
- actual result,
- relevant bounded log excerpt.
Default labels should include:
needs-repro,needs-design,docs,qemu,security,not-planned,support-boundary,maintainer-curated,agent-assisted,needs-human-owner,agent-spam,review-capacity,too-large.
Suggested policy:
- close
needs-reproreports after 14 days without requested reproduction details; - close broad roadmap requests as
not-planned; - convert recurring valid questions into documentation, then close later duplicates by linking the docs;
- do not promise ordinary issue response times.
Automated issue batches should be closed without triage when they are not attached to a concrete human-owned reproduction, design proposal, or patch. Large generated reports are not useful by themselves; they must be reduced to one actionable issue with exact evidence.
Pull Request Intake
Small fixes, regression tests, docs corrections, and narrow bug fixes may be accepted without prior design discussion when they fit existing architecture.
Large changes should start as an accepted design proposal or maintainer-curated issue. This includes changes to:
- kernel capability semantics,
- schema or ABI,
- userspace runtime behavior,
- boot flow,
- security boundaries,
- dependencies,
- hardware or architecture support,
- public command surfaces,
- persistent storage,
- networking beyond the selected milestone.
Every non-trivial PR should state:
- motivation,
- changed trust boundary or
docs-only, - commands run,
- QEMU proof when behavior is user-visible,
- generated-code and dependency notes when relevant,
- design proposal link for large changes.
Maintainers may close unplanned large PRs without detailed review. That rule is necessary to avoid turning public visibility into unpaid architecture consulting.
Agent-Assisted Contributors
Public capOS will likely attract contributors using Claude Code, Codex, and similar tools to run long agent loops. That is acceptable only when the human owner remains accountable for the work.
Agent-assisted work must:
- disclose that automation was used;
- have one human owner who understands the diff and can answer review questions;
- stay attached to an accepted issue, proposal, or maintainer-curated task unless it is a narrow obvious fix;
- include exact verification commands and relevant output summary;
- preserve the repo’s worktree, review, and security rules;
- keep generated logs, prompts, transcripts, and local secrets out of the repository;
- avoid opening batches of speculative issues or PRs from broad scans.
Suggested public policy:
- one active non-trivial PR per new external contributor by default;
- no more than two active PRs per established external contributor unless a maintainer explicitly raises the limit;
- agent-generated drive-by refactors, mass lint churn, dependency churn, or
roadmap reshuffles are closed as
too-largeornot-planned; - PRs without a responsive human owner are closed as
needs-human-owner; - repeated automated noise is labeled
agent-spamand may be blocked at the account level.
The merge rule still stays positive: a useful PR with the required review context, relevant verification, and no unresolved blocking findings should be merged. The throttles exist to protect review capacity, not to reject good work because an agent helped produce it.
Review Capacity
Maintainers should publish the current review mode in a pinned issue, project
view, or CONTRIBUTING.md section:
open: accepting curated external work;limited: reviewing only bugs, security reports, and maintainer-curated tasks;paused: no new external PR review except private security reports.
When review capacity is full:
- new feature PRs may be closed as
review-capacitywithout technical review; - stale
needs-reproissues may be closed after the documented timeout; - stale
needs-changesPRs may be closed after the maintainer-requested changes are not addressed; - emergency capacity is reserved for private security reports and regressions in documented release/demo paths;
- public contributors should be pointed toward existing reviewed tasks instead of creating new backlog entries.
Passing CI is evidence, not entitlement to review. The review bar remains
REVIEW.md, useful project value, relevant verification, and no unresolved
blocking findings.
Workflow Transition
The current private workflow uses local branches, dedicated worktrees,
docs/tasks/README.md, backlog files, and local review loops as the planning and review
source of truth. That is appropriate while capOS is mostly private and
automation-heavy, but it is not the right public collaboration model forever.
At some point after public release, the project should migrate public work to:
- GitHub Issues for user-visible bugs, accepted design tasks, support-boundary closures, and curated contributor work;
- GitHub Pull Requests for review, CI evidence, design-grounding notes, and merge decisions;
- GitHub Projects for milestone planning, sequencing, and status views.
That transition should replace roadmap/backlog-driven operational planning for
public work. docs/roadmap.md should remain a high-level narrative roadmap,
not the active task board. docs/backlog/ should become design context and
historical decomposition, not the queue maintainers are expected to triage by
hand.
The migration should not happen in the first quiet source-visible launch. It needs explicit gates:
- issue templates and labels are stable enough to reject low-quality reports;
- PR templates capture trust-boundary, verification, and design-grounding requirements;
- CI exposes the baseline checks public contributors can run and cite as evidence;
REVIEW.mdremains the merge bar: a useful PR with the required review context, relevant verification, and no unresolved blocking findings should be merged;- maintainers have decided which roadmap/backlog items become public issues;
- a GitHub Project exists for the selected public milestone;
- stale local backlog entries have been either converted to issues or marked as historical context;
- private/security-sensitive planning remains outside public issues until the security policy says where it belongs.
During the transition, avoid dual sources of truth. If a task is public and tracked in GitHub Projects, its active status should live there. Repo docs should link to the project or issue instead of duplicating the live state.
Planning Source Migration Model
The private planning files should migrate by role, not by copying their whole contents into GitHub:
docs/roadmap.mdremains a narrative of visible outcomes and design order. It should name the current public milestone and link to the GitHub Project view, but it should not list every public task status.- Local task records under
docs/tasks/remain the private/operator surface for local agent loops, security-sensitive tasks, and unreleased integration state. For public work, they should contain issue or PR URLs plus the reason the item is active, not a duplicate checklist. docs/backlog/becomes design context, decomposition history, and migration notes. Actionable public tasks move to issues with one owner, acceptance criteria, verification gates, and links back to the relevant design/backlog section.- Private unresolved-review work remains in task records until a finding is safe to disclose. Public findings become GitHub issues only after the security policy says disclosure is acceptable; the local task record should then link to the issue instead of restating live public status.
Issue bodies should preserve the review discipline already used locally: problem statement, affected files or trust boundary, design-grounding links, expected proof command, QEMU or host-test acceptance criteria, and explicit non-goals. GitHub Projects should track only status and sequencing fields such as milestone, area, risk, owner, review mode, and blocked-by. The design rationale belongs in docs or issue discussion, not in project-card metadata.
Before converting a backlog slice, maintainers should de-duplicate it against current docs and closed commits. A converted issue should be either:
- a visible outcome that maps to a roadmap milestone;
- a review finding with concrete remediation and verification;
- a scoped implementation task under an accepted proposal; or
- a documentation correction tied to a known drift point.
Avoid migrating stale checklists verbatim. If an item has already landed, move
the historical note to docs/changelog.md or leave it in commit history. If an
item is still speculative design, keep it in the proposal/backlog until a
maintainer is ready to review implementation.
Communication Channels
Launch with:
- GitHub Issues for reproducible bugs and accepted work tracking;
- pull requests for reviewable scoped changes;
- private security contact from
SECURITY.md.
Do not launch with:
- Discord,
- Matrix,
- Slack,
- public support email,
- office-hours promises,
- GitHub Discussions unless maintainers explicitly budget moderation time.
Chat systems create interrupt-driven support load and reward questions that should become docs or issue templates. They should wait until the project has a known moderation policy and enough maintainers to enforce it.
Security Statement
SECURITY.md should state:
# Security Policy
capOS is experimental research software and has not undergone an independent
security audit. Do not use it to protect real secrets, production workloads, or
hostile multi-user environments.
Security reports are still useful. Report suspected vulnerabilities privately
to <security contact>.
Do not open public issues containing exploit details, private keys, tokens,
credential material, or instructions for attacking third-party systems.
At this stage, maintainers do not provide a security-fix SLA. Accepted reports
are triaged based on project relevance, reproducibility, and impact on
documented capOS security boundaries.
The security page should link to:
docs/security/trust-boundaries.md,docs/security/verification-workflow.md,docs/tasks/README.md,docs/trusted-build-inputs.md.
It should also distinguish security evidence from security assurance. Current
checks, QEMU smokes, Kani proofs, Loom models, fuzz targets, and reviews are
useful engineering evidence. They are not an independent audit. The evidence
vs. assurance line is the public-facing framing of
docs/proposals/security-and-verification-proposal.md; open
risks the public claim must remain silent about are tracked in
docs/design-risks-register.md.
Repository Hygiene Gates
Before public visibility:
- add a license file;
- add
CONTRIBUTING.md; - add
SECURITY.md; - add issue templates and a pull request template;
- add top-level experimental/no-audit wording;
- split the adventure game server, client, NPC processes, content generator, and proposal/backlog/demo docs out of this repository into a dedicated adventure repository before public visibility (see “Adventure Repository Split” below);
- rewrite git history into a curated public-import history before publication (see “Git History Rewrite” below); do not publish the current private agent-driven commit log unchanged;
- run a secret and history scan against the rewritten history, not just the current tree;
- scan the rewritten history and the public tree for personal identifiers (maintainer GitHub account names, personal cloud project / bucket names, personal home-directory paths, personal email addresses, personal host or user names) and either remove the artifact or replace those values with neutral placeholders or environment-driven configuration;
- remove or sanitize local operator infrastructure that is not part of the public OS: maintainer-private CI configs, private cloud project references, maintainer-side automation services and scripts, and any other path that exists only because the current maintainer runs the project that way;
- remove local-only artifacts from the public tree;
- run the documented baseline checks;
- create only maintainer-curated public starter issues.
Local-only and generated artifacts need an explicit policy before publication. Source-controlled generated bindings are acceptable when freshness checks remain documented. Generated adventure content moves with the adventure split and is no longer a capOS-repository hygiene concern after the split lands. Local caches, build output, QEMU images, transient manifests, automation logs, maintainer-private cloud infrastructure configs, and automation service units should not be part of a public source snapshot.
Adventure Repository Split
The Local MUD/adventure prototype, NPC-as-process fleet, Aurelian expedition content, save vault work, and contributor-quest framing exist primarily as a shared-service demo and as motivation for service-object capabilities and agent-shell tool surfaces. They are not part of the capability-OS core claim the first public capOS release should defend.
Keeping adventure in the same public repository would:
- conflate “experimental research OS” with “experimental research game”, making the public scope statement and security boundaries harder to defend;
- attract game-feature requests, balance debates, and content contributions through the same maintainer queue as kernel/capability/security issues;
- pull narrative, world-building, and content-generation review onto the same review capacity that should be focused on capability semantics, schema/ABI, security boundaries, and the documented QEMU proofs;
- expand the public attack surface and the public claim surface beyond what the OS work is ready to defend.
Before the first public capOS release, adventure-specific code, content, generators, proposals, backlog, and demo docs must move to a dedicated adventure repository that depends on capOS as a downstream consumer. In the capOS repository, only the minimum hooks needed for capOS’s own service-object and shared-service-demo proofs may remain, and they must be defensible without reference to game design.
Concretely the split must, at minimum, relocate or remove:
demos/adventure-server/,demos/adventure-client/,demos/adventure-content/,demos/adventure-chat-actors/,demos/adventure-npc-shopkeeper/,demos/adventure-npc-wanderer/,demos/adventure-scenario-test/, and any future adventure-named demo crates;tools/adventure-content-gen/and any adventure-specific generator fixtures or content blobs;- adventure-named manifests (for example
system-adventure.cueand any derivedmanifest-adventure.bin/capos-adventure.isobuild artifacts),make run-adventure*targets, and harness scripts that exist only to drive adventure demos; - adventure-only modes embedded in shared tooling – for example the
drive adventure/assert adventuremodes intools/qemu-shell-smoke.shand any adventure-shaped output handling inside other sharedtools/qemu-*-smoke.shortools/qemu-*-harness.shscripts – which must be removed from the core scripts and re-homed in the adventure repository; - adventure-specific build/check tooling such as
tools/check-generated-adventure-content.sh, thegenerated-adventure-content-checkMakefile target, and any adventure-named recipe stanzas,MANIFEST_SOURCE/MANIFEST_BIN/ISOoverrides, and.PHONYentries (run-adventure,generated-adventure-content-check, etc.) that exist only to build or exercise adventure demos; - adventure entries in
DEPENDENCY_POLICY_MANIFESTS/LOCKFILES,cargoworkspace lists, and any other Makefile or CI list that namestools/adventure-content-gen/or adventure-named crates; docs/proposals/aurelian-frontier-proposal.md,docs/proposals/contributor-quest-mechanics-proposal.md(when its scope is game-shaped),docs/backlog/aurelian-frontier.md,docs/demos/adventure.md, and any other adventure-shaped narrative or content docs;- adventure-specific entries in
docs/tasks/README.md,docs/roadmap.md, task records, and changelog narrative — keep only items that describe capability-OS invariants the core repository must continue to defend after the split.
The adventure-files/paths/assets prohibition in the history-rewrite gate
applies to all of the above. The split is not complete while
adventure-only logic still lives in shared tooling under names that do
not say “adventure” – for example a generic-looking drive mode that
in practice exists only to script adventure transcripts. A reviewed
inventory of these shared-tooling crossings is part of the split task
and must be cleared before the curated public-import history is built.
The split must not silently weaken capOS’s own proofs. Any adventure-anchored
service-object, IPC, save-store, ledger, or chat-identity invariant currently
enforced inside kernel/, capos-config/, capos-rt/, init/, or
shell/ and exercised only by an adventure demo needs an equivalent
non-adventure proof in capOS before the adventure code leaves, or the
invariant moves to the new repository together with its proof. A reviewed
inventory of these crossings is a prerequisite of the split task, not a
follow-up.
The split is a release-time gate, not a routine refactor. Before it lands:
- the new adventure repository must build against a tagged or pinned capOS reference so the cross-repo dependency direction is verified;
- adventure-anchored capOS proofs must either be replaced by non-adventure equivalents or be moved to the adventure repository;
- the public-release readme, security, and roadmap statements must reflect the narrowed capOS scope;
- the rewritten public-import history (below) must not contain adventure-specific commits, paths, or assets.
Git History Rewrite
The current commit history was produced under a private automation-heavy workflow with high commit volume, mid-task narrative messages, exploratory branches, intermediate worktree state, and adventure-shaped commits that no longer belong in capOS once the adventure split is enforced. Publishing that history unchanged would:
- expose private workflow detail (worktree names, automation checkpoints, intermediate planning notes) that is not valuable to public readers;
- make the public claim surface match every speculative direction explored privately, instead of the actual current capability-OS shape;
- carry adventure-specific paths, assets, and commit messages into the OS repository the split is meant to leave behind;
- complicate any future secret/history scan by mixing release-relevant history with disposable automation narrative.
Before the first public release the maintainer must produce a curated public-import history. The acceptable approach is one of:
- a single squashed initial-public-import commit on a fresh public branch,
with a short message that describes the project state at publication and
links to
docs/changelog.mdfor historical milestone narrative; or - a small number of curated, signed commits that group related capability subsystems and proofs in a way that is reviewable by an outside reader, with the same forward link to the changelog.
The rewritten history must:
- contain no adventure-specific files, paths, generated content, or commit messages;
- contain no private automation narrative, private worktree names, internal reviewer-only notes, or local-only artifacts;
- contain no personal identifiers (maintainer GitHub account names, personal home-directory paths, personal cloud project / bucket names, personal email addresses, personal host or user names) in committed files, commit messages, author/committer fields beyond the maintainer’s intended public attribution, or generated content;
- pass a secret and history scan and a personal-identifier scan over the rewritten commits, not just the current tree;
- carry an explicit license and attribution from the first public commit;
- preserve
docs/changelog.mdas the narrative record of completed milestones and reviews so historical context is not lost when the raw commit history is collapsed.
Force-pushing a rewritten history over an already-public branch is not the intended use of this gate. The rewrite happens before the repository becomes public. The current private repository may continue to exist as an internal mirror; only the rewritten history is published.
Repository Composition
The adventure split is the first concrete instance of a wider rule: the
public capOS repository should defend a narrow, recognizable claim, and
non-core tracks should live in downstream repositories that depend on
capOS rather than ride along inside it. The detailed scope rule, the
full list of split candidates (whitepaper, public website, userspace
network stack, production remote-access services, protocol stacks,
language runtimes, GPU, agent shell, cloud images, volume encryption),
the when-to-split criteria, the cross-repository mechanics, and the
intended cap-os-dev GitHub organization placement live in
docs/proposals/repository-composition-proposal.md.
For public-release readiness the only repository-composition gates are:
- the adventure split (above) must be complete;
- the rewritten public-import history (above) must respect the core/sibling
scope rule defined in
docs/proposals/repository-composition-proposal.md, so that no sibling-bound paths or assets enter the public capOS history; - public-facing READMEs, security pages, and roadmap statements must describe the narrowed core scope rather than the pre-split private workspace;
- if the GitHub organization move to
cap-os-devhappens before public release, the public-import history is published into that organization rather than into a personal account.
Other splits described in the Repository Composition proposal happen on their own readiness timelines and are not public-release prerequisites.
Launch Phases
Phase A: Quiet Source-Visible Launch
- Repository is public.
- Issues use templates.
- Discussions and chat remain disabled.
- PRs are accepted only for narrow fixes or maintainer-curated work.
- README and security pages make the experimental/no-audit status explicit.
Phase B: Curated Contribution Phase
- Publish a small list of tasks maintainers are willing to review.
- Add
good-first-issueonly to tasks with enough context and an expected verification command. - Close unrelated feature requests instead of expanding the backlog.
- Move repeated valid questions into docs.
Phase C: Broader Community Phase
Only after Phase A and B produce manageable signal:
- consider GitHub Discussions;
- consider a public chat room;
- broaden accepted issue categories;
- publish a maintainer rotation or moderation policy if more maintainers exist.
This phase is optional. capOS can remain source-visible and selectively contribution-friendly indefinitely.
Phase D: Hosted Public Demo
A public WebShellGateway or Adventure Game deployment is a separate operational milestone, not a side effect of making the source repository public. A Reddit-scale traffic spike should be assumed before any public link is posted.
The hosted demo must be treated as an untrusted public service:
- demo sessions use guest-only or demo-only profiles;
- no public demo path grants operator shell authority;
- no public demo shell receives raw
BootPackage, broadProcessSpawner, provider-token, model-admin, storage-admin, or unrestricted network authority; - each browser session gets isolated caps, bounded resources, and deterministic teardown on logout, tab close, timeout, crash, or quota exhaustion;
- sessions have maximum wall-clock duration, idle timeout, input/output byte limits, process/cap/resource quotas, and bounded transcript storage;
- per-IP, per-session, and per-account rate limits exist before launch;
- queueing and overload pages are preferred to silently starting unbounded VMs or capOS sessions;
- maintainers have a kill switch that disables new sessions without affecting repository access;
- logs are redacted and retention is documented;
- the public page states that the demo is best-effort, may disappear, has no persistence guarantee, and is not a support channel.
Adventure Game traffic adds game-specific gates:
- anonymous players cannot mutate authoritative public world state;
- public profiles, rewards, and contributor-quest identity links are opt-in;
- saved state, if offered, goes through the reviewed
AdventureProfileService,AdventureLedger, andAdventureSaveStoreboundaries; - public multiplayer uses service-created player objects, not user-selected identity badges;
- NPC or agent-assisted game features hold only narrow per-NPC or demo caps;
- abuse reports and moderation controls exist before public chat-like features are exposed.
The first hosted demo should be capacity-limited and disposable. It should not share credentials, sessions, storage, or authority with maintainer-operated development environments.
Public Claim Checklist
Public-facing docs should avoid claiming:
- production readiness,
- real-hardware support beyond documented experiments,
- secure remote access,
- independently audited security,
- compatibility with ordinary OS workloads,
- stable ABI,
- stable contributor API,
- support for every future roadmap track.
They may claim only what current docs and verification support:
- x86_64/QEMU-focused research OS;
- typed capability interfaces;
- capability-ring transport;
- current shell/login demos;
- selected service demos;
- documented verification commands;
- known limitations and future tracks.
Design Grounding
Grounding files for this proposal:
README.mddocs/tasks/README.mdREVIEW.mddocs/roadmap.mddocs/design-risks-register.mddocs/proposals/aurelian-frontier-proposal.mddocs/proposals/boot-to-shell-proposal.mddocs/proposals/contributor-quest-mechanics-proposal.mddocs/proposals/llm-and-agent-proposal.mddocs/proposals/mdbook-docs-site-proposal.mddocs/proposals/repository-composition-proposal.mddocs/proposals/resource-accounting-proposal.mddocs/proposals/security-and-verification-proposal.mddocs/proposals/shell-proposal.mddocs/proposals/user-identity-and-policy-proposal.mddocs/backlog/aurelian-frontier.mddocs/backlog/runtime-network-shell.mddocs/security/trust-boundaries.mddocs/security/verification-workflow.mddocs/trusted-build-inputs.md
No docs/research/ report is directly applicable. This proposal is release
governance and maintainer-load policy layered on existing project docs, not a
new OS architecture or runtime design.
Proposal: Repository Composition
How capOS should be split across repositories so that the public capability-OS claim, the kernel review queue, and the security/release cadence stay recognizable as the project grows beyond a single private workspace.
Purpose
capOS currently lives in a single private repository that mixes the kernel, the userspace runtime, the native shell, generic capability/IPC/ring demos, the Aurelian Frontier game, an academic whitepaper draft, the public docs site sources, and proposals for protocol stacks, language runtimes, GPU support, cloud images, and other future tracks.
That packing is acceptable while the project is private and agent-driven: one workspace, one review loop, one history. It is not the right shape once capOS becomes public. A single-repository public capOS would conflate unrelated scopes, drag unrelated tracks through one review queue, attach the OS security posture to product-shaped surfaces, and force unrelated release cadences to share one tag stream.
This proposal defines:
- what the public capOS core repository should defend (the scope rule);
- what should ship in sibling repositories that depend on capOS;
- the criteria for when a track is ready to split;
- the cross-repository mechanics that keep splits honest.
It generalizes the “Repository Hygiene Gates” of
docs/proposals/public-release-boundaries-proposal.md. The adventure split
and the curated git-history rewrite remain release gates in that proposal;
this proposal explains why those gates exist and how the same rule applies
to other tracks over time.
Non-Goals
- This proposal does not require splitting any track on a deadline beyond
the explicit release gates already named in
docs/proposals/public-release-boundaries-proposal.md. It defines a rule, not a calendar. - It does not redesign the capability model, schema, or kernel/runtime boundary. Those are owned by the relevant subsystem proposals.
- It does not propose a multi-organization governance model. capOS may remain a single-maintainer or small-team project across multiple repositories.
- It does not propose mirroring sibling repositories back into capOS. Once a track has split, capOS does not re-vendor it.
- It does not promise a public chat or coordination forum for cross-repo work; that follows the launch phases in the public-release proposal.
Scope Rule For The Core Repository
The capOS core repository defends a narrow, recognizable claim. A track belongs in the core repository when at least one of the following is true:
- removing it would weaken a capability-OS invariant the kernel or runtime currently enforces;
- removing it would delete a proof the documented review process relies on;
- it is part of the minimum surface required to boot capOS in QEMU and exercise the documented capability/IPC/ring/scheduling/security invariants.
A track does not belong in the core repository when its primary purpose is product, protocol, or language-runtime work that happens to run on capOS, even when it currently shares a workspace with the kernel.
In practice the core repository should contain:
- the schema definitions and the generated bindings the kernel and
runtime rely on (
schema/,capos-abi/,capos-lib/,capos-config/); - the kernel itself, including arch-specific code under
kernel/src/arch/; - the userspace runtime contract that consumes the schema (
capos-rt/,init/,shell/); - the manifest and code-generation tooling needed to boot and build
capOS (
tools/mkmanifest/,tools/capnp-build/); - demos that exist only to exercise core capability/IPC/ring/scheduling/ trust-boundary invariants, not application-shaped product surfaces;
- the security boundary, verification workflow, trusted-build-input, panic-surface inventory, and authority-accounting/transfer design documents;
- the core proposals describing the OS itself: capability model, IPC, error handling, scheduling, SMP, networking architecture (high level), storage and naming (high level), service architecture, security and verification, formal MAC/MIC, live upgrade design, threading, key-management abstractions, user identity and policy abstractions;
docs/changelog.md,docs/roadmap.md,docs/tasks/README.md,REVIEW.md, and migrated review-finding task records for the narrowed core scope;- the documentation site sources that describe the core scope. The deployment of those sources can be a sibling concern (see “Public Website And Hosted Demos” below); the sources themselves stay with the OS they describe.
Tracks That Should Eventually Move Out
The following tracks already exist or are planned, and each one is or will become a candidate for a sibling repository. Each carries its own scope statement, security posture, maintainer load profile, and release cadence that should not be merged into capOS’s core scope statement.
The list is descriptive, not a queue. A track moves only when the split criteria below are satisfied for that track.
Adventure Ecosystem
Server, client, NPC processes, content generator, content blobs,
adventure-named manifests/run targets, adventure proposals/backlog/demo
docs, contributor-quest mechanics. The split is already a release gate in
docs/proposals/public-release-boundaries-proposal.md. The dedicated
sibling is capos-adventure. Any capOS invariant currently exercised
only by an adventure demo needs a non-adventure equivalent in capOS, or
moves into capos-adventure together with its proof.
Whitepaper And Academic Publication
papers/schema-as-abi/ is a Typst project, and docs/paper/plan.md,
docs/paper/outline.md, and docs/paper/evidence-gaps.md are paper
planning documents. Academic publication has its own review cycle,
publication venue, citation cadence, and corrections process that should
not share the OS’s tag stream. A capos-paper repository can cite capOS
by tag or commit, track evidence-gap closure, and run paper-specific
build/CI without expanding the OS repository’s review surface.
docs/changelog.md and proof-evidence narrative remain in capOS so the
paper has a stable reference target.
Public Website And Hosted Demos
The public landing page, marketing-shaped copy, hosted-demo deployment scripts (Cloudflare Pages glue, container images, CI for the public site, hosted WebShellGateway and adventure-demo deployment) are operational concerns with public-traffic implications. They should not share a release cadence with kernel changes, and their incident response must not pull on kernel review capacity.
mdBook content describing the OS itself stays in capOS. The deployment
of that content as a public site can move to a sibling repository (for
example capos-site) that depends on the capOS docs sources by tag or
commit. The hosted public WebShellGateway or adventure-demo deployment
follows Phase D of the public-release proposal and lives outside capOS.
Userspace Network Stack And NIC Drivers
The current QEMU smoke path keeps smoltcp, virtio-net, the line
discipline, and the Telnet IAC filter inside kernel/. Once the
userspace driver authority gate (docs/dma-isolation-design.md) lands
and the userspace TCP/IP stack and NIC drivers leave the kernel
(docs/proposals/networking-proposal.md Phase C), the resulting
userspace components are large enough and carry enough independent
attack surface to live in capos-net. capOS keeps the kernel-side
DMA/MMIO/interrupt authority gates and the schema/ABI of the network
capabilities; the implementation of the stack is a downstream consumer.
Production Remote-Access Services
The host-local Telnet demo is research evidence for the
TerminalSession / SessionManager / AuthorityBroker /
RestrictedShellLauncher boundary; it stays in capOS. The host-local
SSH Shell Gateway research demo similarly stays as long as it is a
host-local research artifact under
docs/proposals/ssh-shell-proposal.md.
The production successors – a real OpenSSH-protocol gateway with
production host-key management, persistent authorized-key/account
storage, channel policy, audit, and remote-traffic threat model, and any
production WebShellGateway with browser-side session UI and public
moderation policy – are product-shaped services. They belong in
dedicated repositories (for example capos-ssh-gateway,
capos-web-shell) once they outgrow the host-local research surface.
Protocol Stacks Built On Key-Management Primitives
TLS/X.509, OIDC/OAuth2, ACME, OCSP, CT log handling, DPoP, workload
identity federation, and similar large protocol surfaces described in
docs/proposals/certificates-and-tls-proposal.md and
docs/proposals/oidc-and-oauth2-proposal.md should ship as sibling
repositories (for example capos-tls, capos-oidc) consuming the capOS
key-management primitives. Their CVE response, dependency surface, and
review queue should not be merged into the OS core’s. capOS keeps the
abstract SymmetricKey, PrivateKey, KeySource, KeyVault, and
audit primitives from
docs/proposals/cryptography-and-key-management-proposal.md; the
protocol stacks are downstream consumers.
Language Runtimes And Toolchain Ports
Go (GOOS=capos), libc / libcapos, WASI, Lua, and any future language
runtime port belong in dedicated repositories (capos-go,
capos-libc, capos-wasi, capos-lua, …). Language-runtime
releases follow upstream language cadence, and porting work should not
block kernel review. The capOS userspace ABI documented in
capos-rt/, capos-abi/, and the schema is the contract these ports
target.
GPU And CUDA Capability Integration
The GPU capability work in docs/proposals/gpu-capability-proposal.md
brings a large external driver and toolkit dependency surface, vendor
runtime distribution constraints, and hardware-specific testing needs.
When implementation begins it belongs in a dedicated capos-gpu
repository. capOS keeps the abstract device-authority gate and the
relevant capability schema; vendor-specific glue and toolkit packaging
is downstream.
LLM And Agent Runtime
The agent shell tool runner, model bindings, on-ISO local-model
packaging, and provider-specific glue from
docs/proposals/llm-and-agent-proposal.md and
docs/proposals/realtime-voice-agent-shell-proposal.md carry
independent supply-chain, content-policy, and operational concerns.
Provider TOS, model weight redistribution, and content-safety reviews
do not belong on the kernel review queue.
The shell capability and authority model – including how the agent
shell’s per-tool consent/step-up/forbidden modes consume broker-issued
capabilities – stays in capOS. The agent runner itself, the model
bindings, the on-ISO local-model packaging, and the provider glue ship
in a dedicated repository when implementation begins (for example
capos-agent-shell).
Cloud Images And Instance Bootstrap
Cloud VM image building, AWS/GCP/Azure packaging, NVMe and cloud-NIC
integrations, and the cloud-metadata bootstrap from
docs/proposals/cloud-deployment-proposal.md and
docs/proposals/cloud-metadata-proposal.md are operational
image-building concerns with cloud-vendor dependency exposure. They
should live in a capos-cloud-images repository that consumes capOS
releases as inputs.
Volume Encryption And KMS Integration
The encryption-at-rest work from
docs/proposals/volume-encryption-proposal.md will pull in cloud KMS
clients, key-rotation policy, and cryptographic dependency exposure
that should ship in a dedicated capos-volume-crypto (or similarly
named) repository. The abstract key-management contracts and the
storage-side authority gates remain in capOS.
Hosted Demo Tooling, Logs, And Operational Glue
Anything that is part of operating a public capOS deployment – session-quota policy, browser-side WebShellGateway UI, public landing copy, hosted log/metric pipelines, abuse-mitigation glue, public moderation tooling – is operational rather than OS work. It should live with the relevant sibling (for example public website or WebShellGateway service repositories) rather than inside capOS.
Tracks That Stay In The Core Repository
These tracks are intrinsic to the OS claim and should not be considered split candidates:
- the kernel, including arch-specific code under
kernel/src/arch/; - the schema definitions and generated bindings;
- the userspace runtime (
capos-rt),init, the native shell, and the manifest tools needed to boot capOS; - demos that exercise core capability/IPC/ring/scheduling invariants:
capset-bootstrap,console-paths,ring-corruption,ring-reserved-opcodes,ring-nop,ring-fairness,endpoint-roundtrip,ipc-server,ipc-client,terminal-session,terminal-stranger,tls-smoke(the TLS userspace runtime smoke, not protocol stack),virtual-memory,timer-smoke,timer-flood,ipc-zerocopy-demo, and any future demo that exists only to exercise a core capability invariant; - the chat demo as a generic IPC and service-object example may stay, but only in a form that defends a capability-OS invariant. Game-shaped chat features (named NPC actors, contributor-quest framing, adventure-tied identity flows) follow the adventure split;
- the security boundary, verification workflow, trusted-build-input, panic-surface inventory, authority-accounting/transfer design, and DMA-isolation design documents;
- the core proposals listed in the “Scope Rule” section above;
docs/changelog.md,docs/roadmap.md,docs/tasks/README.md,REVIEW.md, and migrated review-finding task records for the narrowed core scope;docs/research/, because each research note grounds a current capability-OS design decision; research notes that grow into full proposals follow the relevant subsystem.
When To Split A Track
A track should not be split prematurely. While a track lives only in proposal documents or a small experimental crate, the friction of a sibling repository (separate CI, separate review setup, separate license and security policy, cross-repo version pinning) outweighs the benefit.
The right time to split is when all of the following are true for the track:
- Independent product or protocol shape. The track has a recognizable purpose that is not “exercise a capOS invariant”. For example, a TLS stack, a Go port, a hosted public demo, or a game.
- Non-trivial implementation surface. The track draws review attention away from kernel review or carries an independent dependency surface large enough to need its own dependency-policy/audit posture.
- Defensible cross-repo dependency direction. The sibling can build against a tagged or pinned capOS reference without modifying capOS internals; the inverse direction (capOS depending on the sibling for a core invariant proof) is not required.
- Independent release cadence is desirable. The track wants its own tag stream, security advisory channel, or upstream synchronization schedule.
When any of these is missing, the track stays in the core repository or remains a proposal until it is ready.
A useful counter-test: would a public reader looking at the core capOS README, security policy, and release notes be misled by the presence of this track? If yes, that is a sign the scope statement is being stretched and the track is overdue to split. If a reader would not notice, the benefit of splitting is small.
Cross-Repository Mechanics
When a sibling repository is created, the following mechanics apply.
GitHub Organization Placement
The capOS core repository currently lives under a personal GitHub account. Once one or more siblings exist, hosting them all under the same personal account conflates personal projects with the capOS project, makes maintainer-set changes harder, and gives a confusing public landing surface for readers looking for the project.
The intended landing place for capOS and its siblings is a dedicated
GitHub organization, cap-os-dev. Concretely:
- the curated public-import history defined by the history-rewrite gate
in
docs/proposals/public-release-boundaries-proposal.mdis published as a freshcap-os-dev/caposrepository when the organization is used. A GitHub repository transfer or fork from the current private capOS repository is not the intended mechanism, because it would carry the existing private uncurated history, branches, refs, and intermediate automation state into the public organization. The current private repository may continue to exist as an internal mirror after publication, but it is not the same repository as the public one; - siblings are created under
cap-os-dev/<sibling>rather than under any individual maintainer’s account; for examplecap-os-dev/capos-adventure,cap-os-dev/capos-paper,cap-os-dev/capos-site,cap-os-dev/capos-net,cap-os-dev/capos-ssh-gateway,cap-os-dev/capos-web-shell,cap-os-dev/capos-tls,cap-os-dev/capos-oidc,cap-os-dev/capos-go,cap-os-dev/capos-libc,cap-os-dev/capos-wasi,cap-os-dev/capos-lua,cap-os-dev/capos-gpu,cap-os-dev/capos-agent-shell,cap-os-dev/capos-cloud-images,cap-os-dev/capos-volume-crypto; - repository names listed in this proposal and in
docs/proposals/public-release-boundaries-proposal.mdare intent names, not reservations. Final naming happens at the moment a sibling is actually created and may collapse, rename, or skip entries based on what the project actually needs.
Using a dedicated organization also makes the public-release maintainer boundaries easier to enforce: organization-level security policy, issue-template defaults, branch-protection settings, and team membership apply consistently across capOS and its siblings without per-repository drift.
The org adoption is not a blocker for the public-release hygiene gates:
the adventure split and history rewrite from
docs/proposals/public-release-boundaries-proposal.md are the
release-blocking gates, and they can land regardless of whether the
public-import history is first published under cap-os-dev/capos or
temporarily under another account. cap-os-dev is, however, the
recommended public landing surface, and once it is used, public-facing
materials should point at the organization rather than at any
individual maintainer’s account.
Dependency Direction
- The sibling depends on capOS by tag, commit, or other pinned reference; it does not depend on capOS by path-dependency into a private workspace.
- capOS does not depend on the sibling for any core invariant or proof. capOS may declare an optional release artifact from a sibling (for example a packaged adventure demo image) when an end-to-end story requires it, but the artifact must be a declared release input, not a path link.
- When a sibling demonstrates a capOS invariant by running on it, the sibling records the capOS reference (tag or commit) it was tested against, and the sibling carries the proof, not capOS.
Per-Repository Hygiene
- Each sibling repository owns its own license,
CONTRIBUTING.md,SECURITY.md, issue/PR templates, and review-capacity statement, even when the initial maintainer set overlaps with capOS. - Each sibling repository owns its own scope statement and public claim list. Public capOS claims do not extend over sibling content; sibling claims do not extend over capOS.
- Generated artifacts, content blobs, and large binaries belong with the sibling that owns the source they describe, never with capOS unless capOS itself produced them.
Documentation Location Rule
- Documentation about a sibling lives in the sibling. capOS may keep a
short pointer in
docs/proposals/index.md, the README, or a release-notes section so readers can find the sibling, but it does not duplicate sibling-internal proposals, backlog, or roadmap state. - Cross-repo planning that is privately coordinated must still respect the public-release rule that “if a task is public, its active status lives in one place”; capOS does not maintain a public mirror of sibling task state.
Security Coordination
- During a transition phase, security reports affecting capOS and a
sibling are coordinated through the capOS
SECURITY.mdcontact, with downstream siblingSECURITY.mdfiles pointing back to that contact until the sibling has its own staffed response. - Once a sibling has a staffed security response, its
SECURITY.mdbecomes authoritative for sibling-only issues, and only cross-cutting reports require coordination. - Neither capOS nor a sibling promises a security-fix SLA at the research-software stage; the capOS security statement language remains the baseline.
Release And Tagging
- Each sibling owns its own release cadence and tag stream.
- A sibling release that requires a specific capOS revision pins it explicitly in the sibling’s release notes.
- capOS releases do not promise sibling availability or compatibility beyond “the schema and userspace ABI used by sibling X at tag Y are what capOS at tag Z provides”.
History At Split Time
- A split should not silently remove evidence. Before a sibling becomes the authoritative location for a track, the relevant proofs, demos, and documentation must be present and reviewed in the sibling.
- The capOS history rewrite specified in
docs/proposals/public-release-boundaries-proposal.mddoes not need to preserve the pre-split track history inside capOS. The sibling’s history begins at split time with whatever curated initial state the sibling chooses to publish. - The capOS
docs/changelog.mdcontinues to record completed capability-OS milestones; sibling milestones are recorded in the sibling.
Migration Approach
The split is gradual and gated by readiness, not by a release calendar beyond the explicit public-release prerequisites.
The intended order is:
- Adventure ecosystem – gated by the public-release adventure-split gate. This is the first concrete instance of the rule and produces a reusable pattern (cross-repo dependency direction, sibling hygiene, documentation pointers) for later splits.
- Whitepaper / academic publication – when the paper is ready to accept public review, or when its evidence-gap log starts to drive review cycles independent of the kernel review queue.
- Public website and hosted-demo deployment – when a hosted demo becomes a real operational milestone (Phase D of the public-release proposal) rather than a research artifact.
- Userspace network stack and NIC drivers – after the userspace driver authority gate lands and the in-kernel networking surface shrinks to the kernel-side authority gates.
- Production remote-access services, protocol stacks, language runtimes, GPU, LLM/agent, cloud images, volume encryption – as their implementations begin and meet the split criteria.
Splits earlier in this list set the precedent for splits later in the list. If the adventure split is messy, later splits should learn from it before being attempted.
Anti-Goals
- Do not split the kernel. The kernel is one repository. Architecture
layers (
kernel/src/arch/<arch>/) stay inside capOS; aarch64 and other ports stay in-tree. The split rule is about distinguishing the OS from applications, protocols, and language runtimes, not about cutting the kernel into micro-repos. - Do not split userspace runtime internals.
capos-rt,init, and the native shell stay together because they share the userspace ABI contract. - Do not vendor sibling repositories back into capOS. Once a track has split, capOS does not re-import it as a path or vendored copy. Cross-repo coordination uses tags and pinned references, not vendoring.
- Do not split for marketing reasons alone. The split criteria are about protecting review capacity, security posture, and the public scope statement. Splitting only to project a larger ecosystem surface area without staffed maintenance is not allowed.
- Do not block on a perfect split plan. A track that meets the split criteria can be moved with the minimum mechanics described above. Cross-repo mechanics will improve incrementally; waiting for an ideal model before any split is its own failure mode.
Open Questions
- Where should the chat demo end up after the adventure split? It is partly generic IPC scaffolding and partly application-shaped (chat rooms, message history). The current intent is that a generic capability-IPC chat surface stays in capOS as a service-object proof, while game-shaped chat features follow adventure. The exact line is not yet drawn.
- How should
docs/research/be treated long term? Each note grounds a current design decision, so it stays in capOS. If research notes proliferate after public release, a curateddocs/research/index.mdmay be enough to keep them navigable without splitting them out. - Should the mdBook docs sources and the docs site deployment be in the same repository or split? The current intent is that the sources stay in capOS while the deployment can move to a sibling. Whether that split is worth doing before a hosted demo exists is open.
- How should cross-repo CI evidence be presented when a paper or a service repository wants to cite a capOS proof run? A simple “tested against capOS commit X” record is the baseline; richer attestation can be added later if the project needs it.
- When is the right moment to publish a sibling’s first release? Sibling-internal readiness criteria belong in the sibling; capOS does not gate sibling releases beyond the cross-repo mechanics described here.
Design Grounding
Grounding files for this proposal:
README.mddocs/tasks/README.mdREVIEW.mddocs/roadmap.mddocs/changelog.mddocs/proposals/public-release-boundaries-proposal.mddocs/proposals/aurelian-frontier-proposal.mddocs/proposals/contributor-quest-mechanics-proposal.mddocs/proposals/networking-proposal.mddocs/proposals/ssh-shell-proposal.mddocs/proposals/shell-proposal.mddocs/proposals/boot-to-shell-proposal.mddocs/proposals/cloud-deployment-proposal.mddocs/proposals/cloud-metadata-proposal.mddocs/proposals/cryptography-and-key-management-proposal.mddocs/proposals/certificates-and-tls-proposal.mddocs/proposals/oidc-and-oauth2-proposal.mddocs/proposals/llm-and-agent-proposal.mddocs/proposals/realtime-voice-agent-shell-proposal.mddocs/proposals/gpu-capability-proposal.mddocs/proposals/go-runtime-proposal.mddocs/proposals/userspace-binaries-proposal.mddocs/proposals/volume-encryption-proposal.mddocs/proposals/storage-and-naming-proposal.mddocs/proposals/security-and-verification-proposal.mddocs/proposals/mdbook-docs-site-proposal.mddocs/security/trust-boundaries.mddocs/security/verification-workflow.mddocs/dma-isolation-design.mddocs/trusted-build-inputs.md
No docs/research/ report is directly applicable. This proposal is
project-composition policy layered on existing capOS architecture, not a
new OS architecture or runtime design.
Proposal Group Archive
This page is retained as a compact grouping aid for older links and sidebar navigation. The canonical status table is Proposal Index; update that page first when a proposal changes role.
The public sidebar now nests proposal documents under the proposal index instead of exposing every long-form design page as a top-level entry.
Active Support
| Proposal | Status | Purpose |
|---|---|---|
| mdBook Documentation Site | Partially implemented | Defines the documentation site structure, status vocabulary, and curation rules for architecture, proposal, security, and research pages. |
Future Runtime And Deployment
| Proposal | Status | Purpose |
|---|---|---|
| Go Runtime | Future design | Plans a custom GOOS=capos userspace port and runtime services for Go programs. |
| Lua Scripting | Partially implemented | Defines Lua as a capability-scoped userspace runner with curated libraries and exact grants. Phase 0 and Phase 1 host bindings are in tree; Phase 2+ remains future work. |
| Cloud Metadata | Future design | Describes cloud bootstrap inputs and manifest deltas without importing cloud-init. |
| Cloud Deployment | Partially implemented | Records QEMU boot, ACPI/PCI/MSI-X discovery, the landed cloudboot image/harness, and the first GCP imported-image serial-console boot proof. Provider NIC/storage drivers, cloud clocking, AWS/Azure proofs, and aarch64 deployment remain future work. |
| Browser/WASM | Future design | Explores a browser-hosted capOS model using WebAssembly and workers. |
Future Security, Policy, And Lifecycle
| Proposal | Status | Purpose |
|---|---|---|
| User Identity and Policy | Partially implemented | Defines user/session identity and policy layers over capability grants. Current implementation covers anonymous/operator/guest UserSession metadata, bootstrap credential/session flows, broker-issued shell bundles, and seed-account configuration; durable accounts, external bindings, session revocation, quotas, and broader ABAC/MAC remain future work. |
| Cryptography and Key Management | Future design | Defines key, signing, encryption, and vault capabilities for later security services. |
| Certificates and TLS | Future design | Defines X.509, trust store, ACME, and TLS configuration capabilities. |
| OIDC and OAuth2 | Future design | Defines federated login, OAuth2 clients, token capabilities, and broker integration. |
| Volume Encryption | Future design | Defines encryption-at-rest for system and user volumes. |
| System Monitoring | Future design | Defines scoped observability capabilities for logs, metrics, traces, health, status, crash records, and audit. |
| Formal MAC/MIC | Future design | Defines a formal access-control and integrity model for later proof work. |
| Live Upgrade | Future design | Designs service replacement while preserving handles, calls, and authority. |
| GPU Capability | Future design | Sketches isolated GPU device, memory, and compute authority. |
Future Domains
| Proposal | Status | Purpose |
|---|---|---|
| Language Models and Agent Runtime | Future design | Defines model, embedding, and agent-runner capabilities. |
| Realtime Voice Agent Shell | Future design | Extends the agent-shell path for realtime voice and media sessions. |
| capOS As A Robot Brain | Future design | Defines capability-oriented robotics service graphs and actuator boundaries. |
| Contributor Quest Mechanics | Future design | Defines contribution-linked game badges and bounded perks. |
| Public Release and Maintainer Boundaries | Future design | Defines public release posture and maintainer-load boundaries. |
Rejected Or Superseded
| Proposal | Status | Purpose |
|---|---|---|
| Endpoint Badges as Service Identity | Rejected | Post-mortem for the seL4-style endpoint badge identity model that was superseded by Service Object Capabilities, then by Session-Bound Invocation Context. |
| Service Object Capabilities | Superseded | Historical service-minted object capability model; the landed synthetic routing/lifecycle proof remains low-level coverage, but the implemented replacement is Session-Bound Invocation Context. |
| Cap’n Proto SQE Envelope | Rejected | Records why ring SQEs stay fixed-layout transport records instead of becoming Cap’n Proto messages themselves. |
| Sleep(INF) Process Termination | Rejected | Records why infinite sleep should not replace explicit process termination, while preserving typed status and future sys_exit removal as separate lifecycle work. |
Rejected Proposal: Endpoint Badges as Service Identity
Status
Rejected. This was the short-lived seL4-style model where a capability hold
edge carried a u64 badge and endpoint servers used that badge as the
service-visible caller identity.
The model was superseded by Service Object Capabilities, which reframed the badge field as an opaque receiver selector owned by a service object capability. That proposal is also superseded: the active direction is Session-Bound Invocation Context, where each process has one immutable session context and endpoint calls expose privacy-preserving caller-session metadata instead of caller-selected badges or service-object identity migration.
This document records what badges were, how they were intended to be used, what was implemented, and why the design was rejected.
Proposal
Add a word-sized badge to each capability hold edge and deliver that value to an endpoint server whenever the holder invokes the endpoint. Multiple clients could therefore share one endpoint object while the server still distinguished them:
endpoint object
client cap hold badge 100 -> chat participant 100
client cap hold badge 200 -> chat participant 200
client cap hold badge 302 -> adventure player 302
The model came from seL4’s endpoint badge and mint pattern. A trusted holder of an endpoint owner capability could mint differently badged client facets for children or services. Copy and move transfer preserved the badge, so delegation kept the same service-visible identity unless a trusted mint path created a fresh one.
Intended Use
Badges were intended to solve a real early shared-service problem: chat, adventure, stdio bridges, and endpoint smokes needed more than one logical client on a resident service endpoint. Creating one kernel endpoint per client was unnecessary overhead for the demo stage, and putting a caller name or role in request bytes would have been trivial to spoof.
The intended rules were:
- a badge is not a generic rights bitmask;
- a badge is hold-edge metadata, not part of the endpoint object;
- endpoint CALL delivery reports the invoked hold badge to the server;
- copy and move transfer preserve the badge;
- raw spawn grants preserve the source badge;
- endpoint owners and ProcessSpawner-created parent endpoint result facets may mint a requested child client badge;
- delegated client facets may be passed on only with the same badge.
Under that model, a chat server could key membership by badge and an adventure server could key per-player room/inventory state by badge. The badge was meant to be server-visible caller identity, not a user-facing permission flag.
Implementation Specifics
The concrete implementation landed in several steps:
- Commit
3ee5240(feat: propagate endpoint capability badges, 2026-04-22) addedCapRef.badgeto the manifest schema, parsed optional CUEbadgefields, stored the value inCapHold, and changed endpoint CALL dispatch socall.badgecame from the invoked capability slot. The cross-process IPC smoke asserted a nonzero badge on RECV and RETURN completions. - Commit
df0d140(feat: add spawn grant badge attenuation, 2026-04-22) addedCapGrant.badgeto the ProcessSpawner ABI. Raw grants failed closed if the requested badge differed from the source hold.ClientEndpointgrants could mint the requested badge only from an endpoint owner source. The init spawn proof printed[init] Spawn badge attenuation ok.after exercising the path. - Commit
2face05(demos: extract badged endpoint service loop, 2026-04-24) extractedserve_badged_endpointintodemos/service-common/. The helper performed endpoint RECV, released unexpected transferred caps, decoded params, and called service handlers ashandle_request(state, badge, method_id, params). - Chat and adventure used that helper to route per-client service state by
badge. Manifest examples such as
system-chat.cueandsystem-adventure.cuecarried explicit badge values for shared-service clients and NPC/client identities. - Commit
3e59540(fix: narrow endpoint result badge minting, 2026-04-25) stopped treating every endpointResultCapas trusted badge minting authority. Only endpoint owners and ProcessSpawner-created parent endpoint result facets retained mint authority; ordinary IPC result transfers stayedResultCapand could not become a badge-mint path. - Commit
f955cd5(fix: reject delegated endpoint relabeling, 2026-04-25) fixed the first containment failure: already-delegated client facets could no longer request a different badge throughClientEndpointspawn grants. - Commit
a64c216(spawn: preserve delegated endpoint identities, 2026-04-25) fixed the shell/defaulting case. Omitted shell badge syntax began preserving the source badge viaPRESERVE_CLIENT_ENDPOINT_BADGE = u64::MAX, while explicit relabel attempts and low-level legacy badge-zero encodings failed closed for delegated client facets.
The final contained implementation still has a badge field in several ABI and
implementation structs. Current docs call it legacy receiver metadata or a
receiver selector when it is still needed for low-level tests, service-object
history, or non-identity parameters such as scoped TCP listen ports. It is no
longer the target identity model.
What Failed
The design gave too much meaning to an untyped number. Even when the kernel preserved badges across copy/move transfer, shell and spawn surfaces could still turn a caller-selected integer into service-visible identity unless every grant path handled mint authority perfectly.
The concrete failure was delegated endpoint relabeling. A shell holding a
delegated chat client endpoint could request:
run "chat-client" with { chat: client @chat badge 200 }
Before the containment fixes, that could produce a child client facet whose
service-visible identity differed from the delegated source. Omitted badge
syntax was also dangerous because the old parser defaulted it to badge 0,
which was another relabeling path for a nonzero source client.
The bug was narrow, but it exposed the wrong abstraction. The server was being
asked to treat a generic transport field as identity. The kernel could enforce
some mint rules, but the meaning of 100, 200, 302, or 0 lived in each
service by convention. That made ordinary shell syntax look like an authority
selector and made future network-backed shell exposure too easy to get wrong.
Rationale For Rejection
Endpoint badges are a useful low-level routing mechanism, but not a good service identity model for capOS.
Problems:
- Caller-selected identity pressure. The natural user-facing syntax was
client @service badge N, which invited users and tests to select service identity directly. - Untyped service semantics. The same
u64field could mean a chat member, an adventure player, an NPC, a stdio bridge, a TCP port, or a test fixture. The kernel could not validate those meanings. - Policy by convention. Each service had to remember whether a badge was a participant, a session, a role, a receiver cookie, or just a transport tag.
- Delegation hazards. Copy/move propagation was straightforward, but spawn minting needed subtle distinctions between endpoint owners, ProcessSpawner-created parent endpoint result facets, ordinary IPC result caps, and delegated client facets.
- Bad privacy shape. A server-visible endpoint field encouraged exposing stable caller identity by default, while the active model wants privacy-preserving session references and explicit bounded disclosure.
- Poor long-term composition. Cross-service and network-transparent designs need typed roots/facets, session context, transfer policy, and disclosure policy. A single badge value cannot carry those contracts.
The accepted historical fix was first to contain relabeling, then to stop treating badges as the target architecture. Service Object Capabilities moved identity into service-minted object capabilities and receiver selectors. That was still too much machinery for normal workload identity and was replaced by Session-Bound Invocation Context.
Replacement Direction
The active replacement is:
- capabilities answer whether the process may invoke a service at all;
- each process has exactly one immutable session context;
- endpoint delivery carries privacy-preserving caller-session metadata by default;
- richer subject disclosure requires an explicit request and a matching broker/service disclosure scope;
- shared services key user-facing state by broker-granted service capabilities plus service-scoped session references, not by caller-selected badges.
Legacy badge fields may remain as internal receiver metadata, hostile-test fixtures, or non-identity configuration encodings until the corresponding code paths are migrated. They should not appear as normal user-facing service identity syntax.
Design Grounding
Project files read for this post-mortem:
docs/capability-model.mddocs/architecture/ipc-endpoints.mddocs/proposals/service-object-capabilities-proposal.mddocs/proposals/session-bound-invocation-context-proposal.mddocs/authority-accounting-transfer-design.mddocs/research/capability-systems-survey.mddocs/security/trust-boundaries.mddocs/tasks/README.mddocs/tasks/README.md
Relevant research:
docs/research/sel4.md
The historical badge model followed the seL4 badge/mint precedent recorded in the repo research notes. The rejection is capOS-specific: schema-typed interfaces, session-bound process identity, broker-issued service authority, and privacy-bounded disclosure fit the project better than making a generic endpoint metadata word carry service identity.
Proposal: Service Object Capabilities
Status: Superseded by Session-Bound Invocation Context. This document remains as historical design context for the already-landed synthetic routing/lifecycle proof. Do not continue the subject/proof root-opening or shared-service service-object migration from this proposal.
Replace caller-selected endpoint identity with service-minted object capabilities.
Problem
Endpoint client metadata currently carries service identity. A client endpoint is a capability plus a caller-visible numeric tag; services can use that tag as a member, session, role, connection, or actor key. That is too close to a permission bitmask or ambient label: the generic IPC substrate accepts an untyped number, and each service has to assign security meaning by convention.
The pre-containment problem became concrete through shell spawn syntax. A shell
that held a delegated chat client endpoint could request:
run "chat-client" with { chat: client @chat badge 200 }
The launcher path could then pass the child a client facet with a different value than the shell originally held. Gate 0 now rejects that relabeling for ordinary delegated client facets, but the chat example remains the reason the broader migration exists: service authority should not depend on a caller-selected numeric tag.
capOS needs multi-client services, per-client state, service-created attenuation, audit subject binding, and shell-spawned children. It does not need caller-selected numeric identities.
Goals
- Make the capability object itself carry the service authority.
- Keep endpoint transport generic while avoiding generic service roles, permission bits, or caller-selected labels.
- Let services expose many logical objects through one resident process.
- Let services bind subject/audit information at object creation time without putting identity policy into the kernel IPC fast path.
- Preserve explicit transfer semantics: copy and move pass the same object authority unless a trusted minter creates a new one.
- Provide a staged migration path for current chat, adventure, stdio, and endpoint smokes.
Non-Goals
- A POSIX credential model.
- PID, PID@host, UID, role strings, or host names as service authority.
- Kernel interpretation of chat rooms, moderators, players, sessions, or principals.
- Generic per-capability permission bitmasks.
- Full network-transparent object references in the first slice.
Design
An Endpoint remains a transport queue owned by a server process. Ordinary
clients should not hold “endpoint plus badge”; they should hold a capability
to a service object exported by that server.
Design Grounding
Project files read for this design:
docs/capability-model.mddocs/architecture/ipc-endpoints.mddocs/proposals/service-architecture-proposal.mddocs/proposals/shell-proposal.mddocs/proposals/interactive-command-surface-proposal.mddocs/backlog/stage-6-capability-semantics.mddocs/backlog/runtime-network-shell.mddocs/backlog/shared-service-demos.mddocs/security/trust-boundaries.mddocs/authority-accounting-transfer-design.mddocs/tasks/README.md
Relevant research:
docs/research/sel4.mddocs/research/eros-capros-coyotos.mddocs/research/genode.md
The design deliberately supersedes the prior seL4-style badge/mint direction for service identity. Genode’s RPC/session object model is a closer fit for capOS services: clients hold capabilities to service-created objects, while delegation passes the same object authority. EROS/CapROS/Coyotos and the authority-accounting design reinforce the rule that authority should remain in the capability graph, not in caller-selected numeric metadata.
Examples by service:
Chat service:
ChatRoot
ChatParticipant
ChatRoom
ChatModerator
Terminal/child I/O service:
StdIO
Adventure service:
AdventurePlayer
AdventureNpc
Each object capability has one interface ID and one server-selected receiver selector. The receiver selector is opaque to the client. It is not a user field, not shell syntax, and not a policy label. It exists only so the kernel can route the call to the resident server and the server can dispatch it to the right object state.
Conceptually:
service object cap = target endpoint + interface id + opaque receiver selector
Only trusted minting paths may create a new receiver selector:
- the endpoint owner/server,
- a supervisor or broker that holds explicit mint authority from the server,
- transitional manifest/init wiring for boot services.
Copying or moving a service object cap preserves the same receiver selector. An ordinary client cannot relabel a delegated cap into a sibling object.
Subject / Proof Binding
A service should be able to learn who or what a service object represents, but that subject must be bound through trusted issuance rather than caller payload claims.
The general shape is:
interface Subject {
deriveProof @0 (request :DelegationRequest) -> (proof :SubjectProof);
}
interface SubjectProof {
attest @0 (challenge :Challenge) -> (statement :SubjectStatement);
}
interface ServiceRoot {
open @0 (proof :SubjectProof, request :OpenRequest)
-> (object :ServiceObject);
}
interface ServiceObject {
call @0 (request :Request) -> (response :Response);
}
UserSession is the interactive-user case and can derive a proof scoped to a
service root, request digest, audience, and freshness window. A service account,
workload identity, broker-issued proof, anonymous session, guest session, or
other typed subject cap can fill the same role when that is the right trust
boundary. The root/factory validates the proof through a verifier, broker,
account, audit, or application policy interface it was granted, stores verified
metadata in its own object table, and returns a service object cap.
Later calls on the returned object do not need caller-supplied identity. Possession of that object cap is the authority. The service can still record principal/session audit identifiers, display names, channel memberships, quota/accounting state, moderation state, workload labels, or other policy metadata internally, but those records are service state, not endpoint metadata that the caller can edit.
Example Chat Shape
The current chat service uses one Chat endpoint and maps legacy endpoint
metadata to members. The target model is a root/factory plus participant
objects.
interface ChatRoot {
join @0 (channel :Text, handle :Text, session :UserSession)
-> (participant :ChatParticipant);
}
interface ChatParticipant {
join @0 (channel :Text) -> (joined :Bool);
leave @1 (channel :Text) -> (left :Bool);
send @2 (channel :Text, text :Text) -> (sent :Bool);
who @3 (channel :Text) -> (members :List(Text));
poll @4 (maxEvents :UInt16) -> (events :List(ChatEvent));
close @5 () -> ();
}
interface ChatModerator {
kick @0 (participant :ChatParticipant, channel :Text) -> (kicked :Bool);
}
ChatParticipant is the participant authority. If a child process receives
that cap, it acts as that same participant. It cannot type another receiver
selector and become another participant.
Moderator behavior is a separate cap/interface. The service may internally associate participant and moderator state with the same subject, but the kernel does not provide a role field and the client does not choose one.
Chat does not need to know about AdventurePlayer or AdventureNpc.
Adventure-specific caps belong to the adventure service. Room speech should
cross into chat through ordinary chat object caps such as ChatParticipant or
a future room-scoped chat object; the chat service should see chat subjects
and channels, not adventure interfaces.
Kernel Contract
The kernel should enforce object-cap invariants and avoid service semantics.
Required invariants:
- Only endpoint owners or explicit mint-authority holders may create a service object cap for a new receiver selector.
- Delegating an existing service object cap preserves the receiver selector.
- Process spawning may copy or move service object caps but may not relabel them.
- Client-held object caps cannot receive or return endpoint messages unless their interface explicitly grants server authority.
- Receiver selectors are scoped to the target endpoint object; no global numeric namespace is part of the ABI.
- Process exit and cap release still drive endpoint cleanup for queued calls, in-flight returns, and server-visible cancellation.
The first compatibility step can keep the current u64 storage field but
change the rules: a delegated client endpoint’s numeric identity is preserved
on re-delegation. The target step renames and narrows the concept from
badge to an opaque receiver selector for service object caps.
Shell And Launcher Contract
Shell users should launch applications, not assign service identities.
Target user shape:
run "chat-client"
run "adventure-client"
Prototype explicit-grant shape while migration is incomplete:
run "chat-client" with { stdio: client @stdio, chat: @chat_participant }
The normal shell must not expose badge N as user-facing authority syntax.
If a grant parser keeps legacy badge syntax for manifest or smoke migration,
the kernel must still reject any delegated-client relabeling.
Omitting a badge in shell syntax preserves the source identity; low-level
legacy badge-zero encodings remain hostile-test inputs and must still fail
closed for nonzero delegated client facets.
External And Network Boundaries
External identity assertions do not open service objects directly. OIDC ID
tokens, passkey assertions, certificate chains, cloud workload tokens, and
remote gateway-authenticated claims first pass through an admission service
that normalizes provider kind, issuer, tenant, and subject; maps the result to
a local or pseudonymous principal when policy allows; and mints a local
subject/proof capability. Imported groups, roles, tenants, acr, amr, device
posture, source network, and token age are ABAC inputs to that mint decision,
not downstream object authority.
Network-transparent capability transport is also out of the first slice. A future bridge should maintain connection-local export/import tables and expose broken-reference semantics on disconnect. It must not serialize local cap-table handles, endpoint generations, receiver selectors, or server cookies as portable authority. Persistent restore, if needed, should go through a capability-bearing naming or persistence service that authorizes and mints a fresh live object.
Migration Plan
The current execution plan lives in
docs/backlog/service-object-identity-migration.md and uses four large chunks.
Gate 0 containment below is already historical substrate. It does not mean the
service-object model is implemented. This proposal records the design sequence;
the backlog owns task breakdown and verification gates.
0. Contain delegated-client relabeling, landed
The kernel and shell paths now reject ordinary delegated-client relabeling. This is containment, not the final model.
1. Core service-object routing and lifecycle, landed
Commit a4655f0 at 2026-04-28 14:10 UTC added the synthetic QEMU service
proof. It covers trusted serviceObject minting, receiver-cookie routing,
copy/move IPC transfer, nested spawn delegation, generation-checked service
receiver cookies, close/revoke rejection, and stale-cookie rejection after
record reuse.
2. Subject/proof root opening
Validate local subject/proof authority before object mint. External assertions must first normalize through admission into local or pseudonymous subject/proof caps.
3. Convert shared-service demos
Move chat, adventure, and stdio/terminal child bridges from caller-selected endpoint identity to root/factory-opened service object caps.
4. Retire legacy endpoint identity
Remove compatibility syntax and rename internal fields once normal smokes no longer depend on caller-selected endpoint identity.
Security Notes
This design keeps the kernel out of role and identity policy. The kernel only knows whether a caller holds a particular object cap and whether transfer rules allow that cap to move. Services decide what their object records mean.
PID, PID@host, and process names are diagnostics. They are not authority: process IDs recycle, hosts need cryptographic naming for federation, and a single subject can legitimately hold multiple service objects with different authority.
The broker and session services remain the right place to validate subjects and policy before a service object is minted. After minting, the object cap is the authority.
Rejected Proposal: Cap’n Proto SQE Envelope
Proposal
Replace the fixed C-layout CapSqe descriptor with a fixed-size padded
Cap’n Proto message. Each SQ slot would contain a serialized single-segment
Cap’n Proto struct with a union for call, recv, return, release, and
finish, then zero padding to the chosen SQE size.
The live ring currently pins each SQ slot to 64 bytes (SQE_SIZE in
capos-config/src/ring.rs), so any Cap’n Proto envelope would either have to
fit inside that budget or motivate a slot-size bump. For a hypothetical 128-byte
slot, the rough layout would be:
+0x00 u32 segment_count_minus_one
+0x04 u32 segment0_word_count
+0x08 word root pointer
+0x10 RingSqe data words, including union discriminant
+0x?? zero padding to 128 bytes
A compact schema would need to keep fields flat to avoid pointer-heavy nested payload structs:
struct RingSqe {
userData @0 :UInt64;
capId @1 :UInt32;
methodId @2 :UInt16;
flags @3 :UInt16;
addr @4 :UInt64;
len @5 :UInt32;
resultAddr @6 :UInt64;
resultLen @7 :UInt32;
callId @8 :UInt32;
union {
call @9 :Void;
recv @10 :Void;
return @11 :Void;
release @12 :Void;
finish @13 :Void;
}
}
Potential Benefits
A Cap’n Proto SQE envelope would make the ring operation shape schema-defined instead of Rust-struct-defined. That has some real advantages:
- The ABI documentation would live in
schema/capos.capnpnext to the capability interfaces. - Future userspace runtimes in Rust, C, Go, or another language could use generated accessors instead of hand-mirroring a packed descriptor layout.
- The operation choice could be represented as a schema union, making it clear that fields meaningful for CALL are not meaningful for RECV or RETURN.
- Cap’n Proto defaulting gives a familiar path for adding optional fields while letting older readers ignore fields they do not understand.
- Ring dumps and traces could be decoded with generic Cap’n Proto tooling.
- A single “everything crossing this boundary is Cap’n Proto” rule is architecturally simpler to explain.
Those benefits are mostly about schema uniformity, generated bindings, and tooling. They do not remove the need for an operation discriminator; they move it from an explicit fixed descriptor field to a Cap’n Proto union tag.
Rationale For Rejection
The SQE is the fixed control-plane descriptor for a hostile kernel boundary. It should be cheap to classify and validate before any operation-specific payload parsing. A Cap’n Proto SQE envelope would still have a discriminator, but would move it into generated reader state and require Cap’n Proto message validation before the kernel even knows whether the entry is a CALL, RECV, or RETURN.
The current shape concentrates that hostile-input validation in one place:
sqe_wire_validation_error in capos-config/src/ring.rs is the single source
of truth shared by the kernel dispatch path and the sqe_validation fuzzer
under fuzz/fuzz_targets/. Replacing the descriptor with a Cap’n Proto
message would push some of that validation into generated reader state and
split the fuzz surface across the framing parser and the per-opcode predicates.
Cap’n Proto framing also consumes slot space: a single-segment message needs a segment table and root pointer before the struct data. The live 64-byte slot would not fit a Cap’n Proto envelope without either dropping fields or growing the slot; a 128-byte envelope would spend much of the slot on framing and padding. Nested payload structs are worse because they add pointers inside the ring descriptor.
The accepted split is:
- fixed
#[repr(C)]ring descriptors for SQ/CQ control state; - Cap’n Proto for capability method params, results, and higher-level transport payloads where schema evolution is valuable;
- endpoint delivery metadata in a small fixed
EndpointMessageHeaderfollowed by opaque params bytes.
EndpointMessageHeader is concretely 56 bytes today (see the static-size
assertion in capos-config/src/ring.rs), which keeps the endpoint delivery
header well under one cache line while leaving payload bytes opaque to the
kernel.
There is also a layering issue. The capability ring is part of the local Cap’n Proto transport implementation: it is the mechanism that moves capnp calls, returns, and eventually release/finish/promise bookkeeping between a process and the kernel. The SQE itself is therefore below ordinary Cap’n Proto message usage. Making the transport substrate depend on parsing Cap’n Proto messages to discover which transport operation to perform would couple the transport implementation to the protocol it is supposed to carry. Method params and results are proper Cap’n Proto messages; the ring descriptor is the framing/control structure that gets the transport to the point where those messages can be interpreted.
This keeps queue geometry simple, preserves bounded hostile-input handling, and avoids running a Cap’n Proto parser on the hot descriptor path.
Related Documents
- Ring v2 SMP Proposal – forward path for ring
geometry that keeps the fixed-layout descriptor and negotiates
sqe_sizerather than wrapping each slot in a Cap’n Proto message. - ABI Evolution Policy – how non-capnp ring ABIs (including SQE/CQE layouts) evolve alongside the Cap’n Proto schema.
- Error Handling Proposal – where Cap’n Proto
does sit on the dispatch path:
CapExceptionpayloads carried in SQE result buffers.
Rejected Proposal: Sleep(INF) Process Termination
Concern
Unix-style zombies are a poor fit for capOS. A terminated child should not keep
its address space, cap table, endpoint state, or other authority alive merely
because a parent has not waited yet. The remaining observable state should be a
small, capability-scoped completion record, and only holders of the corresponding
ProcessHandle should be able to observe it.
The current ProcessHandle.wait() -> exitCode :Int64 shape is also too weak for
future lifecycle semantics. Raw numeric status cannot distinguish normal
application exit from abandon, kill, fault, startup failure, runtime panic, or
supervisor policy actions without inventing process-wide magic numbers.
Proposal
Introduce a system sleep operation and treat Sleep(INF) as a special terminal
operation. The argument for this spelling is that a process that never wants to
run again can enter an infinite sleep instead of becoming a zombie. The kernel
would recognize the infinite case and handle it specially:
- finite
Sleep(duration)blocks the process and wakes it later; Sleep(INF)never wakes, so the kernel tears down the process;- the process’s authority is released as if it had exited;
- parent-visible process completion is either omitted or reported as a special status.
A variant also removes the dedicated sys_exit syscall and makes
Sleep(INF) the only user-visible process termination primitive.
Candidate Semantics
Sleep(INF) as Exit(0)
The simplest version maps Sleep(INF) to normal successful exit.
This is rejected because it lies about intent. A program that completed successfully, a program that intentionally detached, and a program that chose to disappear without status are not the same lifecycle event. Supervisors would see the same status for all of them.
Sleep(INF) as Abandoned
A less lossy version gives Sleep(INF) a distinct terminal status:
struct ProcessStatus {
union {
exited @0 :ApplicationExit;
abandoned @1 :Void;
killed @2 :KillReason;
faulted @3 :FaultInfo;
startupFailed @4 :StartupFailure;
}
}
struct ApplicationExit {
code @0 :Int64;
}
ProcessHandle.wait() would return status :ProcessStatus instead of a bare
exitCode :Int64. Normal application termination returns exited(code), while
Sleep(INF) returns abandoned.
This fixes the type problem, but leaves the operation name wrong. Sleep normally means the process remains alive and keeps its authority until a wake condition. The infinite special case would instead release authority, reclaim memory, cancel endpoint state, complete process handles, and make the process impossible to wake. That is termination, not sleep.
Sleep(INF) as Detached No-Status Termination
Another version treats Sleep(INF) as detached termination and gives parents no
status. That avoids inventing an exit code, but it weakens supervision. Init and
future service supervisors need a definite terminal event to implement restart
policy, diagnostics, dependency failure reporting, and “wait for all children”
flows. A missing status is not a useful status.
Remove sys_exit Through a Typed Lifecycle Capability
Removing the dedicated sys_exit syscall is a separate, plausible future
direction. The cleaner version is not Sleep(INF), but an explicit lifecycle
operation:
interface ProcessSelf {
terminate @0 (status :ProcessStatus) -> ();
abandon @1 () -> ();
}
interface ProcessHandle {
wait @0 () -> (status :ProcessStatus);
}
The process would receive ProcessSelf only for itself. Calling terminate
would be non-returning in practice: the kernel would process the request,
release process authority, complete any ProcessHandle waiter with the typed
status, and not post an ordinary success completion back to the dying process.
The transport shape needs care. A generic Cap’n Proto call normally expects a completion CQE, but a self-termination operation cannot safely rely on the dying process to consume one. Viable implementations include:
- a dedicated ring operation such as
CAP_OP_EXITtargeting a self-lifecycle cap; - a
ProcessSelf.terminatecall whose method is explicitly non-returning and never posts a CQE to the caller; - keeping
sys_exittemporarily until ring-level non-returning operations have explicit ABI and runtime support.
This path removes the ambient exit syscall without overloading sleep. It also forces terminal status to become typed before kill, abandon, restart policy, or fault reporting are added.
Rationale For Rejection
Sleep(INF) solves the wrong abstraction problem. The zombie problem is not
that a process needs a forever-blocked state. The problem is retaining process
resources after terminal execution. capOS should solve that by separating
process lifetime from process-status observation:
- process termination immediately releases authority and reclaims process resources;
- a
ProcessHandleis only observation authority, not ownership of the live process; - if a handle exists, a small completion record may remain until it is waited or released;
- if no handle exists, terminal status can be discarded;
- no ambient parent process table is needed.
Under that model, a sleeping process remains alive and authoritative, while a
terminated process does not. Special-casing Sleep(INF) to perform teardown
would make the name actively misleading and would create a hidden terminal
operation with different semantics from finite sleep.
The accepted direction is therefore:
- keep explicit process termination semantics;
- replace raw
exitCode :Int64with typedProcessStatusbefore adding more lifecycle states; - keep the minimal terminal self-exit ABI until a typed self-lifecycle capability or ring operation can replace it cleanly;
- add future
Timer.sleep(duration)only for real sleep, where the process remains alive and may wake.
Sleep(INF) remains rejected as a termination primitive. The concern it raises
is valid, but the solution is typed terminal status plus status-record cleanup,
not infinite sleep.
Papers
Long-form research write-ups produced from the capOS codebase. Each paper is
typeset with Typst from sources under papers/<slug>/ in the repository and
published as a PDF alongside this site.
Schema-as-ABI: Typed Capabilities and Ring-Transport Dispatch in capOS
A pre-evidence draft describing the schema-as-ABI thesis: Cap’n Proto schemas
acting as kernel ABI, access-control mechanism, IPC wire format, and (planned)
persistence and network-transparency substrate, layered over a shared-memory
SQ/CQ ring with a two-syscall surface (exit and cap_enter).
The draft separates closed contributions (capability ring transport,
exactly-once accounting rollback, capability lifecycle, the verification
stack) from evidence-gated claims that depend on outstanding artifacts (C1
service-object migration, C2 measurement run, C3 persistence proof-of-concept,
C4 network-transparency proof-of-concept). Sections that depend on missing
artifacts are flagged with TODO admonitions naming the gap and the entry in
docs/paper/evidence-gaps.md
that closes them.
Source: papers/schema-as-abi/main.typ in the repository. Build locally with
make paper; the same target runs in make cloudflare-pages-build and
publishes the PDF at the link above.
Research Deep-Dive Index
The pages under docs/research/ are deep-dive reports informing capOS design
decisions. Proposals and design notes cite them as grounding for capability
model, IPC, scheduling, networking, error handling, runtime, agent, and
prior-art choices. The Capability-Based and Microkernel Operating Systems
survey records the cross-system design
consequences pulled from this body of research; the entries below give the full
alphabetical listing of individual reports for direct discovery.
Start here:
- Research: Capability-Based and Microkernel Operating Systems – Cross-system survey synthesizing the design consequences for capOS (capability table, IPC, memory, scheduling, persistence, VFS, resource accounting, language support).
Individual reports:
- Research: Browser Engines, Document Engines, and Agent Browsers – Browser engine portability, cap-native document-engine options, and agent-browser patterns for capOS browser capabilities.
- Cap’n Proto Error Handling: Research Notes – Prior-art on capnp-rpc error semantics.
- Cloud DMA Provider Evidence Inventory – Official AWS/Azure/GCP device-surface facts, evidence-matrix schema, live guest-probe checklist, and fail-closed classification rules for the cloud DMA backend decision.
- Research: Cloudflare, Cap’n Proto, Workers RPC, and Cap’n Web – Cloudflare Workers, workerd, Durable Objects, Workers RPC, Cap’n Web, and Cloudflare’s production use of Cap’n Proto/KJ.
- Completion Rings And Threaded Runtimes – Io_uring-style transports under threaded runtimes.
- Crash Recovery and Supervision – Prior-art survey of supervision trees, restart budgets, and death-observation semantics across Erlang/OTP, systemd, Kubernetes, Fuchsia, seL4, and Genode.
- Debug, Trace, and Profiling Authority – Prior-art survey of debug/trace/profile authority models (GDB RSP, Linux ptrace/Yama, perf/eBPF, Fuchsia, seL4, Genode) for the Debug and Trace proposal.
- DMA User-Space Driver Isolation – DMA, user-space driver, vIOMMU, and no-IOMMU bounce-buffer design consequences for capOS device authority.
- Research: EROS, CapROS, and Coyotos – Persistent capability-system lineage.
- Research: Future Scheduler Architecture – Survey of modern scheduler algorithms and architectures for capOS scheduler evolution.
- Research: Game Mechanics Prior Art – Grounded mechanics research for Aurelian Frontier seasonal play, markets, construction, and tactical combat.
- Genode OS Framework: Research Report for capOS – Componentized OS framework.
- Research: Hosted Agent Harnesses – OpenClaw-like harnesses, swarms, memory/wiki systems, and agent orchestration research for capOS-hosted agents.
- Research: HPC Parallel Patterns – HPC benchmark and programming-model grounding for generic parallel processing patterns.
- IOMMU Remapping Grounding – Primary-source grounding for future Intel VT-d, AMD-Vi, and QEMU IOMMU remapping work.
- IX-on-capOS Hosting Research – IX as a package corpus, content-addressed build/store model, and a capability-native build-service surface for capOS.
- Research: Linux Sandboxes And Virtualization For Workloads – Linux sandbox, container, gVisor, KVM, microVM, and CPU-isolation prior art for generic Linux workload execution.
- LLVM Target Customization for capOS – Requirements for a custom LLVM target triple.
- Research: Multimedia Pipeline Latency – Survey of PipeWire and JACK design lessons for the capOS multimedia graph.
- Research: NO_HZ, SQPOLL, and Realtime Scheduling – Linux NO_HZ, io_uring SQPOLL, CPU isolation, PREEMPT_RT, SCHED_DEADLINE, and seL4 MCS grounding for capOS timer and realtime design.
- OS Error Handling in Capability Systems: Research Notes – Cross-OS error-model comparison.
- Research: Out-of-Kernel Scheduling – Userspace scheduling prior art.
- Research: Paperclips Clean-Room Functional Spec – Clean-room functional mechanics summary and capOS Paperclips improvement candidates.
- Pingora Architecture and Philosophy: Research Report for capOS – Proxy/server framework as a userspace runtime case study.
- Research: Plan 9 from Bell Labs and Inferno OS – Namespace-oriented systems.
- Research: Realtime Multimodal Agent APIs – Provider API survey for realtime native-audio, multimodal, tool-using agents and their consequences for capOS agent surfaces.
- Research: Robotics Realtime Control – Robotics realtime-control practice and the consequences for using capOS as a robot brain.
- Research: Scientific Agent-Lab Software Stack – Scientific computing, solver, proof-assistant, notebook, and reproducible-package prior art for a capOS-hosted LLM research lab.
- seL4 Deep Dive: Lessons for capOS – Microkernel and capability reference.
- seL4 HAMR: Model-Based High-Assurance Engineering – Evaluation of HAMR (AADL/Slang/CAmkES) versus the capOS Cap’n Proto schema-as-contract model.
- Small Open-Weights LLM Survey for the capOS Agent-Shell – Model candidates for the on-ISO local LLM.
- Research: Spritely, OCapN, and CapTP – Spritely, OCapN, CapTP, netlayers, locators, Syrup, promise pipelining, handoffs, and capability-network lessons for capOS.
- Research: Time and Clock Authority – Prior-art survey of OS clock IDs, NTP/PTP discipline, leap-second handling, time namespaces, and Fuchsia UTC clock objects for the capOS time and clock design.
- x2APIC and APIC Virtualization – Interrupt routing on modern x86.
- Fuchsia Zircon Kernel: Research Report for capOS – Handle-based OS reference.
Research: Capability-Based and Microkernel Operating Systems
Survey of existing systems to inform capOS design decisions across IPC, scheduling, capability model, persistence, VFS, and language support.
This survey records the cross-system design consequences; the research index lists the individual deep-dive reports. Read the consequences below first; open individual reports only when their design context is relevant.
Design consequences for capOS
- Keep the flat generation-tagged capability table; seL4-style CNode hierarchy is not needed until delegation patterns demand it.
- Treat the typed Cap’n Proto interface as the permission boundary; avoid a parallel rights-bit system that would drift from schema semantics.
- Continue the ring transport plus direct-handoff IPC path, with shared memory
reserved for bulk data once
SharedBuffer/MemoryObjectexists. - Treat seL4-style endpoint badges as historical receiver metadata, not as the active service identity model; use move/copy transfer descriptors, object-epoch revocation, and session-bound invocation context to make authority delegation explicit and reviewable.
- Model session lifetime as revocable liveness state plus grant leases, not as generic capability expiry. EROS/CapTP-style revocation-by-indirection and Genode-style session closure are better precedents than refreshing every old reference in place.
- Keep persistence explicit through Store/Namespace capabilities; do not adopt EROS-style transparent global checkpointing as a kernel baseline.
- Push POSIX compatibility and VFS behavior into libraries and services rather than adding a kernel global filesystem namespace.
- Add resource donation, scheduling-context donation, notification objects, and runtime/thread primitives only when the corresponding service or runtime path needs them.
- Use Pingora-style lifecycle frameworks only above the capability transport: userspace service libraries can provide phase hooks, per-request context, readiness, graceful shutdown, retry policy, and observability, while kernel interfaces remain narrow typed capabilities with explicit authority.
Individual deep-dive reports:
- seL4 – formal verification, CNode/CSpace, IPC fastpath, MCS scheduling
- Fuchsia/Zircon – handles with rights, channels, VMARs/VMOs, ports, FIDL vs Cap’n Proto
- Plan 9 / Inferno – per-process namespaces, 9P protocol, file-based vs capability-based interfaces
- EROS / CapROS / Coyotos – persistent capabilities, single-level store, checkpoint/restart
- Genode – session routing, VFS plugins, POSIX compat, resource trading, Sculpt OS
- LLVM target customization – target triples, TLS models, Go runtime requirements
- Linux sandboxes and virtualization for workloads – Linux namespaces, cgroup v2, seccomp, Landlock, bubblewrap, nsjail, systemd-nspawn, OCI runtimes and images, User-Mode Linux, gVisor, QEMU/KVM, Firecracker, Kata Containers, and capOS auto full-nohz interaction grounding for generic Linux workload execution, familiar user environments, and agent-initiated jobs
- Cap’n Proto error handling – protocol, schema, and Rust crate error behavior used by the capOS error model
- Cloudflare, Cap’n Proto, Workers RPC, and Cap’n Web – Cloudflare Workers, workerd, Durable Objects, Workers RPC, Cap’n Web, and production Cap’n Proto/KJ lessons for capOS remote-capability design
- Spritely, OCapN, and CapTP – object capability network protocols, netlayers, locators, sturdyrefs, Syrup, promise pipelining, distributed GC, and third-party handoffs
- Browser engines, document engines, and agent browsers – mainstream browser engine portability, cap-native document-engine substrate options, automation protocols, Donut Browser-style profile orchestration, and implications for visual and agent/shell browser capabilities
- OS error handling – error patterns in capability systems and microkernels used by the capOS error model
- IX-on-capOS hosting – clean integration of IX package/build model via MicroPython control plane, native template rendering, Store/Namespace, and build services
- Out-of-kernel scheduling – whether scheduler policy can move to user space, and which dispatch/enforcement mechanisms must stay in kernel
- Completion rings and threaded runtimes –
completion ownership,
io_uring, futex, and IOCP precedents for capOS’s full-SMP ring/threading ABI - x2APIC and APIC virtualization – x2APIC backend direction, QEMU/KVM validation constraints, and why the current xAPIC MMIO LAPIC path should remain the Phase C foundation
- IOMMU remapping – primary-source Intel VT-d, AMD-Vi, and QEMU grounding for future real DMA remapping work, while current capOS remains diagnostics-only with direct DMA blocked
- Cloud DMA provider evidence inventory – official AWS/Azure/GCP device-surface facts, the evidence-matrix schema, the live guest-probe checklist, and the fail-closed classification rules the cloud DMA backend decision consumes
- Future scheduler architecture – Linux CFS/EEVDF, SCHED_DEADLINE, sched_ext, FreeBSD ULE, seL4 MCS, ghOSt, scheduler activations, Shenango, Caladan, Shinjuku, and Arachne lessons for capOS per-CPU queues, CPU accounting, fair scheduling, scheduling contexts, CPU isolation leases, realtime islands, and user-space scheduler policy
- NO_HZ, SQPOLL, and realtime scheduling – Linux NO_HZ, clocksource/clockevent, CPU isolation/housekeeping, io_uring SQPOLL, SCHED_DEADLINE, PREEMPT_RT, and seL4 MCS grounding for capOS tickless idle, SQPOLL nohz, scheduling contexts, and realtime islands
- HPC parallel patterns – Berkeley dwarfs, NAS Parallel Benchmarks, HPL/LINPACK, HPCG, Graph500, MPI collectives, and OpenMP loop/task/reduction grounding for future single-node and multi-node parallel benchmark coverage
- Scientific agent-lab software stack – PARI/GP, SageMath, GAP, Singular, OSCAR, SymPy, SciPy, R, Octave, JupyterLab, Z3, cvc5, HiGHS, SCIP, OR-Tools, JuMP, CVXPY, Lean/mathlib, Rocq, Isabelle, Agda, Spack, Guix-HPC, Nix, and Apptainer grounding for a future capOS scientific standard package and LLM agent research lab
- Pingora – phase-oriented service framework design, operational lifecycle, pooling/retry lessons, and why capOS should borrow the userspace library shape without importing Pingora’s HTTP or process model. The concrete capOS follow-up is capos-service, starting with terminal/networking lifecycle rather than HTTP.
- Multimedia pipeline latency – PipeWire and JACK lessons for a capOS media graph optimized for the minimal possible guaranteed-stable stack latency, explicit latency ranges, admitted realtime islands, and xrun/deadline telemetry
- Realtime multimodal agent APIs – OpenAI Realtime, Google AI Gemini Live API, and Vertex AI Live API implications for capOS voice agent-shell, realtime media sessions, tool-call gating, and provider adapters
- Hosted agent harnesses – OpenClaw-like harness controls, hosted agent swarms, LLM-maintained wiki memory, schema-guided reasoning, MCP/A2A-style adapters, and implications for capability-scoped capOS agent services
- Game mechanics prior art – Stardew Valley, EVE Online, and Evil Islands mechanics translated into capability-shaped Aurelian Frontier calendar, market, construction, and combat tasks
- Robotics realtime control – ROS 2, micro-ROS, ros2_control, seL4 MCS, PREEMPT_RT, Xenomai, Orocos, Nav2, PX4, ArduPilot, Autoware, and OPC UA lessons for using capOS as a robot brain with explicit actuator authority and admitted realtime islands
Cross-Cutting Analysis
1. Capability Table Design
All surveyed systems store capabilities as process-local references to kernel objects. The key design variable is how capabilities are organized.
| System | Structure | Lookup | Delegation | Revocation |
|---|---|---|---|---|
| seL4 | Tree of CNodes (power-of-2 arrays with guard bits) | O(depth) | Subtree (grant CNode cap) | CDT (derivation tree), transitive |
| Zircon | Flat per-process handle table | O(1) | Transfer through channels (move) | Close handle; refcount; no propagation |
| EROS | 32-slot nodes forming trees | O(depth) | Node key passing | Forwarder keys (O(1) rescind) |
| Genode | Kernel-enforced capability references | O(1) | Parent-mediated session routing | Session close |
| capOS | Flat table with generation-tagged CapId, hold-edge metadata, and Arc<dyn CapObject> backing | O(1) | Manifest exports plus copy/move transfer descriptors through Endpoint IPC | Local release/process exit, object-epoch revocation for child-local grants, and target session liveness/grant-lease checks |
Recommendation for capOS: Keep the flat table. It is simpler than seL4’s CNode tree and sufficient for capOS’s use cases. Augment each entry with:
- Hold-edge metadata – transfer scope, disclosure scope, object id/epoch, and any legacy receiver metadata needed for transport compatibility.
- Generation counter (from Zircon) – upper bits of CapId detect stale references after a slot is reused. (Implemented.)
- Epoch (from EROS) – per-object revocation epoch. Incrementing the epoch invalidates all outstanding references. O(1) revoke, O(1) check.
- Session/grant lease reference (from Genode/EROS/CapTP-style lifecycle lessons) – a future pointer to mutable liveness or grant state so logout, renewal, and revocation do not require scanning all cap tables or relabeling a running process.
Not adopted: per-entry rights bitmask. Zircon and seL4 use rights bitmasks
(READ/WRITE/EXECUTE) because their handle/syscall interfaces are untyped.
capOS uses Cap’n Proto typed interfaces where the schema defines what methods
exist. Method-level access control is the interface itself – to restrict what
a caller can do, grant a narrower capability (a wrapper CapObject that
exposes fewer methods). A parallel rights system would create an impedance
mismatch: generic flags (READ/WRITE) mapped arbitrarily onto typed methods.
Meta-rights for the capability reference itself (TRANSFER/DUPLICATE) may be
added when Stage 6 IPC needs them. See Capability Model
for the full rationale.
2. IPC Design
IPC is the most performance-critical kernel mechanism. Every capability invocation across processes goes through it.
| System | Model | Latency (round-trip) | Bulk data | Async |
|---|---|---|---|---|
| seL4 | Synchronous endpoint, direct context switch | ~240 cycles (ARM), ~400 cycles (x86) | Shared memory (explicit) | Notification objects (bitmask signal/wait) |
| Zircon | Channels (async message queue, 64KiB + 64 handles) | ~3000-5000 cycles | VMOs (shared memory) | Ports (signal-based notification) |
| EROS | Synchronous domain call | ~2x L4 | Through address space nodes | None (synchronous only) |
| Plan 9 | 9P over pipes (kernel-mediated) | ~5000+ cycles | Large reads/writes (iounit) | None (blocking per-fid) |
| Genode | RPC objects with session routing | Varies by kernel (uses seL4/NOVA/Linux underneath) | Shared-memory dataspaces | Signal capabilities |
Recommendation for capOS: Continue the dual-path IPC design:
Fast synchronous path (seL4-inspired, for RPC):
- When process A calls a capability in process B and B is blocked waiting, perform a direct context switch (A -> kernel -> B, no unrelated scheduler pick). The current single-CPU direct handoff is implemented.
- Future fastpath work can transfer small messages (<64 bytes) through registers during the switch instead of copying through ring buffers.
Async submission/completion rings (io_uring-inspired, for batching):
- SQ/CQ in shared memory for batched capability invocations. This is the current transport for CALL/RECV/RETURN/RELEASE/NOP.
- Support SQE chaining for Cap’n Proto promise pipelining.
- Use Spritely/OCapN CapTP as the prior-art shape for remote capability sessions, third-party handoffs, answer namespaces, and distributed reference-release accounting, but do not treat current OCapN drafts as a frozen capOS ABI.
- Signal/notification delivery through CQ entries (from Zircon ports).
- User-queued CQ entries for userspace event loop integration.
Bulk data (Zircon/Genode-inspired):
SharedBuffercapability for zero-copy data transfer between processes.- Capnp messages for control plane; shared memory for data plane.
- Critical for file I/O, networking, and GPU rendering.
3. Memory Management Capabilities
Zircon’s VMO/VMAR model is the most mature capability-based memory design. The Go runtime proposal shows why these primitives are essential.
VirtualMemory capability (baseline implemented; still central for Go and advanced allocators):
interface VirtualMemory {
map @0 (hint :UInt64, size :UInt64, prot :UInt32) -> (addr :UInt64);
unmap @1 (addr :UInt64, size :UInt64) -> ();
protect @2 (addr :UInt64, size :UInt64, prot :UInt32) -> ();
}
MemoryObject capability (needed for IPC bulk data, shared libraries).
Zircon calls this concept a VMO (Virtual Memory Object); capOS uses the name
SharedBuffer – see docs/proposals/storage-and-naming-proposal.md for the canonical
interface definition.
interface MemoryObject {
read @0 (offset :UInt64, count :UInt64) -> (data :Data);
write @1 (offset :UInt64, data :Data) -> ();
getSize @2 () -> (size :UInt64);
createChild @3 (offset :UInt64, size :UInt64, options :UInt32)
-> (child :MemoryObject);
}
4. Scheduling
| System | Model | Priority inversion solution | Temporal isolation |
|---|---|---|---|
| seL4 (MCS) | Scheduling Contexts (budget/period/priority) + Reply Objects | SC donation through IPC (caller’s SC transfers to callee) | Yes (budget enforcement per SC) |
| Zircon | Fair scheduler with profiles (deadline, capacity, period) | Kernel-managed priority inheritance | Profiles provide some isolation |
| Genode | Delegated to underlying kernel (seL4/NOVA/Linux) | Depends on kernel | Depends on kernel |
| Out-of-kernel policy | Kernel dispatch/enforcement + user-space policy service | Scheduling-context donation through IPC | Kernel-enforced budgets, user-chosen policy |
| User-space runtimes | M:N work stealing, fibers, async tasks over kernel threads | Requires futexes, runtime cooperation, and OS-visible blocking events | Usually runtime-local only |
Recommendation for capOS: Start with round-robin (already done). When implementing priority scheduling:
- Add scheduling context donation for synchronous IPC: when process A calls process B, B inherits A’s priority and budget. Prevents inversion through the capability graph.
- Support passive servers (from seL4 MCS): servers without their own scheduling context that only run when called, using the caller’s budget. Natural fit for capOS’s service architecture.
- Add temporal isolation (budget/period per scheduling context) for the cloud deployment scenario.
For moving scheduler policy out of the kernel, see Out-of-kernel scheduling. The key finding is a split between kernel dispatch/enforcement and user-space policy: dispatch, budget enforcement, and emergency fallback remain privileged, while admission control, budgets, priorities, CPU masks, and SQPOLL/core grants can be represented as policy managed by a scheduler service. Thread creation, thread handles, scheduling contexts, and park authority should be capability-based from the start; the remaining research task is measurement: compare generic capnp/ring calls against compact capability-authorized park-shaped operations before deciding the park hot-path encoding.
5. Persistence
| System | Model | Consistency | Application effort |
|---|---|---|---|
| EROS/CapROS | Transparent global checkpoint (single-level store) | Strong (global snapshot) | None (automatic) |
| Plan 9 | User-mode file servers with explicit writes | Per-file server | Full (explicit save/load) |
| Genode | Application-level (services manage own persistence) | Per-component | Full |
| capOS (planned) | Content-addressed Store + Namespace caps | Per-service | Full (explicit capnp serialize) |
Recommendation for capOS: Three phases, as informed by EROS:
- Explicit persistence (current plan) – services serialize state to the Store capability as capnp messages. Simple, gives services control.
- Opt-in Checkpoint capability – kernel captures process state (registers, memory, cap table) as capnp messages stored in the Store. Enables process migration and crash recovery for services that opt in.
- Coordinated checkpointing – a coordinator service orchestrates consistent snapshots across multiple services.
Persistent capability references (from EROS + Cap’n Proto):
struct PersistentCapRef {
interfaceId @0 :UInt64;
objectId @1 :UInt64;
permissions @2 :UInt32;
epoch @3 :UInt64;
}
Do NOT implement EROS-style transparent global persistence. The kernel complexity is enormous, debuggability is poor, and Cap’n Proto’s zero-copy serialization already provides near-equivalent benefits for explicit persistence.
6. Namespace and VFS
Plan 9’s per-process namespace is the closest analog to capOS’s per-process
capability table. The key insight: Plan 9’s bind/mount with union
semantics provides composability that capOS’s current Namespace design lacks.
Recommendation: Extend Namespace with union composition:
enum UnionMode { replace @0; before @1; after @2; }
interface Namespace {
resolve @0 (name :Text) -> (hash :Data);
bind @1 (name :Text, hash :Data) -> ();
list @2 () -> (names :List(Text));
sub @3 (prefix :Text) -> (ns :Namespace);
union @4 (other :Namespace, mode :UnionMode) -> (merged :Namespace);
}
VFS as a library (from Genode): libcapos-posix should be an in-process
library that translates POSIX calls to capability invocations. Each POSIX
process receives a declarative mount table (capnp struct) mapping paths to
capabilities. No VFS server needed.
FileServer capability (from Plan 9): For resources that are naturally
file-like (config trees, debug introspection, /proc-style interfaces),
provide a FileServer interface. Not universal (as in Plan 9) but available
where the file metaphor fits.
7. Resource Accounting
Genode’s session quota model addresses a gap in capOS: without resource accounting, a malicious client can exhaust a server’s memory by creating many sessions.
Recommendation: Session-creating capability methods should accept a resource donation parameter:
interface NetworkManager {
createTcpSocket @0 (bufferPages :UInt32) -> (socket :TcpSocket);
}
The client donates buffer memory as part of the session creation. The server allocates from donated resources, not its own.
8. Language Support Roadmap
From the LLVM research, the recommended order:
| Step | What | Blocks |
|---|---|---|
| 1 | Custom target JSON (x86_64-unknown-capos) | Done for booted userspace crates |
| 2 | VirtualMemory capability | Done for baseline map/unmap/protect; Go allocator glue remains |
| 3 | TLS support (PT_TLS parsing, FS base save/restore) | Done for static ELF processes and current-process ThreadControl; per-thread TLS remains |
| 4 | Park authority capability + measured ABI | Go threads, pthreads |
| 5 | Timer capability (monotonic clock) | Done for monotonic now/sleep; wall-clock and event timers remain future work |
| 6 | Go Phase 1: minimal GOOS=capos (single-threaded) | Runtime capability checkpoint done; Go fork remains |
| 7 | Kernel threading | Go GOMAXPROCS>1 |
| 8 | C toolchain + libcapos | C programs, musl |
| 9 | Go Phase 2: multi-threaded + concurrent GC | Go network services |
| 10 | Go Phase 3: network poller | net/http on capOS |
Key decisions:
- Keep
x86_64-unknown-nonefor kernel,x86_64-unknown-caposfor userspace. - Use
local-execTLS model (static linking, no dynamic linker). - Implement park as capability-authorized from the start. Because it operates on memory addresses and must be fast, measure generic capnp/ring calls against a compact capability-authorized operation before fixing the ABI.
- Go can start with cooperative-only preemption (no signals).
Recommendations by Roadmap Stage
Stage 5: Scheduling
| Source | Recommendation | Priority |
|---|---|---|
| Zircon | Generation counter in CapId (stale reference detection) | Done |
| seL4 | Add notification objects (lightweight bitmask signal/wait) | Medium |
| LLVM | Custom target JSON for userspace (x86_64-unknown-capos) | Done |
| LLVM | Per-thread TLS state for Go/threading | Medium |
Stage 6: IPC and Capability Transfer
| Source | Recommendation | Priority |
|---|---|---|
| seL4 | Direct-switch IPC for synchronous cross-process calls | Done baseline |
| seL4 | Badge field on capability entries for server-visible caller identity | Historical / rejected as service identity; see Rejected: Endpoint Badges as Service Identity |
| Zircon | Move semantics for capability transfer through IPC | Done |
| Zircon | MemoryObject capability (shared memory for bulk data) | Done baseline |
| EROS | Epoch-based revocation (O(1) revoke, O(1) check) | High |
| Zircon | Sideband capability-transfer descriptors and result-cap records | Done baseline |
| Genode | SharedBuffer capability for data-plane transfers | High |
| Plan 9 | Promise pipelining (SQE chaining in async rings) | Medium |
| Genode | Session quotas / resource donation on session creation | Medium |
| seL4 | Scheduling context donation through IPC | Medium |
| Plan 9 | Namespace union composition (before/after/replace) | Low |
Post-Stage 6 / Future
| Source | Recommendation | Priority |
|---|---|---|
| seL4 | MCS scheduling (passive servers, temporal isolation) | When needed |
| EROS | Opt-in Checkpoint capability for process persistence | When needed |
| Genode | Dynamic manifest reconfiguration at runtime | When needed |
| Plan 9 | exportfs-pattern capability proxy for network transparency | When needed |
| EROS | PersistentCapRef struct in capnp for storing capability graphs | When needed |
| seL4 | Rust-native formal verification (track Verus/Prusti) | Long-term |
Design Decisions Validated
Several capOS design choices are validated by this research:
-
Cap’n Proto as the universal wire format. Superior to FIDL (random access, zero-copy, promise pipelining, persistence-ready). The right choice. See Zircon Section 5.
-
Flat capability table. Simpler than seL4’s CNode tree, sufficient for capOS. Only add complexity (CNode-like hierarchy) if delegation patterns demand it. See seL4 Section 4.
-
No ambient authority. Every surveyed capability OS confirms this is essential. EROS proved confinement. seL4 proved integrity. capOS has this by design.
-
Explicit persistence over transparent. EROS’s single-level store is elegant but the kernel complexity is enormous. Cap’n Proto zero-copy gives most of the benefits. See EROS, CapROS, Coyotos Section 6.
-
io_uring-inspired async rings. Better than Zircon’s port model for capOS (operation-based > notification-based). See Zircon Section 4.
-
VFS as library, not kernel feature. Genode’s approach, matched by capOS’s planned
libcapos-posix. See Genode Section 3. -
No fork(). Genode has operated without fork() for 15+ years, proving it unnecessary. See Genode Section 4.
Design Gaps Identified
-
Bulk data path is only a substrate. Copying capnp messages through the kernel works for control but not for file/network/GPU data. MemoryObject now provides the mapped-frame substrate; service-facing SharedBuffer APIs remain future Stage 6+ work.
-
Resource accounting is partially unified. The authority-accounting design exists, and VirtualMemory plus FrameAllocator/MemoryObject frame grants now charge the process
ResourceLedger::frame_grant_pagescounter. Future shared-buffer, DMA, log-volume, and CPU-budget resources still need the same treatment. -
No notification primitive. seL4 notifications (lightweight bitmask signal/wait) are needed for interrupt delivery and event notification without full capnp message overhead.
-
No per-thread TLS object yet. Static ELF TLS, context-switch FS-base state, and current-process ThreadControl exist, but future user threads still need independently settable FS bases per thread.
References
See individual deep-dive reports for full reference lists. Key primary sources:
- Klein et al., “seL4: Formal Verification of an OS Kernel,” SOSP 2009
- Lyons et al., “Scheduling-context capabilities,” EuroSys 2018
- Shapiro et al., “EROS: A Fast Capability System,” SOSP 1999
- Shapiro & Weber, “Verifying the EROS Confinement Mechanism,” IEEE S&P 2000
- Pike et al., “The Use of Name Spaces in Plan 9,” OSR 1993
- Feske, “Genode Foundations” (genode.org/documentation)
- Fuchsia Zircon kernel documentation (fuchsia.dev)
seL4 Deep Dive: Lessons for capOS
Research notes on seL4’s design, covering formal verification, capability model, IPC, scheduling, and applicability to capOS.
Primary sources: “seL4: Formal Verification of an OS Kernel” (Klein et al., SOSP 2009), seL4 Reference Manual (v12.x / v13.x), “The seL4 Microkernel – An Introduction” (whitepaper, 2020), “Towards a Verified, General-Purpose Operating System Kernel” (Klein et al., 2008), “Principled Approach to Kernel Design for MCS” (Lyons et al., 2018), seL4 source code and API documentation.
1. Formal Verification Approach
What seL4 Proves
seL4 is the first general-purpose OS kernel with a machine-checked proof of functional correctness. The verification chain establishes:
-
Functional correctness: The C implementation of the kernel refines (faithfully implements) an abstract specification written in Isabelle/HOL. Every possible execution of the C code corresponds to an allowed behavior in the abstract spec. This is not “absence of some bug class” – it is a complete behavioral equivalence between spec and code.
-
Integrity (access control): The kernel enforces capability-based access control. A process cannot access a kernel object unless it holds a capability to it. This is proven as a consequence of functional correctness: the spec defines access rules, and the implementation provably follows them.
-
Confidentiality (information flow): In the verified configuration, information cannot flow between security domains except through explicitly authorized channels. This proves noninterference at the kernel level.
-
Binary correctness: The proof chain extends from the abstract spec through a Haskell executable model, then to the C implementation, and finally to the compiled ARM binary (via the verified CAmkES/CompCert chain or translation validation against GCC output). On ARM, the compiled binary is proven to behave as the C source specifies.
The Verification Chain
Abstract Specification (Isabelle/HOL)
|
| refinement proof
v
Executable Specification (Haskell)
|
| refinement proof
v
C Implementation (10,000 lines of C)
|
| translation validation / CompCert
v
ARM Binary
Each refinement step proves that the lower-level implementation is a correct realization of the higher-level spec. The Haskell model serves as an “executable spec” – it’s precise enough to run but abstract enough to reason about.
Properties Verified
- No null pointer dereferences – a consequence of functional correctness.
- No buffer overflows – all array accesses are proven in-bounds.
- No arithmetic overflow – all integer operations are proven safe.
- No use-after-free – memory management correctness is proven.
- No memory leaks (in the kernel) – all allocated memory is accounted for.
- No undefined behavior – the C code is proven to avoid all UB.
- Capability enforcement – objects are only accessible through valid capabilities, and capabilities cannot be forged.
- Authority confinement – proven that authority does not leak beyond what capabilities allow.
Practical Implications
What verification buys you:
- Eliminates all implementation bugs in the verified code. Not “most bugs” or “common bug classes” – literally all of them, for the verified configuration.
- The security properties (integrity, confidentiality) hold absolutely, not probabilistically.
- Makes the kernel trustworthy as a separation kernel / isolation boundary.
What verification does NOT cover:
- The specification itself could be wrong (it could specify the wrong behavior). Verification proves “code matches spec,” not “spec is correct.”
- Hardware must behave as modeled. The proof assumes a correct CPU, correct memory, no physical attacks. DMA from malicious devices can break isolation unless an IOMMU is used (and IOMMU management is proven correct).
- Only the verified configuration is covered. seL4 has unverified configurations (e.g., SMP, RISC-V, certain platform features). Using unverified features voids the proof.
- Performance-critical code paths (like the IPC fastpath) were initially outside the verification boundary, though significant progress has been made on verifying them.
- The bootloader and hardware initialization code are outside the proof boundary.
- Compiler correctness: on x86, the proof trusts GCC. On ARM, binary verification closes this gap.
Design Constraints Imposed by Verification
The requirement of formal verification has profoundly shaped seL4’s design:
-
Small kernel: ~10,000 lines of C. Every line must be verified, so the kernel is as small as possible. Drivers, file systems, networking – everything lives in user space.
-
No dynamic memory allocation in the kernel: The kernel does not have a general-purpose heap. All kernel memory is pre-allocated and managed through typed capabilities (Untyped memory). This eliminates an entire class of verification complexity (heap reasoning).
-
No concurrency in the kernel: seL4 runs the kernel as a single- threaded “big lock” model (interrupts disabled in kernel mode). SMP is handled by running independent kernel instances on each core with explicit message passing between them (the “clustered multikernel” approach), or by using a big kernel lock (the current SMP approach, which is NOT covered by the verification proof).
-
C implementation: Written in a restricted subset of C that is amenable to Isabelle/HOL reasoning. No function pointers (mostly), no complex pointer arithmetic, no compiler-specific extensions. This makes the code more rigid than typical C but provable.
-
Fixed system call set: The kernel API is small and fixed. Adding a new syscall requires extending the proofs – a major effort.
-
Platform-specific verification: The proof is per-platform. ARM was verified first; x86 verification came later with additional effort. Each new platform requires new proofs.
2. Capability Transfer Model
Core Concepts
seL4’s capability model descends from the EROS/KeyKOS tradition but with significant innovations driven by formal verification requirements.
Kernel Objects: Everything the kernel manages is an object: TCBs (thread control blocks), endpoints (IPC channels), CNodes (capability storage), page tables, frames, address spaces (VSpaces), untyped memory, and more. The kernel tracks the exact type and state of every object.
Capabilities: A capability is a reference to a kernel object combined with access rights. Capabilities are stored in kernel memory, never directly accessible to user space. User space refers to capabilities by position in its capability space.
CSpaces, CNodes, and CSlots
CSlot (Capability Slot): A single storage location that can hold one capability. A CSlot is either empty or contains a capability (object pointer
- access rights + badge).
CNode (Capability Node): A kernel object that is a power-of-two-sized
array of CSlots. A CNode with 2^n slots has a “guard” and a “radix” of
n. CNodes are the building blocks of the capability addressing tree.
CSpace (Capability Space): The complete capability namespace of a thread. A CSpace is a tree of CNodes, rooted at the thread’s CSpace root (a CNode pointed to by the TCB). Capability lookup traverses this tree.
Thread's TCB
|
+-- CSpace Root (CNode, 2^8 = 256 slots)
|
+-- slot 0: cap to Endpoint A
+-- slot 1: cap to Frame X
+-- slot 2: cap to another CNode (2^4 = 16 slots)
| |
| +-- slot 0: cap to Endpoint B
| +-- slot 1: empty
| +-- ...
+-- slot 3: empty
+-- ...
Capability Addressing (CPtr and Depth)
A CPtr (Capability Pointer) is a word-sized integer used to name a capability within a thread’s CSpace. It is NOT a memory pointer – it is an index that the kernel resolves by walking the CNode tree.
Resolution works bit-by-bit from the most significant end:
- Start at the CSpace root CNode.
- The CNode’s guard is compared against the corresponding bits of the CPtr. If they don’t match, the lookup fails. Guards allow sparse addressing without allocating huge CNode arrays.
- The next
radixbits of the CPtr are used as an index into the CNode array. - If the slot contains a CNode capability, recurse: consume the next bits of the CPtr to walk deeper.
- If the slot contains any other capability, the lookup is complete.
- The depth parameter in the syscall tells the kernel how many bits of the CPtr to consume. This disambiguates between “stop at this CNode cap” and “descend into this CNode.”
Example: A CPtr of 0x4B with a two-level CSpace:
- Root CNode: guard = 0, radix = 4 (16 slots)
- Bits [7:4] = 0x4 -> index into root CNode slot 4
- Slot 4 contains a CNode cap: guard = 0, radix = 4 (16 slots)
- Bits [3:0] = 0xB -> index into second-level CNode slot 11
- Slot 11 contains an Endpoint cap -> lookup complete
Flat Table vs. Hierarchical CSpace
seL4’s hierarchical CSpace has significant implications:
Advantages of hierarchical:
- Sparse capability spaces without wasting memory. A process can have a huge CPtr range with only a few CNodes allocated.
- Subtree delegation: a parent can give a child a CNode cap that grants access to a subset of capabilities. The child can manage its own subtree without affecting the parent’s.
- Guards compress address bits, allowing efficient encoding of large capability identifiers.
Disadvantages of hierarchical:
- Lookup is slower than a flat array index – multiple memory indirections per resolution.
- More complex kernel code (and more complex verification).
- User space must explicitly manage CNode allocation and CSpace layout.
capOS comparison: capOS uses a flat Vec<Option<Arc<dyn CapObject>>>
indexed by CapId (u32). The shared Arc lets a single kernel capability
back multiple per-process slots, which is what makes cross-process IPC work
when another service resolves its CapRef via CapSource::Service. The flat
layout is simpler and faster for lookup (single array index), but cannot
support sparse addressing or subtree delegation.
For capOS’s research goals, the flat approach is adequate initially. If
capOS needs hierarchical delegation later (e.g., a supervisor delegating
a subset of caps to a child without copying), it could add a level of
indirection without adopting seL4’s full tree model.
Capability Operations
seL4 provides these operations on capabilities:
Copy: Duplicate a capability from one CSlot to another. Both the source and destination must be in the caller’s CSpace (or the caller must have CNode caps to the relevant CNodes). The new cap has the same authority as the original, minus any rights the caller chooses to strip.
Mint: Like Copy, but also sets a badge on the new capability. A badge is a word-sized integer embedded in the capability that is delivered to the receiver when the capability is used. Badges allow a server to distinguish which client is calling – each client gets a differently-badged cap to the same endpoint, and the server sees the badge on each incoming message.
Move: Transfer a capability from one CSlot to another. The source slot becomes empty. This is an atomic transfer of authority.
Mutate: Move + modify rights or badge in one operation.
Delete: Remove a capability from a CSlot, making it empty.
Revoke: Delete a capability AND all capabilities derived from it. This is the most powerful operation – it allows a parent to withdraw authority it granted to children, transitively.
Capability Derivation and the CDT
seL4 tracks a Capability Derivation Tree (CDT) – a tree recording which capability was derived from which. When capability A is copied or minted to produce capability B, B becomes a child of A in the CDT.
Revoke(A) deletes all descendants of A in the CDT but leaves A itself.
This gives the holder of A the power to revoke all authority derived from
their own authority.
The CDT is critical for clean revocation but adds significant kernel complexity. It requires maintaining a tree structure across all capability copies throughout the system.
Untyped Memory and Retype
One of seL4’s most distinctive features is that the kernel never allocates
memory on its own. All physical memory is initially represented as
Untyped Memory capabilities. To create any kernel object (endpoint, CNode,
TCB, page frame, etc.), user space must invoke the Untyped_Retype operation
on an untyped cap, which carves out a portion of the untyped memory and
creates a new typed object.
This means:
- User space (specifically, the root task or a memory manager) controls all memory allocation.
- The kernel has zero internal allocation – all memory it uses comes from retyped untypeds.
- Memory exhaustion is impossible in the kernel – if a syscall needs memory, user space must have provided it in advance via retype.
- Revoke on an untyped cap destroys ALL objects created from it, reclaiming the memory. This is the mechanism for wholesale cleanup.
3. IPC Fastpath
Overview
seL4’s IPC is synchronous and endpoint-based. An endpoint is a rendezvous point: the sender blocks until a receiver is ready, or vice versa. There is no buffering in the kernel (unlike Mach ports or Linux pipes).
The IPC fastpath is a highly optimized code path for the common case of a short synchronous call/reply between two threads. It is one of seL4’s signature performance features.
How the Fastpath Works
When thread A calls seL4_Call(endpoint_cap, msg):
-
Capability lookup: Resolve the CPtr to find the endpoint cap. In the fastpath, this is optimized to handle the common case of a direct CSlot lookup (single-level CSpace, no guard traversal needed).
-
Receiver check: Is there a thread waiting on this endpoint? If yes, the fastpath applies. If no (receiver isn’t ready), fall to the slowpath which queues the sender.
-
Direct context switch: Instead of the normal path (save sender registers -> return to scheduler -> pick receiver -> restore receiver registers), the fastpath performs a direct register transfer:
- Save the sender’s register state into its TCB.
- Copy the message registers (a small number, typically 4-8 words) from the sender’s physical registers directly into the receiver’s TCB (or leave them in registers if possible).
- Load the receiver’s page table root (vspace) into CR3/TTBR.
- Switch to the receiver’s kernel stack.
- Restore the receiver’s register state.
- Return to user mode as the receiver.
This is a direct context switch – the kernel goes directly from the sender to the receiver without passing through the scheduler. The IPC operation IS the context switch.
-
Reply cap: The sender’s reply cap is set up so the receiver can reply. In the classic (non-MCS) model, a one-shot reply capability is placed in the receiver’s TCB. The receiver calls
seL4_Reply(msg)to send the response directly back.
Performance Characteristics
seL4 IPC is among the fastest measured:
- ARM (Cortex-A9): ~240 cycles for a Call+Reply round-trip (including two privilege transitions, a full context switch, and message transfer).
- x86-64: ~380-500 cycles for a Call+Reply round-trip depending on hardware generation.
- Message size: The fastpath handles small messages (fits in registers). Longer messages require copying from IPC buffer pages and take the slowpath.
For comparison:
- Linux
pipeIPC: ~5,000-10,000 cycles for a round-trip. - Mach IPC (macOS XNU): ~3,000-5,000 cycles.
- L4/Pistachio: ~700-1,000 cycles (seL4 improved on this).
Fastpath Constraints
The fastpath is only taken when ALL of these conditions hold:
- The operation is
seL4_CallorseL4_ReplyRecv(the two most common IPC operations). - The message fits in message registers (no extra caps, no long messages that require the IPC buffer).
- The capability lookup is “simple” – single-level CSpace, direct slot lookup, no guard bits to check.
- There IS a thread waiting at the endpoint (no need to block the sender).
- The receiver is at sufficient priority (in the non-MCS configuration, higher priority than any other runnable thread – or in MCS, the scheduling context can be donated).
- No capability transfer is happening in this message.
- Certain bookkeeping conditions are met (no pending operations on either thread, no debug traps, etc.).
When any condition fails, the kernel falls through to the slowpath, which handles the general case correctly but with more overhead (~5-10x slower than the fastpath).
Direct Switch Mechanics
The key insight is: when thread A calls thread B synchronously, A is going to block until B replies. There is no scheduling decision to make – the only correct action is to run B immediately. So the kernel skips the scheduler entirely:
Thread A (running) Kernel Thread B (blocked on recv)
| | |
| seL4_Call(ep, msg) ---> | |
| | [fastpath] |
| | Save A's regs |
| | Copy msg A -> B |
| | Switch page tables |
| | Restore B's regs |
| | ---------------------->|
| | | [running, processes msg]
| | |
| | <--- seL4_Reply(reply) |
| | [fastpath again] |
| | Save B's regs |
| | Copy reply B -> A |
| | Switch page tables |
| | Restore A's regs |
| <-----------------------| |
| [running, has reply] | |
The entire round-trip involves exactly two kernel entries and two context switches, with no scheduler invocation.
Implications
-
RPC is the natural IPC pattern: seL4’s IPC is optimized for the client-server call/reply pattern. Fire-and-forget or multicast patterns require different mechanisms (notifications, shared memory).
-
Notifications: For async signaling (like interrupts or events), seL4 provides notification objects – a lightweight word-sized bitmask that can be signaled and waited on without message transfer. These are separate from endpoints.
-
Shared memory for bulk transfer: IPC messages are small (register- sized). For large data transfers, the standard pattern is: set up shared memory, then use IPC to synchronize. This is explicit – the kernel doesn’t transparently copy large buffers.
4. CNode/CSpace Architecture in Detail
CNode Structure
A CNode object is a contiguous array of CSlots in kernel memory. The size is always a power of two. The kernel metadata for a CNode includes:
- Radix bits: log2 of the number of slots (e.g., radix=8 means 256 slots).
- Guard value: a bit pattern that must match the CPtr during resolution.
- Guard bits: the number of bits in the guard.
The total bits consumed during resolution of one CNode level is:
guard_bits + radix_bits.
Multi-Level Resolution Example
Consider a two-level CSpace:
Root CNode: guard=0 (0 bits), radix=8 (256 slots)
Slot 5 -> CNode B: guard=0x3 (2 bits), radix=6 (64 slots)
Slot 42 -> Endpoint X
To reach Endpoint X with a 16-bit CPtr at depth 16:
- CPtr = 0b 00000101 11 101010
- Root CNode consumes 8 bits: 00000101 = 5 -> Slot 5 (CNode B cap)
- CNode B guard: next 2 bits = 11 -> matches guard 0x3 -> OK
- CNode B radix: next 6 bits = 101010 = 42 -> Slot 42 (Endpoint X)
- Total bits consumed: 8 + 2 + 6 = 16 = depth -> resolution complete
CSpace Layout Strategies
Flat: One large root CNode with radix=N, no sub-CNodes. Simple, fast lookup (one level). Wastes memory if the CPtr space is sparse.
Two-level: Small root CNode pointing to sub-CNodes. Common for processes that need moderate capability counts.
Deep: Many levels. Useful for delegation: a supervisor gives a child a cap to a sub-CNode, and the child manages its own CSpace subtree below that point.
Comparison with capOS’s Flat Table
| Aspect | seL4 CSpace | capOS CapTable |
|---|---|---|
| Structure | Tree of CNodes | Flat Vec<Option<Arc<dyn CapObject>>> |
| Lookup cost | O(depth) memory indirections | O(1) array index |
| Sparse support | Yes (guards + tree) | No (dense array, holes via free list) |
| Subtree delegation | Yes (grant CNode cap) | No |
| Memory overhead | CNode objects are power-of-2 | Exact-sized Vec |
| Complexity | High (bit-level CPtr resolution) | Low |
| Capability identity | Position in CSpace | CapId (u32 index) |
| Verification burden | Very high | N/A (Rust safety) |
5. MCS (Mixed-Criticality Systems) Scheduling
Background
The original seL4 scheduling model is a simple priority-preemptive scheduler with 256 priority levels and round-robin within each level. This model has a known flaw: priority inversion through IPC. When a high-priority thread calls a low-priority server, the reply might be delayed indefinitely by medium-priority threads preempting the server. The classic solution (priority inheritance) is complex to verify and doesn’t compose well.
The MCS extensions redesign scheduling to solve this and provide temporal isolation.
Key Concepts
Scheduling Context (SC): A new kernel object that represents the “right to execute on a CPU.” An SC holds:
- A budget (microseconds of CPU time per period)
- A period
- A priority
- Remaining budget in the current period
A thread must have a bound SC to be runnable. Without an SC, a thread cannot execute regardless of its priority.
Reply Object: In the MCS model, the one-shot reply capability from classic seL4 is replaced by an explicit Reply kernel object. When thread A calls thread B:
- A’s scheduling context is donated to B.
- A reply object is created to hold A’s return path.
- B now runs on A’s scheduling context (A’s priority and budget).
- When B replies, the SC returns to A.
This solves priority inversion: the server (B) inherits the caller’s priority and budget automatically.
Passive servers: A server thread can exist without its own SC. It only becomes runnable when a client donates an SC via the Call operation. When it replies, it becomes passive again. This is powerful:
- No CPU time is “reserved” for idle servers.
- The server executes on the client’s budget – the client pays for the work it requests.
- Multiple clients can call the same passive server; each brings its own SC.
Temporal Isolation
MCS SCs provide temporal fault isolation:
- Each SC has a fixed budget/period. A thread cannot exceed its budget in any period. When the budget expires, the thread is descheduled until the next period begins.
- This is enforced by hardware timer interrupts – the kernel programs the timer to fire when the current SC’s budget expires.
- A misbehaving (or compromised) component cannot starve other components because its SC bounds its CPU consumption.
- This works even across IPC: if client A calls server B with A’s SC, the combined execution of A+B is bounded by A’s budget.
Comparison with capOS’s Scheduler
capOS currently has a round-robin scheduler (kernel/src/sched.rs) with no
priority levels and no temporal isolation:
#![allow(unused)]
fn main() {
struct Scheduler {
processes: BTreeMap<Pid, Process>,
run_queue: VecDeque<Pid>,
current: Option<Pid>,
}
}
Timer preemption, cap_enter blocking waits, Endpoint IPC, and a baseline
direct IPC handoff are implemented. The MCS model is relevant for the next
scheduling step because the same priority inversion problem arises when a
high-priority client calls a low-priority server through a capability.
6. Relevance to capOS
6.1 Formal Verification
Applicability: Low in the near term. seL4’s verification is done in Isabelle/HOL over C code, which doesn’t transfer to Rust. However, the constraints that verification imposed are valuable design guidance:
- Minimal kernel: seL4’s ~10K lines of C demonstrate how little code a microkernel actually needs. capOS should resist adding kernel features and instead move them to user space.
- No kernel heap allocation on the critical path: seL4’s “untyped memory” approach where user space provides all memory is worth studying. capOS has removed the earlier allocation-heavy synchronous ring dispatch path, but it still uses owned kernel objects and preallocated scratch rather than a user-supplied untyped-memory model.
- No kernel concurrency: seL4 avoids kernel-level concurrency entirely
(SMP uses separate kernel instances or a big lock). capOS currently uses
spin::Mutexaround the scheduler and capability tables. The seL4 approach suggests this is acceptable until/unless per-CPU kernel instances are needed.
Rust alternative: Rust’s type system provides memory safety guarantees that overlap with some of seL4’s verified properties (no buffer overflows, no use-after-free, no null dereference in safe code). This is not a substitute for functional correctness proofs, but it significantly raises the bar compared to unverified C. Ongoing research in Rust formal verification (e.g., Prusti, Creusot, Verus) may eventually enable seL4-style proofs over Rust kernels.
6.2 Capability Model
CNode tree vs. flat table: capOS’s flat CapTable is the right choice
for now. seL4’s CNode tree exists to support delegation (granting a subtree
of your CSpace to a child) and sparse addressing. capOS’s current model
gives each process its own independent flat table and now supports
manifest-provided caps plus explicit copy/move transfer descriptors through
Endpoint IPC. If capOS later needs fine-grained delegation (a parent granting
access to a subset of its caps without copying), it can add a level of
indirection:
Option A: Proxy capability objects that forward to the parent's table
Option B: A two-level table (small root array -> larger sub-arrays)
Option C: Shared capability objects with refcounting
Badge/Mint pattern: seL4’s badge mechanism was initially applied to capOS as endpoint receiver metadata: multiple clients could share one endpoint while the server saw a word-sized caller tag. capOS implemented that substrate by adding badge metadata to capability references and hold edges; endpoint CALL delivery reported the invoked hold badge to the receiver, and copy/move transfer preserved badge metadata.
That model is now historical. Badge-as-service-identity was rejected after
spawn and shell paths exposed delegated-client relabeling hazards. The active
direction is session-bound invocation context: endpoint metadata may remain as
internal receiver metadata or hostile-test fixture, but normal shared-service
identity should come from process session context, broker-granted service
facets, and privacy-bounded disclosure. See
docs/proposals/rejected-endpoint-badges-proposal.md and
docs/proposals/session-bound-invocation-context-proposal.md.
Current ring SQEs carry cap id and method id separately. The cap table stores badge and transfer-mode metadata alongside the object reference:
#![allow(unused)]
fn main() {
struct CapEntry {
object: Arc<dyn CapObject>,
badge: u64,
transfer_mode: CapTransferMode,
}
}
Revocation (CDT): seL4’s Capability Derivation Tree is its most complex internal structure. For capOS, full CDT-style transitive revocation is probably overkill initially. The service-architecture proposal already identifies simpler alternatives:
- Generation counters: Each capability has a generation number. Bumping the generation invalidates all references without traversing a tree.
- Proxy caps: A proxy object that can be invalidated by its creator. Callers hold the proxy, not the real capability.
- Process-lifetime revocation: When a process dies, all caps it held are automatically invalidated (seL4 does this too, but the CDT allows more fine-grained revocation within a living process).
Untyped memory: seL4’s “no kernel allocation” model via untyped memory
is elegant but probably too heavyweight for capOS’s current stage. The key
takeaway is the principle: user space should control resource allocation
as much as possible. capOS’s FrameAllocator capability already moves frame
allocation authority into the capability model.
6.3 IPC Design
This is the most directly actionable area for capOS’s Stage 6.
seL4’s model (synchronous rendezvous + direct switch) vs. capOS’s model (async rings + Cap’n Proto wire format):
| Aspect | seL4 | capOS |
|---|---|---|
| IPC primitive | Synchronous endpoint | Async submission/completion rings |
| Message format | Untyped words in registers | Cap’n Proto serialized messages |
| Bulk transfer | Shared memory (explicit) | TBD (copy in kernel or shared memory) |
| Message size | Small (register-sized, ~4-8 words) | Variable (up to 64KB currently) |
| Scheduling integration | Direct switch (caller -> callee) | Baseline direct IPC handoff implemented |
| Batching | No (one message per syscall) | Yes (io_uring-style ring) |
Key lessons from seL4’s IPC for capOS:
-
Direct switch for synchronous RPC: Even with async rings, capOS needs a synchronous fast path. The baseline single-CPU direct IPC handoff is implemented for the case where process A calls an Endpoint and process B is blocked waiting in RECV. Future work is register payload transfer and measured fastpath tuning.
-
Register-based message transfer for small messages: seL4 avoids copying message bytes through kernel buffers for small messages by transferring them through registers during the context switch. capOS currently moves serialized payloads through ring buffers and bounded kernel scratch. For cross-process IPC, minimizing copies is critical. Options:
- Small messages (<64 bytes) could be transferred in registers during direct switch.
- Large messages could use shared memory regions (mapped into both address spaces) with IPC used only for synchronization.
- The io_uring-style rings are already shared memory – the submission and completion ring buffers could potentially be mapped into both the caller’s and callee’s address spaces for zero-copy IPC.
-
Separate mechanisms for sync and async: seL4 uses endpoints for synchronous IPC and notification objects for async signaling. capOS’s io_uring approach inherently supports batched async operations, but the common case of a simple RPC call-and-wait should have a fast synchronous path too. The two mechanisms complement each other.
-
Notifications for interrupts and events: seL4’s notification objects (lightweight bitmask signal/wait) map well to capOS’s interrupt delivery model. When a hardware interrupt fires, the kernel signals a notification object, and the driver thread waiting on that notification wakes up. This is cleaner than delivering interrupts as full IPC messages.
The Cap’n Proto dimension: capOS’s use of Cap’n Proto wire format for capability messages is a significant divergence from seL4’s untyped word arrays. Tradeoffs:
- Pro: Type safety, schema evolution, language-neutral interfaces, built-in serialization/deserialization, native support for capability references in messages (Cap’n Proto has a “capability table” concept in its RPC protocol).
- Con: Serialization overhead. Even Cap’n Proto’s zero-copy format requires pointer validation and bounds checking that seL4’s raw register transfer does not. For very hot IPC paths, this overhead may be significant.
- Mitigation: For the hot path, capOS could define a “small message” format that bypasses full capnp serialization – just a few raw words, similar to seL4’s register message. Fall back to full capnp for larger or more complex messages.
6.4 MCS Scheduling
Priority donation via IPC: Directly relevant when capOS implements cross-process capability calls. If process A (high priority) calls a capability in process B (low priority), B needs to run at A’s priority to avoid inversion. The seL4 MCS approach of “donating” the scheduling context with the IPC message is clean and composable.
For capOS, the io_uring model complicates this slightly: if submissions are batched, which submitter’s priority should the server inherit? Options:
- Inherit the highest priority among pending submissions.
- Each submission carries its own priority/scheduling context.
- Use the synchronous fast-path (with donation) for priority-sensitive calls, and the async ring for bulk/background operations.
Passive servers: The MCS concept of servers that only consume CPU when called (by borrowing the caller’s scheduling context) maps well to capOS’s capability-based services. A network stack server that only runs when a client sends a request, consuming the client’s CPU budget, is a natural fit for capOS’s service architecture.
Temporal isolation: Budget/period enforcement prevents denial-of-service between capability holders. Even if process A holds a capability to process B, A cannot cause B to consume unbounded CPU time – B’s execution on behalf of A is bounded by A’s scheduling context budget. This is worth considering for capOS’s roadmap, especially for the cloud deployment scenario where isolation is critical.
6.5 Specific Recommendations for capOS
Near-term (Stages 5-6):
-
Badge field on cap holds: Done. Manifest
CapRefbadge metadata is carried into cap-table hold edges, delivered to Endpoint receivers, and preserved across copy/move transfer. -
Implement direct-switch IPC for synchronous calls: Baseline done for Endpoint receivers blocked in RECV. Remaining work is the measured fastpath shape, especially small-message register transfer.
-
Keep the flat CapTable: seL4’s CNode tree complexity is justified by formal verification constraints and subtree delegation. capOS’s flat table is simpler and sufficient. Add proxy/wrapper capabilities for delegation rather than restructuring the table.
-
Add notification objects: A lightweight signaling primitive (word- sized bitmask, signal/wait operations) for interrupt delivery and event notification. Much cheaper than sending a full capnp message for “wake up, there’s work to do.”
Medium-term (post-Stage 6):
-
Scheduling context donation: When implementing priority scheduling, attach a scheduling context to IPC calls so servers inherit caller priority. This prevents priority inversion through the capability graph.
-
Capability rights attenuation: Add a rights mask to capability references so a parent can grant a cap with reduced permissions (e.g., read-only access to a read-write capability). seL4’s rights bits are: Read, Write, Grant (can pass the cap to others), GrantReply (can pass reply cap only).
-
Revocation via generation/epoch counters: Generation-tagged
CapIds catch stale slot reuse, and object-wide epoch revocation now invalidates current child-local grant copies without a seL4-style derivation tree.
Long-term (research directions):
-
Zero-copy IPC via shared memory: For bulk data transfer between processes, map shared memory regions (Cap’n Proto segments) into both address spaces. Use IPC only for synchronization and capability transfer. This combines seL4’s “shared memory + IPC sync” pattern with capOS’s Cap’n Proto wire format.
-
Rust-native verification: Track developments in Verus, Prusti, and other Rust verification tools. capOS’s Rust implementation is better positioned for future formal verification than a C implementation would be, given the type system guarantees already present.
-
Untyped memory model: Consider moving kernel object allocation entirely into capability-gated operations (like seL4’s Retype). User space provides memory for all kernel objects, ensuring the kernel never runs out of memory on its own. This is a significant architectural change but aligns with the “everything is a capability” principle.
Summary Table
| seL4 Feature | Maturity | capOS Equivalent | Recommended Action |
|---|---|---|---|
| Functional correctness proof | Production | None (Rust type safety) | Track Rust verification tools |
| CNode/CSpace tree | Production | Flat CapTable | Keep flat |
| Capability badge/mint | Production | Hold-edge badge | Done baseline |
| Revocation (CDT) | Production | Generation-tagged CapId; object-epoch revocation for child-local grants | Keep epoch revocation instead of adding CDT |
| Untyped memory / Retype | Production | FrameAllocator cap | Consider for hardening phase |
| Synchronous IPC endpoints | Production | Endpoint CALL/RECV/RETURN | Done baseline |
| IPC fastpath (direct switch) | Production | Direct IPC handoff | Done baseline; tune register payload later |
| Notification objects | Production | None | Implement as lightweight signal primitive |
| MCS Scheduling Contexts | Production | Round-robin scheduler | Implement SC donation for IPC |
| Passive servers | Production | None | Natural fit with cap-based services |
| Temporal isolation | Production | None | Consider for cloud deployment |
References
- Klein, G., et al. “seL4: Formal Verification of an OS Kernel.” SOSP 2009.
- seL4 Reference Manual, versions 12.1.0 and 13.0.0.
- “The seL4 Microkernel – An Introduction.” seL4 Foundation Whitepaper, 2020.
- Lyons, A., et al. “Scheduling-context capabilities: A principled, light-weight operating-system mechanism for managing time.” EuroSys 2018.
- Heiser, G., & Elphinstone, K. “L4 Microkernels: The Lessons from 20 Years of Research and Deployment.” SOSP 2016.
- seL4 source code: https://github.com/seL4/seL4
- seL4 API documentation: https://docs.sel4.systems/
seL4 HAMR: Model-Based High-Assurance Engineering
HAMR (High Assurance Modeling and Rapid engineering) is an open-source model-driven development framework for safety-critical embedded systems, developed by the SAnToS Lab at Kansas State University (lead: Prof. John Hatcliff) in collaboration with Collins Aerospace, Dornerworks, and Aarhus University. It was applied on the DARPA CASE (Cyber Assured Systems Engineering) program to generate seL4/C-based applications for UAV mission computing on the Boeing CH-47 Chinook platform.
Primary sources: Sireum HAMR (hamr.sireum.org); Belt et al., “Model-Driven Development for the seL4 Microkernel Using the HAMR Framework” (J. Systems Architecture, 2022); Hatcliff et al., “HAMR: An AADL Multi-Platform Code Generation Toolset” (ICSA 2021); Hatcliff et al., seL4 Summit 2025 keynote (“Model-based Development for seL4 Microkit/Rust with Integrated Formal Methods using HAMR”); GUMBO contract language (Galois / SAnToS Lab); seL4 Foundation CAmkES documentation.
1. What HAMR Is
HAMR operates across three development layers:
-
Architecture modeling – The system is specified in AADL (SAE AS5506) or SysMLv2. The model captures component topology, port-based communication, timing/scheduling properties (periodic, sporadic, aperiodic threads), and GUMBO behavioral contract annotations.
-
Code generation – HAMR generates deployment infrastructure (inter-component communication glue, tasking, platform configuration) and typed component skeletons developers fill with application logic. Output languages: Slang, C, and (as of 2025) Rust.
-
Verification infrastructure – GUMBO model contracts are translated to source-level contracts for Logika (Slang), Verus (Rust), and executable property-based test oracles.
Platform backends: JVM, Linux, seL4/CAmkES (C), and seL4 Microkit (Rust, 2025 work).
2. The AADL Model
AADL is an SAE international standard (AS5506) for architecture description of embedded, real-time, safety-critical systems. Key concepts relevant to the HAMR/seL4 mapping:
-
Components: hierarchically typed –
system,process,thread,device,data,subprogram,bus.threadis the unit of concurrent execution;processis the protected address space containing one or more threads. -
Ports: typed communication endpoints attached to components.
- Data port: most-recent-value semantics; sender writes, receiver reads at next dispatch.
- Event port: queued notification with no data payload.
- Event data port: queued notification with a typed data payload.
-
Connections: directed edges between compatible ports that define the data-flow and event-flow topology of the system. Connections are typed and directional; the model enforces that only compatible port kinds connect.
-
Properties: attach timing, scheduling, size, and other non-functional attributes to components and connections (e.g.
Dispatch_Protocol => Periodic,Period => 10 ms,Queue_Size => 8). -
Behavior Annex (SAE AS5506/3): an optional state-machine sub-language for attaching internal behavioral specifications to components, formalizing the implicit execution semantics of threads.
-
GUMBO: a contract language (developed by Galois and KSU) that extends AADL with
requires/guarantees/computeclauses attached to component implementations, serving as the model-level precondition/postcondition and data-invariant language. GUMBO integrates with the AADL Behavior Annex and is translated by HAMR into Slang/Logika contracts or Verus proof obligations.
3. The seL4/CAmkES Pipeline
HAMR starts from the AADL instance model (as produced by OSATE, the open-source AADL editor) and generates:
3.1 CAmkES Topology Specification
HAMR generates the complete CAmkES .camkes file describing the deployment
topology. The mapping is:
| AADL concept | CAmkES / seL4 concept |
|---|---|
process component | CAmkES component (seL4 protection domain / “partition”) |
thread component | CAmkES component with seL4 domain assignment (1-to-1) |
thread scheduling domain | seL4 domain scheduler domain ID |
| Data port (sender → receiver) | CAmkES dataport (shared memory), write-only cap on sender, read-only cap on receiver |
| Event/event-data port | CAmkES notification or queue construct |
| Connection (A.out → B.in) | CAmkES connection with read/write permission split |
The key isolation invariant: CAmkES read/write permission specifications are used to configure the seL4 kernel to enforce the directionality of AADL ports. The sender component holds a write-only capability to the shared dataport; the receiver holds a read-only capability. The kernel enforces this at the capability level – no bypass is possible without a new capability grant.
3.2 Scheduling
AADL’s timing model (periodic/sporadic threads with bounded periods and deadlines) maps to the seL4 domain scheduler. Each AADL thread gets a static domain assignment. On the DARPA CASE work, time partitioning was enforced via the domain scheduler: each thread’s time slice is determined at build time from the AADL timing properties and the domain schedule is generated as part of the HAMR output.
3.3 Generated Component Skeletons
For each AADL thread, HAMR generates:
- A component skeleton with
initialize,compute(ortimeTriggered/eventTriggered), andfinalizeentry points. - Port API stubs:
get_<portName>(),put_<portName>()functions that hide CAmkES shared-memory / notification mechanics behind a typed, uniform interface. This API is identical in shape across JVM, Linux, and seL4 backends – developer code calls the same interface regardless of platform.
3.4 Slang Reference Implementation
HAMR’s C skeleton APIs are derived from the Slang reference implementation. Slang is Sireum’s safety-critical subset of Scala: immutable-by-default, bounded loops, no reflection, and a restricted type system suited to Logika verification. The Slang implementation serves as a verified reference that the C and Rust backends are expected to match semantically.
3.5 The 2025 seL4 Microkit / Rust Extension
As of the 2025 seL4 Summit work (DARPA PROVERS INSPECTA project), HAMR generates Rust component skeletons deployable in seL4 Microkit protection domains. HAMR auto-generates the Microkit system description file, developer- facing channel/notification APIs for Rust threads, and Verus contract stubs from GUMBO model annotations. This is an active development track; the C/CAmkES backend is the more mature path.
4. Verification Model
HAMR’s verification approach is layered:
-
GUMBO model contracts:
requires/guaranteesclauses on AADL components capture the intended behavioral contract at the architecture level. These are part of the model, not the code. -
Translated code contracts: HAMR translates GUMBO into Slang/Logika proof obligations or Verus specifications. The translation preserves the model-level contract’s semantic intent in the target language’s contract system.
-
Logika / Verus verification: Tools verify that the developer’s component implementation satisfies the translated contracts. Logika operates on Slang; Verus operates on Rust.
-
Property-based test oracles: HAMR also generates executable test harnesses that check GUMBO contract conformance at runtime, complementing formal verification with systematic testing.
-
seL4 kernel verification: The underlying seL4 kernel is formally verified (machine-checked proof of functional correctness in Isabelle/HOL covering integrity and confidentiality). HAMR sits above this: its generated CAmkES specification maps to a seL4 capability topology that the verified kernel enforces. The combination targets the argument that the system’s isolation structure (as modeled in AADL) is correctly realized by the verified kernel.
The assurance case HAMR targets is roughly: AADL model (structural) + GUMBO (behavioral) + Logika/Verus (code-level conformance) + seL4 (kernel-level isolation proof) → high-assurance system suitable for DO-178C / DO-331 objectives. This layered argument is the distinguishing feature versus a conventional RTOS-based development process.
5. Applicability to capOS
5.1 Where the Approaches Align
Both HAMR and capOS treat the formal interface definition as the authoritative contract layer. HAMR uses the AADL model + GUMBO contract annotations; capOS uses the Cap’n Proto schema. Both insist that the interface is the permission: in HAMR, an AADL connection determines which component can write to which port, and the generated CAmkES capability configuration enforces that topology; in capOS, holding a capability to a CapObject determines what methods a caller can invoke, and narrower capabilities enforce tighter access.
Both generate typed, platform-adapted communication glue from the interface definition. HAMR generates port API stubs and CAmkES/Microkit configuration; capOS generates (via capnpc + capos-rt) the typed method dispatch layer that clients call.
5.2 Static vs. Dynamic Capability Topology
The sharpest structural difference: HAMR produces a closed, static topology. All components, connections, and capability distributions are fixed at build time. CAmkES explicitly does not allow runtime changes – the set of components and their communication channels is defined at system configuration time and instantiated at boot. This is intentional: the full topology can be statically analyzed, and the seL4 capability distribution can be checked against a capDL (capability distribution language) model as part of the assurance case.
capOS is designed around dynamic capability routing. The kernel acts as a
capnp-rpc router; new capabilities can be forged by authorized processes,
transferred via Move/Copy grants, and held in per-process CapTables that grow
and change at runtime. The ProcessSpawner, AuthorityBroker, and
SessionManager capabilities enable runtime-created service graphs. This is
not a weakness – it is the whole point of a capability-rpc OS – but it means
the topology at any moment is not checkable against a static model.
For capOS’s current research target, the dynamic model is the right fit. For a flight-critical avionics partition, the static model is the right fit. These are different points on the assurance-vs-flexibility tradeoff.
5.3 Generated Glue vs. Manual CapObject Dispatch
In HAMR, the developer writes only application logic in initialize/
compute/finalize entry points; all communication infrastructure is
generated. The developer-visible API is uniform across backends – the same
get_altimeter() call works on JVM, Linux, and seL4.
In capOS, capability dispatch is currently manual: each CapObject
implementation handles capnp message bytes directly via match-on-method-ID.
The typed client wrappers in capos-rt abstract this for callers, but the
server-side skeleton is hand-written per capability type. HAMR’s approach
suggests an achievable improvement: if capOS had a capnpc plugin or a build
tool that generated CapObject dispatch stubs and server-side skeletons from
.capnp schemas, the authoring burden per capability type would shrink
significantly. The schema already carries everything needed to generate the
match arm, parameter decode, and return encode.
5.4 Model-Driven Partition Generation
HAMR demonstrates the utility of driving the entire partition topology – not
just per-component skeletons – from the model. The CAmkES .camkes file,
the domain schedule, the capability permission split, and the component
binaries all originate from a single AADL instance model. This is “the model
is the system” in the most literal sense.
capOS has no equivalent today. Service-graph topology is described in CUE/AADL
manifests and executed dynamically by init. For future high-assurance work
(e.g., flight-critical or safety-certified deployments), a model-driven
generation step that produces both the system.cue manifest and the capability
grant topology from a formal model would be directly applicable. The capnp
schema would serve as the interface contract (as it already does), while a
system-level architecture model would specify the instantiation and wiring.
5.5 Contract Verification Gap
HAMR demonstrates a full contract pipeline: model annotation (GUMBO) →
generated code contracts (Logika / Verus) → formal verification. capOS has no
equivalent for CapObject implementations. The .capnp schema defines the
method signatures and types, but there is no Logika/Verus-style annotation
layer for pre/postconditions on individual capability method handlers.
For a research OS this is acceptable – capOS’s assurance comes from seL4-style kernel isolation, not from verified component behavior. But the HAMR model shows what the path to component-level behavioral verification looks like when starting from a schema-as-contract baseline.
5.6 AADL vs. Cap’n Proto as the Schema Layer
AADL carries significantly more non-functional information than Cap’n Proto schemas: scheduling properties (period, deadline, dispatch protocol), port queue depths, memory footprint bounds, required hardware (device associations), and safety annex annotations. Cap’n Proto schemas carry method signatures, field types, and (via annotations) some semantic metadata, but scheduling and resource-budget properties are out of scope for the format.
For capOS’s current use – typed RPC dispatch, schema-stable ABI, and code-generation for typed clients and server stubs – Cap’n Proto is the right tool. AADL is not a replacement: it is a higher-level architecture modeling language that sits above the RPC schema layer and consumes it. A future model-driven capOS toolchain would use AADL or SysMLv2 at the system level and capnp schemas at the interface level, not choose one over the other.
6. Open Questions for Future Evaluation
-
capnpc → CapObject stub generation: Given that the capnp schema fully describes method signatures, types, and return shapes, how much of the server-side
CapObjectdispatch boilerplate could a code-gen plugin eliminate? HAMR’s generated skeletons suggest this is tractable. -
System-manifest generation from a topology model: Could a lightweight AADL-or-SysMLv2 instance model (or a CUE-native equivalent) generate the
system.cuemanifest, the initial CapTable grants, and a capDL-style verification model for the static portion of the system graph? -
GUMBO-inspired contract annotations in capnp schemas: Could capnp annotation syntax be used to attach precondition/postcondition stubs (analogous to GUMBO’s
requires/guarantees) to interface methods, enabling future Verus or Creusot verification of CapObject implementations? -
seL4 Microkit vs. CAmkES: The 2025 HAMR work migrates from CAmkES to the newer seL4 Microkit, which uses Rust and a simpler protection-domain model. If capOS ever targets seL4 as an optional verified kernel backend, Microkit + HAMR would be the current recommended entry point.
Sources
- Sireum HAMR – framework home page, pipeline overview, platform backends, GUMBO contract language reference.
- Belt et al., “Model-Driven Development for the seL4 Microkernel Using the HAMR Framework” (J. Systems Architecture, 2022) – primary journal paper on the AADL→seL4/CAmkES mapping.
- Belt et al., preprint (Loonwerks) – open-access preprint of the above.
- Hatcliff et al., “HAMR: An AADL Multi-Platform Code Generation Toolset” (ICSA 2021, Springer) – multi-platform overview paper.
- Hatcliff et al., “Model-based Development for seL4 Microkit/Rust with Integrated Formal Methods using HAMR”, seL4 Summit 2025 keynote – Rust/Microkit extension, Verus integration, Collins/Dornerworks INSPECTA application.
- seL4 Summit 2025 Abstracts – abstract for the HAMR keynote above.
- ResearchGate: HAMR to seL4 Code Generation Concepts diagram – visual of the AADL→CAmkES generation pipeline.
- GUMBO contract language (Galois) – GUMBO overview: model-level contract annotation and auto-insertion into Slang code.
- ACM SIGAda 2023: “An AADL Contract Language Supporting Integrated Model- and Code-Level Verification” – GUMBO design and integration paper.
- SAnToS Lab HAMR system-testing case studies (GitHub) – open-source example systems with GUMBO contracts and generated Logika/Slang verification targets.
- CAmkES GitHub manifest – CAmkES static-architecture model, component/connection model, capDL.
- DARPA CASE program – program context for HAMR’s real-world application.
- Ongoing seL4 Research | seL4 Foundation – includes HAMR as an active seL4 research track.
- SAE AS5506/3: AADL Behavior Model Annex – the AADL Behavior Annex standard that GUMBO extends.
Fuchsia Zircon Kernel: Research Report for capOS
Research into Zircon’s design for informing capOS capability model, IPC, virtual memory, async I/O, and interface definition decisions.
1. Handle-Based Capability Model
Overview
Zircon implements capabilities as handles. A handle is a process-local integer (similar to a Unix file descriptor) that references a kernel object and carries a bitmask of rights. The kernel maintains a per-process handle table that maps handle values to (kernel_object_pointer, rights) pairs. Processes can only interact with kernel objects through handles they hold.
There is no ambient authority in Zircon. A process cannot address kernel objects by name, path, or global ID – it must possess a handle. The initial set of handles is passed to a process at creation time by its parent (or by the component framework).
Handle Representation
Internally, a handle is:
- A process-local 32-bit integer (the “handle value”). The low two bits encode a generation counter to detect use-after-close.
- A reference to a kernel object (refcounted
Dispatcherin Zircon’s C++). - A rights bitmask (
zx_rights_t, auint32_t).
The handle table is per-process, so handle value 0x1234 in process A and
0x1234 in process B refer to completely different objects (or nothing).
Rights
Rights are a bitmask that constrain what operations a handle can perform. Key rights include:
| Right | Meaning |
|---|---|
ZX_RIGHT_DUPLICATE | Can be duplicated via zx_handle_duplicate() |
ZX_RIGHT_TRANSFER | Can be sent through a channel |
ZX_RIGHT_READ | Can read data (channel messages, VMO bytes) |
ZX_RIGHT_WRITE | Can write data |
ZX_RIGHT_EXECUTE | VMO can be mapped as executable |
ZX_RIGHT_MAP | VMO can be mapped into a VMAR |
ZX_RIGHT_GET_PROPERTY | Can query object properties |
ZX_RIGHT_SET_PROPERTY | Can modify object properties |
ZX_RIGHT_SIGNAL | Can set user signals on the object |
ZX_RIGHT_WAIT | Can wait on the object’s signals |
ZX_RIGHT_MANAGE_PROCESS | Can perform management ops on a process |
ZX_RIGHT_MANAGE_THREAD | Can manage threads |
When a syscall is invoked on a handle, the kernel checks that the handle’s
rights include the rights required by that syscall. For example,
zx_channel_write() requires ZX_RIGHT_WRITE on the channel handle.
Rights can only be reduced, never amplified. zx_handle_duplicate() takes
a rights mask and the new handle gets original_rights & requested_rights.
Handle Lifecycle
Creation: Syscalls that create kernel objects return handles. For example,
zx_channel_create() returns two handles (one for each endpoint).
zx_vmo_create() returns a VMO handle. The initial rights are defined per
object type (e.g., a new channel endpoint gets
READ|WRITE|TRANSFER|DUPLICATE|SIGNAL|WAIT).
Duplication: zx_handle_duplicate(handle, rights) -> new_handle. Creates
a second handle to the same kernel object, possibly with reduced rights. The
original is untouched. Requires ZX_RIGHT_DUPLICATE on the source handle.
Transfer: Handles are transferred through channels. When a message is
written to a channel, handles listed in the message are moved from the
sender’s handle table to a transient state inside the channel message. When the
message is read, those handles are installed into the receiver’s handle table
with new handle values. The original handle values in the sender become invalid.
Transfer requires ZX_RIGHT_TRANSFER on each handle being sent.
Replacement: zx_handle_replace(handle, rights) -> new_handle. Atomically
invalidates the old handle and creates a new one with the specified rights
(must be a subset). This avoids a window where two handles exist simultaneously
(unlike duplicate-then-close). Useful for reducing rights before transferring.
Closing: zx_handle_close(handle). Removes the handle from the process’s
table and decrements the kernel object’s refcount. When the last handle to an
object is closed, the object is destroyed (with some exceptions like the
kernel itself keeping references).
Comparison to capOS
capOS’s current CapTable maps CapId (u32) to an Arc<dyn CapObject>. The
shared Arc lets a single kernel capability (for example, a kernel:endpoint
owned by one service and referenced by another through CapSource::Service)
back multiple per-process CapTable slots for cross-process IPC. This is
conceptually similar to Zircon’s handle table, but with key differences:
| Aspect | Zircon | capOS (current) |
|---|---|---|
| Rights | Bitmask per handle | None (all-or-nothing) |
| Object types | Fixed kernel types (Channel, VMO, etc.) | Extensible via CapObject trait |
| Transfer | Move semantics through channels | Copy/move descriptors through Endpoint IPC |
| Duplication | Explicit with rights reduction | Copy transfer for transferable holds |
| Revocation | Close handle; object dies with last ref | Remove from table; no propagation |
| Interface | Fixed syscall per object type | Cap’n Proto method dispatch |
| Generation counter | Low bits of handle value | Upper bits of CapId |
Recommendations for capOS:
-
Keep method authority in typed interfaces for now. Zircon’s rights bitmask is useful for an untyped syscall surface. capOS currently uses narrow Cap’n Proto interfaces plus hold-edge transfer metadata; generic READ/WRITE flags would duplicate schema-level authority unless a concrete cross-interface need appears.
-
Handle generation counters. Implemented: capOS encodes a generation tag in the upper bits of
CapId, with lower bits selecting the table slot. This catches stale CapId use after slot reuse. -
Move semantics for transfer. Implemented for Endpoint CALL/RETURN sideband descriptors. Copy transfer remains explicit and requires a transferable source hold.
-
replaceoperation. An atomic replace (invalidate old, create new with reduced rights) is cleaner than duplicate-then-close for rights attenuation before transfer.
2. Channels
Overview
Zircon channels are the fundamental IPC primitive. A channel is a bidirectional, asynchronous message-passing pipe with two endpoints. Each endpoint is a separate kernel object referenced by a handle.
Creation and Structure
zx_channel_create(options, &handle0, &handle1) creates a channel and returns
handles to both endpoints. Each endpoint can be independently transferred to
different processes. When one endpoint is closed, the other becomes
“peer-closed” (signaled with ZX_CHANNEL_PEER_CLOSED).
Message Format
A channel message consists of:
- Data: Up to 65,536 bytes (64 KiB) of arbitrary byte payload.
- Handles: Up to 64 handles transferred with the message.
Messages are discrete and ordered (FIFO). There is no streaming or partial reads – you read a complete message or nothing.
Write and Read Syscalls
Write: zx_channel_write(handle, options, bytes, num_bytes, handles, num_handles)
- Copies
bytesinto the kernel message queue. - Moves each handle in the
handlesarray from the caller’s handle table into the message. If any handle is invalid or lacksZX_RIGHT_TRANSFER, the entire write fails and no handles are moved. - The write is non-blocking. If the peer has been closed, returns
ZX_ERR_PEER_CLOSED.
Read: zx_channel_read(handle, options, bytes, handles, num_bytes, num_handles, actual_bytes, actual_handles)
- Dequeues the next message. Copies data into
bytes, installs handles into the caller’s handle table, writing new handle values into thehandlesarray. - If the buffer is too small, returns
ZX_ERR_BUFFER_TOO_SMALLand fillsactual_bytes/actual_handlesso the caller can retry with a larger buffer. - Non-blocking by default.
zx_channel_call: A synchronous call primitive. Writes a message to the
channel, then blocks waiting for a reply with a matching transaction ID. This
is the primary mechanism for client-server RPC. The kernel optimizes this path
to avoid unnecessary scheduling: if the server thread is waiting to read, the
kernel can directly switch to it (similar to L4 IPC optimizations).
Handle Transfer Mechanics
When handles are sent through a channel:
- The kernel validates all handles (exist, have
TRANSFERright). - Handles are atomically removed from the sender’s table.
- Handle objects are stored inside the kernel message structure.
- On read, handles are inserted into the receiver’s table with fresh handle values.
- If the channel is destroyed with unread messages containing handles, those handles are closed (objects’ refcounts decremented).
This is critical: handle transfer is move, not copy. The sender loses the
handle. To keep a copy, the sender must duplicate before sending.
Signals
Each channel endpoint has associated signals:
ZX_CHANNEL_READABLE– at least one message is queued.ZX_CHANNEL_PEER_CLOSED– the other endpoint was closed.
Processes can wait on these signals using zx_object_wait_one(),
zx_object_wait_many(), or by binding to a port (see Section 4).
FIDL Relationship
Channels carry raw bytes + handles. FIDL (Section 5) provides the structured protocol layer on top: it defines how bytes are laid out (message header with transaction ID, ordinal, flags; then the payload) and how handles in the message correspond to protocol-level concepts (client endpoints, server endpoints, VMOs, etc.).
Every FIDL protocol communication happens over a channel. A FIDL “client end” is a channel endpoint handle where the client sends requests and reads responses. A “server end” is the other endpoint where the server reads requests and sends responses.
Comparison to capOS
capOS currently uses shared submission/completion rings with Endpoint objects for cross-process CALL/RECV/RETURN routing. Same-process capabilities dispatch directly through the holder’s table; cross-process Endpoint calls queue to the server ring and can trigger a direct IPC handoff when the receiver is blocked.
| Aspect | Zircon Channels | capOS |
|---|---|---|
| Topology | Point-to-point, 2 endpoints | Endpoint-routed capability calls |
| Async | Non-blocking read/write + signal waits | Shared SQ/CQ rings |
| Handle/cap transfer | Embedded in messages | Sideband transfer descriptors |
| Message format | Raw bytes + handles | Cap’n Proto serialized |
| Size limits | 64 KiB data, 64 handles | 64 KiB params (current limit) |
| Buffering | Kernel-side message queue | Endpoint queues plus per-process rings |
Recommendations for capOS:
-
Capability transfer alongside capnp messages. Zircon embeds handles as out-of-band data alongside message bytes. capOS has adopted the same separation with ring sideband transfer descriptors and result-cap records. That keeps the kernel from parsing arbitrary Cap’n Proto payload graphs.
-
Two-endpoint channels vs. Endpoint calls. Zircon’s channels are general-purpose pipes. capOS uses a lighter Endpoint CALL/RECV/RETURN model where a capability invocation is routed to the serving process rather than requiring a channel object per connection.
-
Message size limits. Zircon’s 64 KiB limit has been a pain point (large data must go through VMOs). capOS’s capnp messages naturally handle this because large data can be a separate VMO-like capability referenced in the message. Keep the per-message limit reasonable (64 KiB is a good default) and use capability references for bulk data.
3. VMARs and VMOs
Virtual Memory Objects (VMOs)
A VMO is a kernel object representing a contiguous region of virtual memory that can be mapped into address spaces. VMOs are the fundamental unit of memory in Zircon.
Types:
- Paged VMO: Backed by the page fault handler. Pages are allocated on demand. This is the default.
- Physical VMO: Backed by a specific contiguous range of physical memory. Used for device MMIO.
- Contiguous VMO: Like a paged VMO but guarantees physically contiguous pages. Used for DMA.
Key operations:
zx_vmo_create(size, options) -> handle: Create a paged VMO.zx_vmo_read(handle, buffer, offset, length): Read bytes from a VMO.zx_vmo_write(handle, buffer, offset, length): Write bytes to a VMO.zx_vmo_get_size()/zx_vmo_set_size(): Query/resize.zx_vmo_op_range(): Operations like commit (force-allocate pages), decommit (release pages back to system), cache ops.
VMOs can be read/written directly via syscalls without mapping them. This is useful for small transfers but less efficient than mapping for large data.
Copy-on-Write (CoW) Cloning
zx_vmo_create_child(handle, options, offset, size) -> child_handle
Creates a child VMO that is a CoW clone of a range within the parent. Several clone types exist:
-
Snapshot (
ZX_VMO_CHILD_SNAPSHOT): Point-in-time snapshot. Both parent and child see CoW pages. Writes to either side trigger page copies. The child is fully independent after creation – closing the parent does not affect committed pages in the child. -
Slice (
ZX_VMO_CHILD_SLICE): A window into the parent. No CoW – writes to the slice are visible through the parent and vice versa. The child cannot outlive the parent. -
Snapshot-at-least-on-write (
ZX_VMO_CHILD_SNAPSHOT_AT_LEAST_ON_WRITE): Like snapshot but allows the implementation to share unchanged pages between parent and child more aggressively (pages only diverge when written).
CoW cloning is central to how Fuchsia implements fork()-like semantics for
memory (though Fuchsia doesn’t have fork()) and how it shares immutable data
(e.g., shared libraries are CoW-cloned VMOs).
Virtual Memory Address Regions (VMARs)
A VMAR represents a contiguous range of virtual address space within a process. VMARs form a tree rooted at the process’s root VMAR, which covers the entire user-accessible address space.
Hierarchy:
Root VMAR (entire user address space)
+-- Sub-VMAR A (e.g., 0x1000..0x10000)
| +-- Mapping of VMO X at offset 0x1000
| +-- Sub-VMAR B (0x5000..0x8000)
| +-- Mapping of VMO Y at offset 0x5000
+-- Sub-VMAR C (0x20000..0x30000)
+-- Mapping of VMO Z at offset 0x20000
Key operations:
zx_vmar_map(vmar, options, offset, vmo, vmo_offset, len) -> addr: Map a VMO (or a range of it) into the VMAR at a specific offset or let the kernel choose (ASLR).zx_vmar_unmap(vmar, addr, len): Remove a mapping.zx_vmar_protect(vmar, options, addr, len): Change permissions (read/write/execute) on a mapped range.zx_vmar_allocate(vmar, options, offset, size) -> child_vmar, addr: Create a sub-VMAR.zx_vmar_destroy(vmar): Recursively unmap everything and destroy all sub-VMARs. Prevents new mappings.
ASLR: Zircon implements address space layout randomization through VMARs.
When ZX_VM_OFFSET_IS_UPPER_LIMIT or no specific offset is given, the kernel
randomizes placement within the VMAR.
Permissions: Mapping permissions (R/W/X) are constrained by the VMO
handle’s rights. A VMO handle without ZX_RIGHT_EXECUTE cannot be mapped
as executable, regardless of what the zx_vmar_map() call requests.
Why VMARs Matter
VMARs provide:
- Sandboxing within a process. A component can be given a sub-VMAR handle instead of the root VMAR, limiting where it can map memory.
- Hierarchical cleanup. Destroying a VMAR recursively unmaps everything beneath it.
- Controlled mapping. The parent decides the address space layout for child components by allocating sub-VMARs and passing only sub-VMAR handles.
Comparison to capOS
capOS currently has AddressSpace plus a VirtualMemory capability for
anonymous map/unmap/protect operations. FrameAllocator returns typed
MemoryObject ownership caps rather than raw physical frame grants, but
MemoryObject does not yet provide mapping, cloning, or zero-copy sharing.
| Aspect | Zircon | capOS (current) |
|---|---|---|
| Memory objects | VMO (paged, physical, contiguous) | Owned MemoryObject caps plus anonymous VirtualMemory mappings |
| CoW | VMO child clones (snapshot, slice) | Not implemented |
| Address space | VMAR tree | Flat AddressSpace plus VirtualMemory cap |
| Sharing | Map same VMO in multiple processes | Not implemented |
| Permissions | Per-mapping + per-handle rights | Per-page flags at mapping time |
Recommendations for capOS:
-
VMO-equivalent capability. A “MemoryObject” capability that represents a range of memory (backed by demand-paging or physical pages). This becomes the unit of sharing: pass a MemoryObject cap through IPC, and the receiver maps it into their address space. Define it in
schema/capos.capnp. -
Sub-VMAR capabilities for sandboxing. When spawning a process, instead of granting access to the full address space, grant a sub-region capability. This limits where the process can map memory.
-
CoW cloning is valuable but not urgent. The primary use case (shared libraries, fork) may not apply to capOS’s early stages. Design the VMO interface to support cloning later.
-
VMO read/write without mapping. Zircon allows reading/writing VMO contents via syscall without mapping. This is useful for small IPC data and avoids TLB pressure. Consider supporting this in capOS’s MemoryObject.
4. Async Model (Ports)
Overview
Zircon’s async I/O model is built around ports – kernel objects that
receive event packets. A port is similar to Linux’s epoll but with important
differences. It is the foundation for all async programming in Fuchsia.
Port Basics
A port is a kernel object with a queue of packets (zx_port_packet_t).
Packets arrive either from signal-based waits or from direct user queuing.
Key operations:
zx_port_create(options) -> handle: Create a port.zx_port_wait(port, deadline) -> packet: Dequeue the next packet, blocking until one is available or the deadline expires.zx_port_queue(port, packet): Manually enqueue a user packet.zx_port_cancel(port, source, key): Cancel pending waits.
Signal-Based Async (Object Wait Async)
zx_object_wait_async(object, port, key, signals, options):
This is the primary mechanism. It tells the kernel: “when object has any of
these signals asserted, deliver a packet to port with this key.”
Two modes:
- One-shot (
ZX_WAIT_ASYNC_ONCE): The wait fires once and is automatically removed. The user must re-register after handling. - Edge-triggered (
ZX_WAIT_ASYNC_EDGE): Fires each time a signal transitions from deasserted to asserted. Stays registered.
Packet Format
typedef struct zx_port_packet {
uint64_t key; // User-defined key (set during wait_async)
uint32_t type; // ZX_PKT_TYPE_SIGNAL_ONE, ZX_PKT_TYPE_USER, etc.
zx_status_t status; // Result status
union {
zx_packet_signal_t signal; // Which signals triggered
zx_packet_user_t user; // User-queued packet payload (32 bytes)
zx_packet_guest_bell_t guest_bell;
// ... other packet types
};
} zx_port_packet_t;
The signal variant includes trigger (which signals were waited on),
observed (current signal state), and a count (for edge-triggered, how many
transitions).
Async Dispatching (libasync)
Fuchsia’s userspace async library (libfidl, async-loop) provides a
higher-level event loop:
async::Loop: An event loop that owns a port and dispatches events to registered handlers.async::Wait: Wrapszx_object_wait_async()with a callback. When the signal fires, the loop calls the handler.async::Task: Runs a closure on the loop’s dispatcher.- FIDL bindings: The async FIDL bindings register channel-readable waits on the loop’s port. When a message arrives, the FIDL dispatcher decodes it and calls the appropriate protocol method handler.
The typical pattern:
loop = async::Loop()
loop.port -> zx_port_create()
// Register interest in channel readability
zx_object_wait_async(channel, loop.port, key, ZX_CHANNEL_READABLE)
// Event loop
while True:
packet = zx_port_wait(loop.port)
handler = lookup(packet.key)
handler(packet)
// Re-register if one-shot
Comparison to Linux io_uring
| Aspect | Zircon Ports | Linux io_uring |
|---|---|---|
| Model | Event notification (signals) | Operation submission/completion |
| Submission | No SQ; operations are separate syscalls | SQ ring: batch operations |
| Completion | Port packet queue | CQ ring in shared memory |
| Kernel transitions | One per wait_async + one per port_wait | One per io_uring_enter (batched) |
| Memory sharing | No shared ring buffers | SQ/CQ are mmap’d shared memory |
| Zero-copy | Not for port packets | Registered buffers, fixed files |
| Batching | No inherent batching | Core design: submit N ops, one syscall |
| Chaining | Not supported | SQE linking (sequential/parallel) |
| Scope | Signal notification only | Full I/O operations (read, write, send, recv, fsync, …) |
Key differences:
-
Ports are notification-based; io_uring is operation-based. A port tells you “something happened” (a signal was asserted), then you do separate syscalls to act on it (read the channel, accept the socket, etc.). io_uring lets you submit the actual I/O operation and the kernel does it asynchronously, returning the result in the completion ring.
-
io_uring avoids syscalls for submission. The submission queue is shared memory – userspace writes SQEs and the kernel reads them without a syscall (in polling mode) or with a single
io_uring_enter()for a batch of operations. Ports require a syscall perwait_asyncregistration. -
io_uring supports chaining. SQE linking allows dependent operations (e.g., “read from file, then write to socket”) without returning to userspace between steps.
-
Ports are simpler. The signal model is straightforward and composes well with Zircon’s object model. io_uring’s complexity (dozens of opcodes, registered buffers, fixed files, kernel-side polling) is much higher.
Performance Tradeoffs
Ports:
- Pro: Simple, well-integrated with kernel object model, easy to reason about.
- Con: Extra syscalls per operation (wait_async to register, port_wait to receive, then the actual operation syscall). At least 3 syscalls per async operation.
io_uring:
- Pro: Can batch many operations in a single syscall. Shared-memory rings avoid copies. Kernel-side polling can eliminate syscalls entirely.
- Con: Complex API surface, security attack surface (many kernel bugs have been in io_uring), complex state management.
Comparison to capOS’s Planned Async Rings
capOS plans io_uring-inspired capability rings: an SQ where userspace submits capnp-serialized capability invocations and a CQ where the kernel posts completions.
| Aspect | Zircon Ports | capOS Planned Rings |
|---|---|---|
| Submission | Separate syscalls | SQ in shared memory |
| Completion | Port packet queue (kernel-owned) | CQ in shared memory |
| Operation scope | Signal notification only | Full capability invocations |
| Batching | None | Natural (fill SQ, single syscall) |
| Wire format | Fixed packet struct | Cap’n Proto messages |
Recommendations for capOS:
-
The io_uring model is better than ports for capOS’s use case. Since every operation in capOS is a capability invocation (not just a signal notification), putting the full operation in the submission ring eliminates the extra round-trip that ports require. This is the right choice.
-
Keep a signal/notification mechanism too. Even with async rings, capOS needs a way to wait for events (e.g., “data available on this channel”, “process exited”). Consider a simple signal/wait mechanism alongside the capability rings – perhaps signal delivery goes through the CQ as a special completion type.
-
Study io_uring’s SQE linking. Chaining dependent capability calls (e.g., “read from FileStore, then write to Console”) without returning to userspace is powerful. This maps naturally to Cap’n Proto promise pipelining: “call method A on cap X, then call method B on the result’s capability” – the kernel can chain these internally.
-
Registered/fixed capabilities. io_uring has “fixed files” (registered fd set for faster lookup). capOS could have a “hot set” of capabilities pinned in the SQ context for faster dispatch (avoid per-call table lookup).
-
Completion ordering. io_uring completions can arrive out of order. capOS’s CQ should also support out-of-order completion (each SQE has a user_data tag echoed in the CQE) to enable true async pipelining.
5. FIDL (Fuchsia Interface Definition Language)
Overview
FIDL is Fuchsia’s IDL for defining protocols that communicate over channels. It serves a similar role to Cap’n Proto schemas in capOS: defining the contract between client and server.
FIDL vs. Cap’n Proto: Schema Language
FIDL example:
library fuchsia.example;
type Color = strict enum : uint32 {
RED = 1;
GREEN = 2;
BLUE = 3;
};
protocol Painter {
SetColor(struct { color Color; }) -> ();
DrawLine(struct { x0 float32; y0 float32; x1 float32; y1 float32; }) -> ();
-> OnPaintComplete(struct { num_pixels uint64; });
};
Equivalent Cap’n Proto:
enum Color { red @0; green @1; blue @2; }
interface Painter {
setColor @0 (color :Color) -> ();
drawLine @1 (x0 :Float32, y0 :Float32, x1 :Float32, y1 :Float32) -> ();
}
Key differences in the schema language:
| Feature | FIDL | Cap’n Proto |
|---|---|---|
| Unions | flexible union, strict union | Anonymous unions in structs |
| Enums | strict enum, flexible enum | enum (always strict) |
| Optionality | box<T>, nullable types | Default values, union with Void |
| Evolution | flexible keyword for forward compat | Field numbering, @N ordinals |
| Tables | table (like protobuf, sparse) | struct with default values |
| Events | -> EventName(...) server-sent | No built-in events |
| Error syntax | -> () error uint32 | Must encode in return struct |
| Capability types | client_end:P, server_end:P | interface P as field type |
FIDL’s table type is analogous to Cap’n Proto structs in terms of
evolvability (can add fields without breaking), but Cap’n Proto structs are
more compact on the wire (fixed-size inline section + pointers) while FIDL
tables use an envelope-based encoding.
Wire Format Comparison
FIDL wire format:
- Little-endian, 8-byte aligned.
- Messages have a 16-byte header:
txid(4 bytes), flags (3 bytes), magic byte (0x01), ordinal (8 bytes). - Structs are laid out inline with natural alignment and explicit padding.
- Out-of-line data (strings, vectors, tables) uses offset-based indirection via “envelopes” (inline 8-byte entry: 4 bytes num_bytes, 2 bytes num_handles, 2 bytes flags).
- Handles are out-of-band. The wire format contains
ZX_HANDLE_PRESENT(0xFFFFFFFF) orZX_HANDLE_ABSENT(0x00000000) markers where handles appear. The actual handles are in the channel message’s handle array, consumed in order of appearance in the linearized message. - Encoding is done into a contiguous byte buffer + a separate handle array, matching the channel write API.
- No pointer arithmetic. FIDL v2 uses a “depth-first traversal order” encoding where out-of-line objects are laid out sequentially. Offsets are not stored; the decoder walks the type schema to find boundaries.
Cap’n Proto wire format:
- Little-endian, 8-byte aligned (word-based).
- Messages have a segment table header listing segment sizes.
- Structs have a fixed data section + pointer section. Pointers are relative offsets (self-relative, in words).
- Uses pointer-based random access: can read any field without parsing the entire message.
- Capabilities are indexed. Cap’n Proto’s RPC protocol assigns capability table indices to interface references in messages. The actual capability (file descriptor, handle, etc.) is transferred out-of-band.
- Supports multi-segment messages (FIDL is always single-segment).
- Zero-copy read: can read directly from the wire buffer without deserialization.
Key wire format differences:
| Property | FIDL | Cap’n Proto |
|---|---|---|
| Random access | No (sequential decode) | Yes (pointer-based) |
| Zero-copy read | Partial (decode-on-access for some types) | Full (read from buffer) |
| Segments | Single contiguous buffer | Multi-segment |
| Pointers | Implicit (traversal order) | Explicit (relative offsets) |
| Size overhead | Smaller (no pointer words) | Larger (pointer section) |
| Decode cost | Must validate sequentially | Can validate lazily |
| Handle/cap encoding | Presence markers + out-of-band array | Cap table indices + out-of-band |
FIDL Capability Transfer
FIDL has first-class syntax for capability transfer in protocols:
protocol FileSystem {
Open(resource struct {
path string:256;
flags uint32;
object server_end:File;
}) -> ();
};
protocol File {
Read(struct { count uint64; }) -> (struct { data vector<uint8>:MAX; });
GetBuffer(struct { flags uint32; }) -> (resource struct { buffer zx.Handle:VMO; });
};
server_end:File– a channel endpoint where the server will serve theFileprotocol. The client creates a channel, keeps the client end, and sends the server end through this call.client_end:File– a channel endpoint for a client of theFileprotocol.zx.Handle:VMO– a handle to a specific kernel object type (VMO).- The
resourcekeyword marks types that contain handles (and thus cannot be copied, only moved).
The FIDL compiler tracks handle ownership: types containing handles are
“resource types” with move semantics. This is enforced at the language binding
level (e.g., in C++, resource types are move-only; in Rust, they implement
Drop but not Clone).
Comparison to capOS’s Cap’n Proto Usage
Cap’n Proto natively supports capability transfer through its interface
types:
interface FileSystem {
open @0 (path :Text, flags :UInt32) -> (file :File);
}
interface File {
read @0 (count :UInt64) -> (data :Data);
getBuffer @1 (flags :UInt32) -> (buffer :MemoryObject);
}
In standard Cap’n Proto RPC, file :File in the return type means “a
capability to a File interface.” The RPC system assigns a capability table
index, transfers it out-of-band, and the receiver gets a live reference to
invoke further methods.
Recommendations for capOS:
-
Use out-of-band capability transfer beside Cap’n Proto payloads. Cap’n Proto RPC has capability descriptors indexed into a capability table, but capOS currently keeps kernel transfer semantics in ring sideband records so the kernel can treat Cap’n Proto payload bytes as opaque. Promise pipelining should build on that sideband result-cap namespace rather than requiring general payload traversal in the kernel.
-
No need to switch to FIDL. Cap’n Proto’s wire format is superior for capOS’s use case:
- Random access means runtimes and services can inspect specific fields without full deserialization. The kernel should keep using bounded sideband metadata for transport decisions.
- Zero-copy read means less allocation in userspace protocol handling.
- Multi-segment messages allow avoiding large contiguous allocations.
- Promise pipelining is native to Cap’n Proto RPC, aligning with capOS’s planned async ring chaining.
-
FIDL’s
resourcekeyword is worth imitating. Mark capnp types that contain capabilities differently from pure-data types. This could be done at the schema level (Cap’n Proto already distinguishesinterfacefields) or as a convention. This enables the kernel to fast-path messages that contain no capabilities (no need to scan for capability descriptors). -
FIDL’s
tabletype for evolution. Cap’n Proto structs already support adding fields, but capOS should be aware that FIDL tables are more explicitly designed for cross-version compatibility. For system interfaces that will evolve, consider using Cap’n Proto groups or designing structs with generous ordinal spacing.
6. Synthesis: Relevance to capOS
Handle Model vs. Typed Capability Dispatch
Zircon’s handle model is untyped at the handle level – a handle is just
(object_ref, rights). The type comes from the object. All operations go through
fixed syscalls (zx_channel_write, zx_vmo_read, etc.).
capOS’s model is typed at the capability level – each capability
implements a Cap’n Proto interface with method dispatch. Operations go through
ring SQEs such as CAP_OP_CALL, with Cap’n Proto params and results carried
in userspace buffers.
Both are valid. Zircon’s approach is lower overhead (no serialization for simple
operations like vmo_read), while capOS’s approach gives uniformity (every
operation has the same wire format, enabling persistence and network
transparency).
Hybrid recommendation: For performance-critical operations (memory mapping, signal waiting), consider adding “fast-path” syscalls that bypass capnp serialization, similar to how Zircon has dedicated syscalls per object type. The capnp path remains the general mechanism and the “canonical” interface.
Async Rings vs. Ports: The Right Call
capOS’s io_uring-inspired async rings are a better fit than Zircon’s port model for a capability OS:
- Ports require separate syscalls for registration, waiting, and the actual operation. Async rings batch everything.
- Cap’n Proto’s promise pipelining maps naturally to SQE chaining.
- The shared-memory ring design avoids kernel-side queuing overhead.
However, learn from ports:
- The signal model (each object has a signal set, watchers are notified) is clean and composable. Consider making “wait for signal” a CQ event type.
zx_port_queue()(user-initiated packets) is useful for waking up event loops from user code. Support user-initiated CQ entries.
VMO/VMAR vs. capOS Memory Model
capOS should implement VMO-equivalent capabilities after the current Endpoint and transfer baseline:
- IPC already has shared rings, but bulk data still needs explicit shared memory objects.
- Capability transfer of memory regions (passing a MemoryObject cap through IPC) is the standard pattern for bulk data transfer.
- CoW cloning enables efficient process creation.
Proposed capability interfaces:
interface MemoryObject {
read @0 (offset :UInt64, count :UInt64) -> (data :Data);
write @1 (offset :UInt64, data :Data) -> ();
getSize @2 () -> (size :UInt64);
setSize @3 (size :UInt64) -> ();
createChild @4 (offset :UInt64, size :UInt64, options :UInt32) -> (child :MemoryObject);
}
interface AddressRegion {
map @0 (offset :UInt64, vmo :MemoryObject, vmoOffset :UInt64, len :UInt64, flags :UInt32) -> (addr :UInt64);
unmap @1 (addr :UInt64, len :UInt64) -> ();
protect @2 (addr :UInt64, len :UInt64, flags :UInt32) -> ();
allocateSubRegion @3 (offset :UInt64, size :UInt64) -> (region :AddressRegion, addr :UInt64);
}
FIDL vs. Cap’n Proto: Stay with Cap’n Proto
Cap’n Proto is the right choice for capOS. The advantages over FIDL:
- Language-independent standard. FIDL is Fuchsia-only. Cap’n Proto has implementations in C++, Rust, Go, Python, Java, etc.
- Zero-copy random access. The kernel can inspect message fields without full deserialization.
- Promise pipelining. Native to capnp-rpc, enabling the async ring chaining that capOS plans.
- Persistence. Cap’n Proto messages are self-describing (with schema) and suitable for on-disk storage – important for capOS’s planned capability persistence.
The one thing FIDL does better: tight integration of handle/capability metadata
in the type system (the resource keyword, client_end/server_end syntax,
handle type constraints). capOS should ensure its capnp schemas clearly
distinguish capability-carrying types and that the kernel enforces capability
transfer semantics.
Concrete Action Items for capOS
Ordered by priority and dependency:
-
Keep typed-interface authority model. Do not add a Zircon-style generic rights bitmask until a concrete method-attenuation need beats narrow wrapper capabilities and transfer-mode metadata.
-
Handle generation counters. Done: upper bits of
CapIddetect stale references. -
Design MemoryObject/SharedBuffer capability. Define and implement the shared-memory object that replaces raw-frame transfer for bulk IPC.
-
Design AddressRegion capability (Stage 5). Sub-VMAR-like sandboxing. The root VMAR handle is part of the initial capability set.
-
Capability transfer sideband. Baseline CALL/RETURN copy and move transfer is implemented; promise-pipelined result-cap mapping still needs a precise rule before pipeline dispatch lands.
-
Async rings with signal delivery. SQ/CQ capability rings are implemented for transport; notification objects and promise pipelining remain future work.
-
User-queued CQ entries (with async rings). Allow userspace to post wake-up events to its own CQ, enabling pure-userspace event loop integration.
Appendix: Key Zircon Syscall Reference
For reference, the most architecturally significant Zircon syscalls:
| Syscall | Purpose |
|---|---|
zx_handle_close | Close a handle |
zx_handle_duplicate | Duplicate with rights reduction |
zx_handle_replace | Atomic replace with new rights |
zx_channel_create | Create channel pair |
zx_channel_read | Read message + handles from channel |
zx_channel_write | Write message + handles to channel |
zx_channel_call | Synchronous write-then-read (RPC) |
zx_port_create | Create async port |
zx_port_wait | Wait for next packet |
zx_port_queue | Enqueue user packet |
zx_object_wait_async | Register signal wait on port |
zx_object_wait_one | Synchronous wait on one object |
zx_vmo_create | Create virtual memory object |
zx_vmo_read / write | Direct VMO access |
zx_vmo_create_child | CoW clone |
zx_vmar_map | Map VMO into address region |
zx_vmar_unmap | Unmap |
zx_vmar_allocate | Create sub-VMAR |
zx_process_create | Create process (with root VMAR) |
zx_process_start | Start process execution |
Used By
- Capability Model for the comparison with generation-tagged flat cap tables and rights-bit alternatives.
- Memory Management for VMO/VMAR-style separation between object backing and virtual address-space mappings.
- Go VirtualMemory Contract for commit/decommit and reservation precedent.
Genode OS Framework: Research Report for capOS
Research on Genode’s capability-based component framework, session routing, VFS architecture, and POSIX compatibility – with lessons for capOS.
1. Capability-Based Component Framework
Core Abstraction: RPC Objects
Genode’s fundamental abstraction is the RPC object. Every service in the system is implemented as an RPC object that can be invoked by clients holding a capability to it. The capability is an unforgeable reference – a kernel- protected token that names a specific RPC object and grants the holder the right to invoke its methods.
Genode supports multiple microkernels (NOVA, seL4, Fiasco.OC, a custom base-hw kernel). The capability model is consistent across all of them, though the kernel-level implementation details differ. The framework abstracts kernel capabilities into its own uniform model.
Key properties of Genode capabilities:
- Unforgeable. A capability can only be obtained by delegation from a holder or creation by the kernel. There is no mechanism to synthesize a capability from an integer or address.
- Typed. Each capability refers to an RPC object with a specific interface. The C++ type system enforces interface contracts at compile time.
- Delegatable. A capability holder can pass it to another component via RPC arguments, allowing authority to flow through the system graph.
- Revocable. Capabilities can be revoked (invalidated). When an RPC object is destroyed, all capabilities pointing to it become invalid.
Capability Types in Genode
Genode distinguishes several kinds of capabilities based on what they refer to:
-
Session capabilities. The most common type. A session capability refers to a service session – an ongoing relationship between a client and a server. Example: a
Log_sessioncapability lets a client write log messages to a specific log session on a LOG server. -
Parent capability. Every component holds an implicit capability to its parent. This is the channel through which it requests resources and sessions. The parent capability is never explicitly passed – it’s built into the component framework.
-
Dataspace capabilities. Represent shared-memory regions. A
Ram_dataspacecapability grants access to a specific region of physical memory. Dataspaces are the mechanism for bulk data transfer between components (the RPC path is for small messages and control). -
Signal capabilities. Used for asynchronous notifications. A signal source produces signals; holders of the signal capability can register handlers. Signals are Genode’s primary async notification mechanism – they don’t carry data, just wake up the receiver.
Sessions: The Service Contract
A session is the central concept of Genode’s inter-component communication. It represents an established relationship between a client component and a server component, with negotiated resource commitments.
Session lifecycle:
-
Request. A client asks its parent to create a session of a specific type (e.g.,
Gui::Session,File_system::Session,Nic::Session). The request includes a label string and optional session arguments. -
Routing. The parent routes the session request according to its policy (see Section 2). The request may traverse multiple levels of the component tree.
-
Creation. The server creates a session object, allocates resources for it (e.g., a shared-memory buffer), and returns a session capability to the client.
-
Use. The client invokes RPC methods on the session capability. The server handles the calls. Both sides can use shared dataspaces for bulk data.
-
Close. Either side can close the session. Resources committed to the session are released back.
This model is fundamentally different from Unix IPC (anonymous pipes/sockets). Every session is:
- Typed – the interface is known at compile time.
- Named – sessions carry a label used for routing and policy.
- Resource-accounted – the client explicitly donates RAM to the server via a “session quota” to fund the server-side state for this session. This prevents denial-of-service through resource exhaustion.
Resource Trading
Genode’s resource model is unique and worth studying closely. Resources (primarily RAM) flow through the component tree:
- The kernel grants a fixed RAM budget to core (the root component).
- Core grants budgets to its children (typically just init).
- Init grants budgets to its children according to the deployment config.
- Each component can donate RAM to servers when opening sessions.
The session_quota mechanism works as follows: when a client opens a
session, it specifies how much RAM it donates. This RAM transfer goes
from the client’s budget to the server’s budget. The server uses this
donated RAM to allocate server-side state for the session. When the
session closes, the RAM flows back.
This creates a closed accounting system:
- No component can use more RAM than it was granted.
- Servers don’t need their own large budgets – clients fund their sessions.
- Resource exhaustion is contained: a misbehaving client can only exhaust its own budget, not the server’s.
Capability Invocation vs. Delegation
Genode distinguishes two fundamental operations on capabilities:
Invocation: calling an RPC method on the capability. The caller sends a message to the RPC object named by the capability, the server processes it and returns a result. This is synchronous in Genode – the caller blocks until the server replies. (Asynchronous interaction uses signals and shared memory.)
Delegation: passing a capability as an argument in an RPC call. When a capability appears as a parameter or return value, the kernel transfers the capability reference to the receiving component. The receiver now holds an independent reference to the same RPC object. This is how authority propagates through the system.
Example: when a client opens a File_system::Session, the session
creation returns a session capability. If the file system server needs
to allocate memory, it calls back to the client’s RAM service using a
RAM capability that was delegated during session setup.
Capabilities in Genode RPC are transferred by the kernel during the IPC operation – the framework marshals them into a special “capability argument” slot in the IPC message, and the kernel copies the capability reference into the receiver’s capability space. This is transparent to application code: capabilities appear as typed C++ objects in the RPC interface.
2. Session Routing
The Problem Session Routing Solves
In a traditional OS, services are found via well-known names in a global namespace (D-Bus addresses, socket paths, service names). This creates ambient authority – any process can connect to any service if it knows the name.
Genode has no global service namespace. A component can only obtain sessions through its parent. The parent decides which server to route each session request to. This means:
- Service visibility is controlled structurally.
- A component can only reach services its parent explicitly allows.
- Different children of the same parent can be routed to different servers for the same service type.
Parent-Child Relationship
Every Genode component (except core) has exactly one parent. The parent:
- Created the child (spawned it with an initial set of resources).
- Intercepts all session requests from the child.
- Routes requests according to its routing policy.
- Can deny requests entirely (the child gets an error).
This creates a tree structure where authority flows downward. A child cannot bypass its parent to reach a service the parent didn’t approve.
Init’s Routing Configuration
The init process (Genode’s init) reads an XML configuration that
specifies which services to start and how to route their session requests.
This is the core of system policy.
A minimal init config:
<config>
<parent-provides>
<service name="LOG"/>
<service name="ROM"/>
<service name="CPU"/>
<service name="RAM"/>
<service name="PD"/>
</parent-provides>
<start name="timer">
<resource name="RAM" quantum="1M"/>
<provides> <service name="Timer"/> </provides>
<route>
<service name="ROM"> <parent/> </service>
<service name="LOG"> <parent/> </service>
<service name="CPU"> <parent/> </service>
<service name="RAM"> <parent/> </service>
<service name="PD"> <parent/> </service>
</route>
</start>
<start name="test-log">
<resource name="RAM" quantum="1M"/>
<route>
<service name="Timer"> <child name="timer"/> </service>
<service name="LOG"> <parent/> </service>
<!-- remaining services routed to parent by default -->
<any-service> <parent/> </any-service>
</route>
</start>
</config>
Key routing directives:
<parent/>– route to the parent (upward delegation).<child name="x"/>– route to a specific child (sibling routing).<any-child/>– route to any child that provides the service.<any-service>– catch-all for unspecified service types.
Label-Based Routing
Labels are strings attached to session requests. They carry context about who is requesting and what they want, enabling fine-grained routing decisions.
When a client requests a session, it attaches a label. As the request traverses the routing tree, each intermediate component (typically init) can prepend its own label. By the time the request reaches the server, the label encodes the full path through the component tree.
Example: a component named my-app inside an init subsystem named
apps requests a File_system session with label "data". The
composed label arriving at the file system server is:
"apps -> my-app -> data".
The server can use this label for:
- Access control. Grant different permissions based on who is asking.
- Isolation. Store data in different directories per client.
- Logging. Identify which component generated a message.
Label-based routing in init config:
<start name="fs">
<provides> <service name="File_system"/> </provides>
<route> ... </route>
</start>
<start name="app-a">
<route>
<service name="File_system" label="data">
<child name="fs"/>
</service>
<service name="File_system" label="config">
<child name="config-fs"/>
</service>
</route>
</start>
Here, app-a’s file system requests are split: requests labeled "data"
go to one server, requests labeled "config" go to another. The
application code is unchanged – the routing is entirely a deployment
decision.
Routing as Policy
The critical insight is that routing IS access control. There is no separate permission system. If a component’s route config doesn’t include a path to a network service, that component has no network access – period. It cannot discover the network service because it has no way to name it.
This replaces:
- Firewall rules (routing controls which network services are reachable)
- File permissions (routing controls which file system sessions are available)
- Process isolation policies (routing controls everything)
The routing configuration is equivalent to a whitelist of allowed service connections for each component. Adding or removing access means editing the init config, not modifying the component’s code or the server’s access control lists.
Dynamic Routing and Sculpt
In the static case (Genode’s test scenarios), routing is defined once in init’s config. In Sculpt OS (Section 6), the routing configuration can be modified at runtime, allowing users to install applications and connect them to services dynamically.
3. VFS on Top of Capabilities
The VFS Layer
Genode’s VFS (Virtual File System) is a library-level abstraction, not a kernel feature. It provides a path-based file-like interface implemented as a plugin architecture within a component’s address space.
The VFS exists because many existing applications (and libc) expect file-like access patterns. Rather than forcing all code to use Genode’s native session/capability model, the VFS provides a translation layer.
Architecture:
Application code
|
| POSIX: open(), read(), write()
v
libc (Genode's port of FreeBSD libc)
|
| VFS API: vfs_open(), vfs_read(), vfs_write()
v
VFS library (in-process)
|
| Plugin dispatch based on mount point
v
VFS plugins (in-process)
|
+--> ram_fs plugin (in-memory file system)
+--> <fs> plugin (delegates to File_system session)
+--> <terminal> plugin (delegates to Terminal session)
+--> <log> plugin (delegates to LOG session)
+--> <nic> plugin (delegates to Nic session, for socket layer)
+--> <block> plugin (delegates to Block session)
+--> <dir> plugin (combines subtrees)
+--> <tar> plugin (read-only tar archive)
+--> <import> plugin (populate from ROM)
+--> <pipe> plugin (in-process pipe pair)
+--> <rtc> plugin (system clock)
+--> <zero> plugin (/dev/zero equivalent)
+--> <null> plugin (/dev/null equivalent)
...
VFS Plugin Architecture
Each VFS plugin is a dynamically loadable library (or statically linked module) that implements a file-system-like interface. Plugins handle:
- open/close – create/destroy file handles
- read/write – data transfer
- stat – metadata queries
- readdir – directory enumeration
- ioctl – device-specific control (limited)
Plugins are composed by the VFS configuration, which is XML embedded in the component’s config:
<config>
<vfs>
<dir name="dev">
<log/>
<null/>
<zero/>
<terminal name="stdin" label="input"/>
<inline name="rtc">2024-01-01 00:00</inline>
</dir>
<dir name="tmp"> <ram/> </dir>
<dir name="data"> <fs label="persistent"/> </dir>
<dir name="socket"> <lxip dhcp="yes"/> </dir>
</vfs>
<libc stdout="/dev/log" stderr="/dev/log" stdin="/dev/stdin"
rtc="/dev/rtc" socket="/socket"/>
</config>
This config creates a virtual filesystem tree:
/dev/log– writes go to the LOG session/dev/null,/dev/zero– standard synthetic files/dev/stdin– reads from a Terminal session/tmp/– in-memory filesystem (RAM-backed)/data/– delegates to a File_system session labeled “persistent”/socket/– network sockets via lwIP stack (in-process)
The <fs> plugin is the bridge from VFS to Genode’s capability world.
When the application does open("/data/foo.txt"), the <fs> plugin
translates this into a File_system::Session RPC call to the external
file system server that the component’s routing connects to.
File System Components
Genode has several file system server components:
- ram_fs – in-memory file system server. Multiple components can
share files through it by routing their
File_systemsessions to it. - vfs_server (previously
vfs) – a file system server backed by the VFS plugin architecture itself. This enables recursive composition: a VFS server can mount another VFS server. - fatfs – FAT file system driver over a Block session.
- ext2_fs – ext2/3/4 via a ported Linux implementation (rump kernel).
- store_fs / recall_fs – content-hash-based storage (experimental in some Genode releases).
The file system server is a regular Genode component. It receives a Block session (from a block device driver), provides File_system sessions, and the routing determines who can access what:
block_driver -> provides Block session
|
v
fatfs -> consumes Block session, provides File_system session
|
v
application -> consumes File_system session via VFS <fs> plugin
Libc Integration
Genode ports a substantial subset of FreeBSD’s libc. The integration point is the VFS: libc’s file operations are implemented by calling the VFS layer, which dispatches to plugins, which invoke Genode sessions as needed.
The libc port modifies FreeBSD libc minimally. Most changes are in the “backend” layer that replaces kernel syscalls with VFS calls:
open()->vfs_open()-> VFS plugin dispatchread()->vfs_read()-> VFS pluginsocket()-> via VFS socket plugin (<lxip>or<lwip>)mmap()-> supported for anonymous mappings and file-backed read-onlyfork()-> NOT supported (nofork()in Genode)exec()-> NOT supported (no in-place process replacement)pthreads-> supported via Genode’s Thread APIselect()/poll()-> supported via VFS notification mechanismsignal()-> partial support (SIGCHLD, basic signal delivery)
The key architectural decision: libc talks to the VFS library (in-process), the VFS talks to Genode sessions (cross-process RPC). Application code never directly touches Genode capabilities – the VFS mediates everything.
4. POSIX Compatibility
The Noux Approach (Historical)
Genode’s early POSIX approach was Noux, a process runtime that emulated Unix-like process semantics (fork, exec, pipe) on top of Genode. Noux ran as a single Genode component containing multiple “Noux processes” that shared an address space but had separate VFS views.
Noux supported:
fork()via copy-on-write within the Noux address spaceexec()via in-place program replacementpipe()for inter-process communication- A shared file system namespace
Noux was eventually deprecated because:
- It conflated multiple processes in one address space, undermining Genode’s isolation model.
- Fork emulation was fragile and slow.
- The libc-based VFS approach (Section 3) achieved better compatibility with less complexity.
Current Approach: libc + VFS
The current POSIX compatibility strategy:
-
FreeBSD libc port. Provides standard C library functions. Modified to use Genode’s VFS instead of kernel syscalls.
-
VFS plugins as POSIX backends. Each POSIX I/O pattern maps to a VFS plugin:
- File I/O ->
<fs>plugin -> File_system session - Sockets ->
<lxip>or<lwip>plugin -> Nic session (in-process TCP/IP stack) - Terminal I/O ->
<terminal>plugin -> Terminal session - Device access -> custom VFS plugins
- File I/O ->
-
No fork(). The most significant POSIX omission. Programs that require
fork()must be modified to useposix_spawn()or Genode’s native child-spawning mechanism. In practice, many programs use fork() only for daemon patterns or subprocess creation, and can be adapted. -
No exec(). Related to no fork(): there’s no in-place process replacement. New processes are created as new Genode components.
-
Signals. Basic support – enough for SIGCHLD notification and simple signal handling. Complex signal semantics (real-time signals, signal-driven I/O) are not supported.
-
pthreads. Fully supported via Genode’s native threading.
-
mmap. Anonymous mappings and read-only file-backed mappings work. MAP_SHARED with write semantics is limited.
What Works in Practice
Genode has successfully ported:
- Qt5/Qt6 – the full widget toolkit, including QtWebEngine (Chromium). This is the basis of Sculpt’s GUI.
- VirtualBox – full x86 virtualization (runs Windows, Linux guests).
- Mesa/Gallium – GPU-accelerated 3D graphics.
- curl, wget, fetchmail – network utilities.
- GCC toolchain – compiler, assembler, linker running on Genode.
- bash – with limitations (no job control via signals, no fork-heavy patterns). Works for simple scripting.
- vim, nano – terminal editors.
- OpenSSL/LibreSSL – cryptographic libraries.
- Various system utilities – ls, cp, rm, etc. via Coreutils port.
Applications that don’t port well:
- Anything deeply dependent on fork+exec patterns (e.g., traditional Unix shells for complex scripting).
- Programs relying on procfs, sysfs, or Linux-specific interfaces.
- Daemons using inotify or Linux-specific async I/O.
- Programs that assume global file system namespace visibility.
Practical Porting Effort
For most POSIX applications, porting involves:
- Build the application using Genode’s ports system (downloads upstream source, applies patches, builds with Genode’s toolchain).
- Write a VFS configuration that provides the file-like resources the application expects.
- Write a routing configuration that connects the application to required services.
- Patch
fork()calls if present (usually replacing withposix_spawn()or restructuring to avoid subprocess creation).
The VFS configuration is where the “impedance mismatch” between POSIX
expectations and Genode capabilities is resolved. The application thinks
it’s accessing /etc/resolv.conf – the VFS plugin infrastructure
translates this to capability-mediated access.
5. Component Architecture
Core, Init, and User Components
Core (or base-hw/base-nova/etc.): the lowest-level component,
running directly on the microkernel. Core provides the fundamental
services: RAM allocation, CPU time (PD sessions), ROM access (boot
modules), IRQ delivery, and I/O memory access. Core is the only
component with direct hardware access. Everything else goes through core.
Init: the first user-level component, child of core. Init reads its XML configuration and manages the component tree. Init’s responsibilities:
- Parse
<start>entries and spawn components. - Route session requests between components according to
<route>rules. - Manage component lifecycle (restart policies, resource reclamation).
- Propagate configuration changes (dynamic reconfiguration in Sculpt).
User components: all other components. They can be:
- Servers that provide sessions (drivers, file systems, network stacks).
- Clients that consume sessions (applications).
- Both simultaneously (a network stack consumes NIC sessions and provides socket-level sessions).
- Sub-inits – components that run their own init-like management for a subtree of components.
Resource Trading in Practice
Resources in Genode flow through the tree. A concrete example:
- Core has 256 MB RAM total.
- Core grants 250 MB to init, keeps 6 MB for kernel structures.
- Init grants 10 MB to the timer driver, 50 MB to the GUI subsystem, 20 MB to the network subsystem, 5 MB to a log server.
- When the GUI subsystem starts a framebuffer driver, it donates 8 MB from its 50 MB budget to the driver as a session quota.
- The framebuffer driver uses this donated RAM for the frame buffer allocation.
If the GUI subsystem wants more RAM for a new application, it can reclaim RAM by closing sessions (getting donated RAM back) or requesting more from its parent (init).
The accounting is strict: at any point, the sum of all RAM budgets across all components equals the total system RAM. There is no over-commit. This prevents the “OOM killer” problem – each component knows exactly how much RAM it can use.
Practical Component Patterns
Driver components follow a common pattern:
- Receive: Platform session (for I/O port/memory access), IRQ session
- Provide: A device-specific session (NIC, Block, GPU, Audio, etc.)
- Stateless: all per-client state funded by session quota
Multiplexer components:
- Receive: one instance of a service
- Provide: multiple instances to clients
- Example: NIC router receives one NIC session, provides multiple sessions with packet routing between clients
Proxy components:
- Forward one session type, possibly filtering or transforming
- Example: nic_bridge, nitpicker (GUI multiplexer), VFS server
Subsystem inits:
- A component running its own init for a group of related components
- Isolates the subtree: crash of the subsystem doesn’t affect siblings
- Example: Sculpt’s drivers subsystem, network subsystem
6. Sculpt OS
What Sculpt Demonstrates
Sculpt OS is Genode’s demonstration desktop operating system. It turns the component framework into a usable system where:
- Users install and run applications at runtime.
- Each application runs in its own isolated component with explicitly configured capabilities.
- A GUI lets users connect applications to services (routing).
- The entire system is reconfigurable without reboot.
Architecture
Sculpt’s component tree:
core
|
init
|
+--> drivers subsystem (sub-init)
| +--> platform_drv (PCI, IOMMU)
| +--> fb_drv (framebuffer)
| +--> usb_drv (USB host controller)
| +--> wifi_drv (wireless)
| +--> ahci_drv (SATA)
| +--> nvme_drv (NVMe)
| +--> ...
|
+--> runtime subsystem (sub-init, user-managed)
| +--> (user-installed applications)
|
+--> leitzentrale (management GUI)
| +--> system shell
| +--> config editor
|
+--> nitpicker (GUI multiplexer)
+--> nic_router (network multiplexer)
+--> ram_fs (shared file system)
+--> ...
User Experience of Capabilities
In Sculpt, installing an application means:
- Download the package (a Genode component archive).
- Edit a “deploy” configuration that specifies which services the application can access (routing rules).
- The runtime subsystem spawns the component with the specified routing.
A text editor gets: File_system session (to read/write files), GUI
session (for display), Terminal session (optionally). It does NOT get:
network access, block device access, or access to other applications’
file systems.
A web browser gets: GUI session, Nic session (for network), GPU
session (for rendering), File_system session (for downloads). Each
service connection is an explicit choice.
The deploy config is the security policy. A user can see exactly what authority each application has, and can change it by editing the config.
Lessons from Sculpt
-
Capabilities need a management UI. Raw capability graphs are incomprehensible to users. Sculpt provides a GUI that presents service connections in an understandable way (though it’s still oriented toward power users).
-
Routing is the killer feature. Being able to route the same session type to different servers for different clients is extremely powerful. One application’s “file system” is local storage; another’s is a network share – same code, different routing.
-
Sub-inits provide failure isolation. The drivers subsystem can crash and restart without affecting applications. Sculpt’s robustness comes from this hierarchical isolation.
-
Dynamic reconfiguration is essential. A static boot config (like capOS’s current manifest) is fine for servers and embedded systems, but a general-purpose OS needs to add/remove/reconfigure components at runtime.
-
Package management is a routing problem. Installing an application in Sculpt is not “copy binary to disk” – it’s “add a component to the runtime subsystem with specific routing rules.” The binary is almost secondary to the routing.
-
POSIX compat through VFS works. Sculpt runs real desktop applications (Qt-based apps, VirtualBox, web browser) using the VFS-mediated POSIX layer. The capability model doesn’t prevent running complex existing software – it just requires explicit service configuration.
7. Relevance to capOS
VFS Capability Design
Genode’s approach: The VFS is an in-process library with a plugin architecture. It mediates between libc/POSIX and Genode sessions. The VFS configuration is per-component XML.
Lessons for capOS:
-
Don’t put the VFS in the kernel. Genode’s VFS is entirely userspace, which is correct for a capability OS. capOS should do the same – the VFS is a library linked into processes that need POSIX compatibility, not a kernel subsystem.
-
Plugin model maps well to Cap’n Proto. Each Genode VFS plugin bridges to a specific session type. In capOS, each VFS “backend” would bridge to a specific capability interface:
Genode VFS plugin capOS VFS backend <fs>-> File_system sessionFsBackend-> Namespace + Store caps<terminal>-> Terminal sessionTerminalBackend-> Console cap<lxip>-> Nic sessionNetBackend-> TcpSocket/UdpSocket caps<log>-> LOG sessionLogBackend-> Console cap<ram>-> in-process RAMRamBackend-> in-process (no cap needed) -
VFS config should be declarative. Rather than hardcoding mount points, capOS processes using
libcapos-posixshould receive a VFS mount table as part of their initial capability set. This could be a Cap’n Proto struct:struct VfsMountTable { mounts @0 :List(VfsMount); } struct VfsMount { path @0 :Text; # mount point, e.g. "/data" union { namespace @1 :Void; # use the Namespace cap named in capName console @2 :Void; # use a Console cap ram @3 :Void; # in-memory filesystem socket @4 :Void; # socket interface } capName @5 :Text; # name of the cap in CapSet backing this mount }This separates the VFS topology (a deployment decision) from the application code (which just calls
open()). -
Genode’s
<fs>plugin is the key analog. capOS’s Namespace capability is equivalent to Genode’s File_system session. Thelibcapos-posixpath resolution layer (open()->namespace.resolve()) is exactly Genode’s<fs>VFS plugin. The existing capOS design indocs/proposals/userspace-binaries-proposal.mdis already on the right track. -
Consider streaming for large files. Genode uses shared-memory dataspaces for bulk data transfer in file system sessions. capOS’s current Store interface returns
Data(a capnp blob), which means the entire object is copied perget()call. For large files, a streaming interface (with a shared-memory buffer and cursor) would be more efficient. This is capOS’s Open Question #4.
Session Routing Patterns
Genode’s approach: XML-configured routing in init, label-based dispatch, parent mediates all session requests.
Lessons for capOS:
-
The manifest IS the routing config. capOS’s
SystemManifestwith structuredCapRefsource entries such as{ service = { service = "net-stack", export = "nic" } }is functionally equivalent to Genode’s init routing config. The capOS design already handles the static case well. -
Label-based routing is valuable. Genode’s ability to route different requests from the same client to different servers (based on labels) maps directly to capOS’s capability naming. capOS already does this implicitly – a process can receive separate
Namespacecaps for “config” and “data”. The key insight is that this should be a deployment-time decision, not an application-time decision. -
Consider dynamic routing. capOS’s current manifest is static (baked into the ISO). For a more flexible system, init should support runtime reconfiguration:
- Reload the manifest from a Store cap.
- Add/remove services without reboot.
- Re-route sessions when services restart.
Genode achieves this via init’s config ROM, which can be updated at runtime. capOS could achieve it by having init watch a
Namespacecap for manifest updates. -
Parent-mediated routing has costs. In Genode, every session request traverses the component tree. This adds latency and complexity. capOS’s direct capability passing (a process holds a cap directly, not through its parent) avoids this overhead. The tradeoff: capOS has less runtime control over routing (once a cap is passed, the parent can’t intercept invocations on it).
This is a deliberate design choice. capOS favors direct caps (lower overhead, simpler) over proxied caps (more control). Genode’s session routing is powerful but adds a layer of indirection that may not be worth it for capOS’s use case.
-
Service export needs a protocol. Genode’s session model has server components explicitly
announcewhat services they provide. capOS’sProcessHandle.exported()mechanism serves the same purpose. The manifest’sexportsfield pre-declares what a service will export, which helps init plan the dependency graph before spawning anything.
POSIX Compatibility Without Compromising Capabilities
Genode’s approach: libc port + VFS + per-component VFS config. No global namespace. No fork(). Applications see a curated file tree, not the real system.
Lessons for capOS:
-
The VFS is a capability adapter, not a capability. The VFS library runs inside the process that needs POSIX compatibility. It doesn’t weaken the capability model because it can only access capabilities the process was granted. This matches capOS’s
libcapos-posixdesign exactly. -
musl over FreeBSD libc. Genode uses FreeBSD libc because of its clean backend interface. capOS plans to use musl, which has an even cleaner
__syscall()interface. This is a good choice. Genode’s experience shows that the libc implementation matters less than the VFS/backend layer quality. -
No fork() is fine. Genode has operated without fork() for over 15 years and runs complex software (Qt, VirtualBox, Chromium). The applications that truly need fork() are rare and usually need only
posix_spawn()semantics. capOS should not attempt to implement fork() – focus onposix_spawn()backed byProcessSpawnercap. -
Sockets via in-process TCP/IP stack. Genode’s
<lxip>VFS plugin runs an lwIP TCP/IP stack inside the application process, using the NIC session for raw packet I/O. This avoids the overhead of routing every socket call through a separate network stack component.capOS could offer a similar choice:
- Out-of-process: socket calls go to the network stack
component via
TcpSocket/UdpSocketcaps (safer, more isolated, more overhead). - In-process: an lwIP/smoltcp library runs inside the
application, consuming a raw
Niccap (less isolation, less overhead, more authority).
For most applications, out-of-process sockets via caps are fine. For high-performance networking (database, web server), an in-process stack over a raw NIC cap may be needed.
- Out-of-process: socket calls go to the network stack
component via
-
select/poll/epoll need async caps. Genode implements select/poll via VFS notifications (signals on file readiness). capOS needs the async capability rings (io_uring-inspired) from Stage 4 before select/poll can work. This is a natural fit: each polled fd maps to a pending capability invocation in the completion ring.
Component Patterns for Cap’n Proto Interfaces
Genode’s patterns and their capOS/Cap’n Proto equivalents:
-
Session creation = factory method on a capability.
Genode: client requests a
Nic::Sessionfrom its parent, which routes to a NIC driver server.capOS: client holds a
NetworkManagercap and callscreate_tcp_socket()to get aTcpSocketcap. The factory pattern is the same, but capOS does it via direct cap invocation instead of parent-mediated session requests.Cap’n Proto naturally supports this via interfaces that return interfaces:
interface NetworkManager { createTcpSocket @0 () -> (socket :TcpSocket); createUdpSocket @1 () -> (socket :UdpSocket); createTcpListener @2 (addr :IpAddress, port :UInt16) -> (listener :TcpListener); } -
Resource quotas in session creation.
Genode: session requests include a RAM quota donated from client to server.
capOS should consider this pattern. Currently, capOS processes receive a
FrameAllocatorcap for memory. If a server needs to allocate memory per-client, the client should fund it. Cap’n Proto schema could encode this:interface FileSystem { open @0 (path :Text, bufferPages :UInt32) -> (file :File); # bufferPages: number of pages the client donates for # server-side buffering. Server allocates from a shared # FrameAllocator or the client passes frames explicitly. }This prevents the denial-of-service problem where a client opens many sessions, exhausting the server’s memory.
-
Multiplexer components.
Genode:
nic_routertakes one NIC session, provides many.nitpickertakes one framebuffer, provides many GUI sessions.capOS equivalent: a process that consumes a
Niccap and provides multipleTcpSocket/UdpSocketcaps. This is already what the network stack component does in capOS’s service architecture proposal. Cap’n Proto’s interface model makes this natural – the multiplexer implements one interface (NetworkManager) using another (Nic). -
Attenuation = capability narrowing.
Genode: servers can return restricted capabilities (e.g., a read-only file handle from a read-write file system session).
capOS: already planned via Fetch -> HttpEndpoint narrowing, Store -> read-only Store, Namespace -> scoped Namespace. The pattern is sound. Cap’n Proto interfaces make the attenuation explicit in the schema.
-
Dataspace pattern for bulk data.
Genode uses shared-memory dataspaces for efficient bulk transfer (file contents, network packets, framebuffers). The RPC path carries only small control messages and capability references.
capOS currently moves Cap’n Proto control messages through capability rings and bounded kernel scratch, with no zero-copy bulk-data object yet. For bulk data, capOS should add a
SharedBuffercapability:interface SharedBuffer { # Map a shared memory region into caller's address space map @0 () -> (addr :UInt64, size :UInt64); # Notify that data has been written to the buffer signal @1 (offset :UInt64, length :UInt64) -> (); }File system and network operations would use SharedBuffer for data transfer and capability invocations for control, matching Genode’s split between RPC and dataspaces.
-
Sub-init pattern for failure domains.
Genode: a sub-init manages a subtree of components. If the subtree crashes, only the sub-init restarts it.
capOS: a supervisor process (not necessarily init) holds a
ProcessSpawnercap and manages a group of services. This is already described in the service architecture proposal’s supervision tree. The key addition from Genode: make sub- supervisors a first-class pattern with their own manifest fragments, not just ad-hoc supervision loops.
Summary of Key Takeaways for capOS
| Area | Genode approach | capOS adaptation |
|---|---|---|
| Capability model | Kernel-enforced caps to RPC objects | Kernel-enforced caps to Cap’n Proto objects (aligned) |
| Service discovery | Parent-mediated session routing | Manifest-driven cap passing at spawn (simpler, less dynamic) |
| VFS | In-process library with plugin architecture | libcapos-posix with mount table from CapSet (same pattern) |
| POSIX | FreeBSD libc + VFS backends | musl + libcapos-posix backends (same architecture) |
| fork() | Not supported | Not supported (use posix_spawn -> ProcessSpawner) |
| Bulk data | Shared-memory dataspaces | SharedBuffer design exists; implementation pending |
| Resource accounting | Session quotas (RAM donated per session) | Authority-accounting design exists; unified ledgers pending |
| Routing labels | String labels on session requests, routed by init | Cap naming in manifest serves same purpose |
| Dynamic reconfig | Init config ROM updated at runtime | Manifest reload via Store cap (future) |
| Failure isolation | Sub-inits as failure domains | Supervisor processes (same concept, different mechanism) |
| Async notification | Signal capabilities | Async cap rings / io_uring model (more general) |
Top Recommendations
-
Add session quotas / resource trading. This is the most important Genode pattern capOS hasn’t adopted yet. Without it, a malicious client can exhaust a server’s memory by opening many capability sessions. Design resource donation into the Cap’n Proto schema for session-creating interfaces.
-
Design a SharedBuffer capability. Copying capnp messages through the kernel works for control messages but not for bulk data. A shared-memory mechanism (like Genode’s dataspaces) is essential for file I/O, networking, and GPU rendering.
-
Keep VFS as a library, not a service. Genode’s in-process VFS is the right pattern. capOS’s
libcapos-posixshould work the same way – a library that translates POSIX calls to capability invocations within the process. No VFS server component needed (though a file system server implementing the Namespace/Store interface is separate). -
Add a declarative VFS mount table to process init. Each POSIX-compat process should receive a mount table (as a capnp struct) that maps paths to capabilities. This separates deployment policy from application code, matching Genode’s per-component VFS config.
-
Plan for dynamic reconfiguration. The static manifest is fine for now, but Sculpt shows that a usable capability OS needs runtime service management. Design init so it can accept manifest updates through a cap, not just from the boot image.
-
Don’t over-engineer routing. Genode’s parent-mediated session routing is powerful but complex. capOS’s direct capability passing is simpler and sufficient for most use cases. Add proxy/mediator patterns only when specific needs arise (e.g., capability revocation, load balancing).
References
- Genode Foundations book (genode.org/documentation/genode-foundations/) – the authoritative source for architecture, session model, routing, VFS, and component composition.
- Norman Feske, “Genode Operating System Framework” (2008-2025) – release notes and design documentation at genode.org.
- Sculpt OS documentation at genode.org/download/sculpt – practical deployment of the capability model.
- Genode source repository: github.com/genodelabs/genode – reference implementations of VFS plugins, file system servers, libc port.
Research: Plan 9 from Bell Labs and Inferno OS
Lessons for a capability-based OS using Cap’n Proto wire format.
Table of Contents
- Per-Process Namespaces
- The 9P Protocol
- File-Based vs Capability-Based Interfaces
- 9P as IPC
- Inferno OS
- Relevance to capOS
1. Per-Process Namespaces
Overview
Plan 9’s most significant architectural contribution is per-process namespaces.
Every process has its own view of the file hierarchy – not a shared global
filesystem tree as in Unix. A process’s namespace is a mapping from path names
to file servers (channels to 9P-speaking services). Two processes running on
the same machine can see completely different contents at /dev, /net,
/proc, or any other path.
Namespaces are inherited by child processes (fork copies the namespace) but can be modified independently afterward. This provides a form of resource isolation that is orthogonal to traditional access control: a process simply cannot name resources that aren’t in its namespace.
The Three Namespace Operations
Plan 9 provides three system calls for namespace manipulation:
bind(name, old, flags) – Takes an existing file or directory name
already visible in the namespace and makes it also accessible at path old.
This is purely a namespace-level alias – no new file server is involved. The
name argument must resolve to something already in the namespace.
Example: bind("#c", "/dev", MREPL) makes the console device (#c is a
kernel device designator) appear at /dev. The # prefix addresses kernel
devices directly before they have been bound into the namespace.
mount(fd, old, flags, aname) – Like bind, but the source is a file
descriptor connected to a 9P server rather than an existing namespace path.
The kernel speaks 9P over fd to serve requests for paths under old. The
aname parameter selects which file tree the server should export (a single
server can serve multiple trees).
Example: mount(fd, "/net", MREPL, "") where fd is a connection to the
network stack’s file server, makes the TCP/IP interface appear at /net.
unmount(name, old) – Removes a previous bind or mount from the
namespace.
Flags and Union Directories
The flags argument to bind and mount controls how the new binding
interacts with existing content at the mount point:
MREPL(replace) – The new binding completely replaces whatever was at the mount point. Only the new server’s files are visible.MBEFORE(before) – The new binding is placed before the existing content. When looking up a name, the new binding is searched first. If not found there, the old content is searched.MAFTER(after) – The new binding is placed after the existing content. The old content is searched first.MCREATE– Combined withMBEFOREorMAFTER, controls which component of the union receives create operations.
Union directories are the result of stacking multiple bindings at one mount point. When a directory has multiple bindings, a directory listing returns the union of all names from all components. A lookup walks the bindings in order and returns the first match.
This is how Plan 9 constructs /bin: multiple directories (for different
architectures, local overrides, etc.) are union-mounted at /bin. The
shell finds commands by simple path lookup – no $PATH variable needed.
bind /rc/bin /bin # shell built-ins (MAFTER)
bind /386/bin /bin # architecture binaries (MAFTER)
bind $home/bin/386 /bin # personal overrides (MBEFORE)
A lookup for /bin/ls searches the personal directory first, then the
architecture directory, then the shell builtins – all via a single path.
Namespace Inheritance and Isolation
The rfork system call controls what the child inherits:
RFNAMEG– Child gets a copy of the parent’s namespace. Subsequent modifications by either side are independent.RFCNAMEG– Child starts with a clean (empty) namespace.- Without either flag, parent and child share the namespace (modifications by one affect the other).
This gives fine-grained control: a shell can construct a restricted namespace for a sandboxed command, or a server can create an isolated namespace for each client connection.
Namespace Construction at Boot
Plan 9’s boot process constructs the initial namespace step by step:
- The kernel provides “kernel devices” accessed via
#designators:#c(console),#e(environment),#p(proc),#I(IP stack), etc. - The boot script binds these into conventional paths:
bind "#c" /dev,bind "#p" /proc, etc. - Network connections mount remote file servers: the CPU server’s file system, the user’s home directory, etc.
- Per-user profile scripts further customize the namespace.
The result is that the “standard” file hierarchy is a convention, not a kernel requirement. Any process can rearrange it.
Namespace as Security Boundary
Plan 9 namespaces provide a form of capability-like access control:
- A process cannot access resources outside its namespace
- A parent can restrict a child’s namespace before exec
- There is no way to “escape” a namespace – there is no
..that crosses a mount boundary unexpectedly, and#designators can be restricted
However, this is not a formal capability system:
- The namespace contains string paths, which are ambient authority within the namespace
- Any process can
open("/dev/cons")if/dev/consis in its namespace – there is no per-open-call authorization - The isolation depends on correct namespace construction, not structural properties
2. The 9P Protocol
Overview
9P (and its updated version 9P2000) is the protocol spoken between clients and file servers. Every resource in Plan 9 is accessed through 9P – local kernel devices, remote file systems, user-space services, and network resources all speak the same protocol.
9P is a request-response protocol with fixed message types. It is connection-oriented: a client establishes a session, authenticates, walks paths to obtain file handles (fids), and then reads/writes through those handles.
Message Types (9P2000)
9P2000 defines the following message pairs (T = request from client, R = response from server):
Session management:
Tversion/Rversion– Negotiate protocol version and maximum message size. Must be the first message. The client proposes a version string (e.g.,"9P2000") and amsize(maximum message size in bytes). The server responds with the agreed version and msize.Tauth/Rauth– Establish an authentication fid. The client provides a user name and ananame(the file tree to access). The server returns anafidthat the client reads/writes to complete an authentication exchange.Tattach/Rattach– Attach to a file tree. The client provides theafidfrom authentication, a user name, and theaname. The server returns aqid(unique file identifier) for the root of the tree. Thisfidbecomes the client’s handle for the root directory.
Navigation:
Twalk/Rwalk– Walk a path from an existing fid. The client provides a starting fid and a sequence of name components (up to 16 per walk). The server returns a new fid pointing to the result and the qids of each intermediate step. Walk is how you traverse directories – there is noopen-by-pathoperation.
File operations:
Topen/Ropen– Open an existing file (by fid, obtained via walk). The client specifies a mode (read, write, read-write, exec, truncate). The server returns the qid and aniounit(maximum I/O size for atomic operations).Tcreate/Rcreate– Create a new file in a directory fid. The client specifies name, permissions, and mode.Tread/Rread– Readcountbytes atoffsetfrom an open fid. The server returns the data.Twrite/Rwrite– Writecountbytes atoffsetto an open fid. The server returns the number of bytes actually written.Tclunk/Rclunk– Release a fid. The server frees associated state. Equivalent toclose().Tremove/Rremove– Remove the file referenced by a fid and clunk the fid.Tstat/Rstat– Get file metadata (name, size, permissions, access times, qid, etc.).Twstat/Rwstat– Modify file metadata.
Error handling:
Rerror– Any T-message can receive anRerrorinstead of its normal response. Contains a text error string (9P2000) or an error number (9P2000.u).
Message Format
Every 9P message starts with a 4-byte length (little-endian, including the length field itself), a 1-byte type, and a 2-byte tag. The tag is chosen by the client and echoed in the response, enabling multiplexed operations over a single connection.
[4 bytes: size][1 byte: type][2 bytes: tag][... type-specific fields ...]
Field types are simple: 1/2/4/8-byte integers (little-endian), counted strings (2-byte length prefix + UTF-8), and counted data blobs (4-byte length prefix + raw bytes).
Qids and File Identity
A qid is a server-assigned 13-byte file identifier:
[1 byte: type][4 bytes: version][8 bytes: path]
- type – Bits indicating directory, append-only, exclusive-use, authentication file, etc.
- version – Incremented when the file is modified. The client can detect changes by comparing versions.
- path – A unique identifier for the file within the server. Typically a hash or inode number.
Qids allow clients to detect file identity (same path through different walks = same qid) and staleness (version changed = re-read needed).
Authentication
9P2000 authentication is pluggable. The protocol provides the Tauth/Rauth
mechanism to establish an authentication fid, but the actual authentication
exchange happens by reading and writing this fid – the protocol itself is
agnostic to the authentication method.
Plan 9’s standard mechanism is p9sk1, a shared-secret protocol using an authentication server. The flow:
- Client sends
Tauthto get anafid - Client and server exchange challenge-response messages by reading/writing
the
afid, mediated by the authentication server - Once authentication succeeds, the client uses the
afidinTattach
The key insight: authentication is just another read/write conversation over a special fid. New authentication methods can be implemented without changing the protocol.
Concurrency
9P supports concurrent operations through tags. A client can send multiple T-messages without waiting for responses. Each has a unique tag, and the server can respond out of order. The client matches responses to requests by tag.
A special tag value NOTAG (0xFFFF) is used for Tversion, which must
complete before any other messages.
The OEXCL open mode provides exclusive access to a file – only one client
can open it at a time. This is used for locking (e.g., the #l lock device
in some Plan 9 variants).
Fids are per-connection, not global. Different clients on different connections have independent fid spaces. A server maintains per-connection state.
Maximum Message Size
The msize negotiated in Tversion bounds all subsequent messages. A
typical default is 8192 or 65536 bytes. The iounit returned by Topen
tells the client the maximum useful count for read/write on that fid,
which may be less than msize minus the message header overhead.
This bounding is important for resource management – a server can limit memory consumption per connection.
3. File-Based vs Capability-Based Interfaces
Plan 9: Everything is a File
Plan 9 takes Unix’s “everything is a file” philosophy further than Unix itself ever did:
- Network stack – TCP connections are managed by reading/writing files
in
/net/tcp:clone(allocate a connection),ctl(write commands likeconnect 10.0.0.1!80),data(read/write payload),status(read connection state). - Window system – The
riowindow manager exports a file system: each window has acons,mouse,winname, etc. A program draws by writing to/dev/draw/*. - Process control –
/proc/<pid>/containsctl(writekillto signal),status(read state),mem(read/write process memory),text(read executable),note(signals). - Hardware devices – Kernel devices export file interfaces directly. The audio device is files, the graphics framebuffer is files, etc.
The interface contract is: open a file, read/write bytes, stat for metadata.
The semantics of those bytes are defined by the file server – there is no
ioctl().
Strengths of the file model:
- Universal tools work everywhere:
cat /net/tcp/0/status,echo kill > /proc/1234/ctl - Shell scripts can compose services trivially
- Network transparency is automatic: mount a remote file server, same tools work
- The interface is self-documenting:
lsshows available operations - Simple tools like
cat,echo,grepbecome universal adapters
Weaknesses of the file model:
- Type erasure. Everything is bytes. The protocol cannot express
structured data without conventions layered on top (text formats, fixed
layouts, etc.). A
read()returns raw bytes – the client must know the expected format. - Limited operation set. The only verbs are open, read, write, stat,
create, remove. Complex operations must be encoded as write-command /
read-response sequences (e.g.,
echo "connect 10.0.0.1!80" > /net/tcp/0/ctl). Error handling is ad-hoc. - No schema or type checking. Nothing prevents writing garbage to a ctl file. Errors are detected at runtime, often with cryptic messages.
- No structured errors. 9P errors are text strings. No error codes, no machine-parseable error metadata.
- Byte-stream orientation. 9P read/write are offset-based byte operations. This fits files naturally but is awkward for RPC-style request/response interactions. File servers work around this with conventions (write a command, read the response from offset 0).
- No pipelining of operations. You cannot say “open this file, then read it, and if that succeeds, write to this other file” atomically. Each step is a separate round-trip (though 9P’s tag multiplexing helps amortize latency).
Capability Systems: Everything is a Typed Interface
In a capability system like capOS, resources are accessed through typed interface references:
interface Console {
write @0 (data :Data) -> ();
writeLine @1 (text :Text) -> ();
}
interface NetworkManager {
createTcpSocket @0 (addr :Text, port :UInt16) -> (socket :TcpSocket);
}
interface TcpSocket {
read @0 (count :UInt32) -> (data :Data);
write @1 (data :Data) -> (written :UInt32);
close @2 () -> ();
}
Strengths of the capability model:
- Type safety. The interface contract is machine-checked. You cannot
call
writeon aNetworkManager– the type system prevents it. - Rich operations. Interfaces can define arbitrary methods with typed parameters and return values. No need to encode everything as byte read/writes.
- Structured errors. Return types can include error variants. Capabilities can define error enums in the schema.
- Schema evolution. Cap’n Proto supports backwards-compatible schema changes (adding fields, adding methods). Both old and new clients/servers interoperate.
- No ambient authority. A process has precisely the capabilities it
was granted. No path-based discovery, no
/procto enumerate. - Attenuation. A broad capability can be narrowed to a restricted
version (e.g.,
Fetch->HttpEndpoint). The restriction is structural, not a permission check.
Weaknesses of the capability model:
- No universal tools.
catandechodo not work on capabilities. Each interface needs its own client tool or library. Debugging requires interface-aware tools. - Harder composition. Shell pipes compose byte streams trivially. Capability composition requires typed adapters or a capability-aware shell.
- Discovery problem.
lsshows files. What shows capabilities? A management-onlyCapabilityManager.list()call, but that requires holding the manager cap and a tool that can render the result. - Steeper learning curve. A new developer can
ls /netto understand the network stack. Understanding a capability interface requires reading the schema definition. - Verbosity. Opening a TCP connection in Plan 9 is four file operations (clone, ctl, data, status). In a capability system, it is one typed method call. But defining the interface in the schema is more upfront work than just exporting files.
Synthesis
The file model and the capability model are not opposed – they are different points on a trade-off curve between universality and type safety. Plan 9 chose maximal universality (everything reduces to bytes + paths). Capability systems choose maximal type safety (everything has a schema).
The interesting question is whether a capability system can recover the ergonomic benefits of the file model while maintaining type safety. This is addressed in section 6.
4. 9P as IPC
File Servers as Services
In Plan 9, a “service” is simply a process that speaks 9P. When a client mounts a file server’s connection at some path, all file operations on that path become 9P messages to the server. This is the universal IPC mechanism – there are no Unix-domain sockets, no D-Bus, no shared memory primitives for service communication. Everything goes through 9P.
Examples of services-as-file-servers:
exportfs– Re-exports a subtree of the current namespace over a network connection, letting remote clients mount it.ramfs– A RAM-backed file server. Mount it and you have a tmpfs.ftpfs– Mounts a remote FTP server as a local directory. Programs read/write files; the file server translates to FTP protocol.mailfs– Presents a mail spool as a directory of messages. Each message is a directory withheader,body,rawbody, etc.plumber– The inter-application message router exports a file interface: write a message to/mnt/plumb/send, and it arrives in the target application’s plumb port.acme– The Acme editor exports its entire UI as a file system: windows, buffers, tags, event streams. External programs can control Acme by reading/writing these files.
The srv Device and Connection Passing
The kernel #s (srv) device provides a namespace for posting file
descriptors. A server process creates a pipe, starts serving 9P on one end,
and posts the other end as /srv/myservice. Other processes open
/srv/myservice to get a connection to the server, then mount it into
their namespace.
# Server side:
pipe = pipe()
post(pipe[0], "/srv/myfs")
serve_9p(pipe[1])
# Client side:
fd = open("/srv/myfs", ORDWR)
mount(fd, "/mnt/myfs", MREPL, "")
# Now /mnt/myfs/* are served by the server process
This decouples service registration from namespace mounting. Multiple clients can mount the same service at different paths in their own namespaces.
Performance and Overhead
9P’s overhead compared to direct function calls or shared memory:
- Serialization – Every operation is a 9P message: header parsing, field encoding/decoding. Messages are simple binary (not XML/JSON), so this is fast but nonzero.
- Copying – Data passes through the kernel (pipe or network): user buffer -> kernel pipe buffer -> server process buffer (and back for responses). This is at least two copies per direction.
- Context switches – Each request/response is a write (client) + read (server) + write (server) + read (client) = four context switches for a round-trip.
- No zero-copy – 9P does not support shared memory or page remapping. Large data transfers pay the full copy cost.
For metadata-heavy operations (stat, walk, open/close), the overhead is dominated by context switches, not data copying. Plan 9 is designed for networks where latency matters – the protocol’s simplicity and multiplexability help here.
For bulk data, the overhead is significant. Plan 9 compensates somewhat with
the iounit mechanism (encouraging large reads/writes to amortize per-call
costs) and the fact that most I/O is streaming (sequential reads/writes, not
random access).
In practice, Plan 9 systems are not optimized for raw throughput on local IPC. The design prioritizes simplicity and network transparency over local performance. The assumption is that the network is the bottleneck, so local protocol overhead is acceptable.
Network Transparency
9P’s power lies in its network transparency. The same protocol runs over:
- Pipes – Local IPC between processes on the same machine.
- TCP connections – Remote file access across the network.
- Serial lines – Early Plan 9 terminals connected to CPU servers.
- TLS/SSL – Encrypted connections (added later).
A CPU server is accessed by mounting its file system over the network. The
Plan 9 cpu command:
- Connects to a remote CPU server over TCP
- Authenticates
- Exports the local namespace (via
exportfs) to the remote side - The remote side mounts the local namespace, overlaying it with its own kernel devices
- A shell runs on the remote CPU, but with access to local files
The result: you work on the remote machine but your files, windows, and devices are local. This is more powerful than SSH because the integration is at the namespace level, not the terminal level.
Factoid: In the Plan 9 computing model, terminals were intentionally underpowered. The expensive hardware was the CPU server. Users mounted the CPU server’s filesystem and ran programs there, with the terminal providing I/O devices (keyboard, mouse, display) exported as files back to the CPU server.
5. Inferno OS
What Inferno Adds Beyond Plan 9
Inferno (also from Bell Labs, originally by the same team) took the Plan 9 architecture and adapted it for portable, networked computing. It can run as a native OS on bare hardware, as a hosted application on other OSes (Linux, Windows, macOS), or as a virtual machine.
Key additions and differences:
- Dis virtual machine – All user-space code runs on a register-based VM, not native machine code.
- Limbo language – A type-safe, garbage-collected, concurrent language (influenced Plan 9 C, CSP, Newsqueak, and Alef). All applications are written in Limbo.
- Styx protocol – Inferno’s name for its 9P variant (functionally identical to 9P2000 with minor encoding differences in early versions, later fully aligned with 9P2000).
- Portable execution – The same Limbo bytecode runs on any platform where the Dis VM is available. No recompilation needed.
- Built-in cryptography – TLS, certificate-based authentication, and signed modules are integrated into the system, not bolted on.
The Dis Virtual Machine
Dis is a register-based virtual machine (unlike the JVM, which is stack-based). Key characteristics:
- Memory model – Dis uses a module-based memory model. Each loaded module has its own data segment (frame). Instructions reference memory operands by offset within the current module’s frame, the current function’s frame, or a literal (mp, fp, or immediate addressing).
- Instruction set – CISC-inspired, with three-address instructions:
add src1, src2, dst. Opcodes cover arithmetic, comparison, branching, string operations, channel operations, and system calls. Around 80-90 opcodes. - Type descriptors – Each allocated block has a type descriptor that identifies which words are pointers. This enables exact garbage collection (no conservative scanning).
- Garbage collection – Reference counting with cycle detection. Deterministic deallocation for acyclic structures (important for resource management), with periodic cycle collection.
- Module loading – Dis modules are loaded on demand. A module declares its type signature (exported functions and their types), and the loader verifies type compatibility at link time.
- JIT compilation – On supported architectures (x86, ARM, MIPS, SPARC, PowerPC), Dis bytecode is compiled to native code at load time. This removes the interpretation overhead for hot code.
- Concurrency – Dis natively supports concurrent threads of execution within a module. Threads communicate via typed channels (from CSP/Limbo).
The Limbo Language
Limbo is Inferno’s application language. Its design reflects the system’s values:
- Type-safe – No pointer arithmetic, no unchecked casts, no buffer overflows. The type system is enforced at compile time and verified at module load time.
- Garbage collected – Programmers do not manage memory. Reference counting provides deterministic resource cleanup.
- Concurrent – First-class
chantypes (typed channels) andspawnfor creating threads. This is CSP-style concurrency, predating (and influencing) Go’s goroutines and channels. - Module system – Modules declare interfaces (like header files with
type signatures). A module
imports another module’s interface, and the runtime verifies type compatibility at load time. - ADTs – Algebraic data types with
pick(tagged unions). Pattern matching over variants. - Tuples – First-class tuple types for returning multiple values.
- No inheritance – Limbo has ADTs and modules, not objects and classes.
Example – a simple file server in Limbo:
implement Echo;
include "sys.m";
include "draw.m";
include "styx.m";
sys: Sys;
Echo: module {
init: fn(nil: ref Draw->Context, argv: list of string);
};
init(nil: ref Draw->Context, argv: list of string)
{
sys = load Sys Sys->PATH;
# ... set up Styx server, handle read/write on echo file
}
Limbo and the Namespace Model
Limbo programs interact with the namespace through the Sys module’s file
operations (open, read, write, mount, bind, etc.) – the same
operations as in Plan 9. The namespace model is identical:
- Each process group has its own namespace
bindandmountmanipulate the namespace- File servers (Styx servers) provide services
- Union directories compose multiple servers
The difference is that Limbo’s type safety extends to the file descriptors
and channels used to communicate. A Sys->FD is a reference type, not a
raw integer. You cannot fabricate a file descriptor from nothing.
Limbo’s channel type (chan of T) provides typed communication between
concurrent threads within a process. Channels are a local IPC mechanism
complementary to Styx, which handles inter-process and inter-machine
communication.
Styx (Inferno’s 9P)
Styx is Inferno’s name for the 9P2000 protocol. In the current version of Inferno, Styx and 9P2000 are wire-compatible – the same byte format, the same message types, the same semantics. The renaming reflects Inferno’s origin as a commercial product from Vita Nuova (and before that, Lucent Technologies) with its own branding.
The Inferno kernel includes a Styx library (Styx and Styxservers
modules) that makes implementing file servers straightforward in Limbo.
The Styxservers module provides a framework: you implement a navigator
(for walk/stat) and a file handler (for read/write), and the framework
handles the protocol boilerplate.
include "styx.m";
include "styxservers.m";
styx: Styx;
styxservers: Styxservers;
Srv: adt {
# ... file tree definition
};
# The framework calls navigator.walk(), navigator.stat() for metadata
# and file.read(), file.write() for data operations.
Inferno also provides the 9srvfs utility for mounting external 9P servers
and the mount command for attaching Styx servers to the namespace – the
same patterns as Plan 9.
Security Model
Inferno’s security model builds on namespaces with additional mechanisms:
- Signed modules – Dis modules can be cryptographically signed. The loader can verify signatures before executing code.
- Certificate-based authentication – Inferno uses a certificate infrastructure (not Kerberos like Plan 9) for authenticating connections.
- Namespace restriction – The
wm/shshell and other supervisory programs can construct restricted namespaces for untrusted code. - Type safety as security – Since Limbo prevents pointer forgery and buffer overflows, type safety is a security boundary. A Limbo program cannot escape its type system to forge file descriptors or access arbitrary memory.
6. Relevance to capOS
6.1 Namespace Composition via Capabilities
Plan 9 lesson: Per-process namespaces are a powerful isolation and composition mechanism. A process’s “view of the world” is constructed by its parent through bind/mount operations. The child cannot escape this view.
capOS parallel: Per-process capability tables serve an analogous role. A process’s “view of the world” is its set of granted capabilities. The child cannot discover or access capabilities outside its table.
What capOS could adopt:
The existing Namespace interface in the storage proposal
(docs/proposals/storage-and-naming-proposal.md) already captures some of this –
resolve, bind, list, and sub provide name-to-capability mappings.
But Plan 9’s namespace model suggests a more dynamic composition pattern:
interface Namespace {
# Resolve a name to a capability reference
resolve @0 (name :Text) -> (capId :UInt32, interfaceId :UInt64);
# Bind a capability at a name in this namespace
bind @1 (name :Text, capId :UInt32) -> ();
# Create a union: multiple capabilities behind one name
union @2 (name :Text, capId :UInt32, position :UnionPosition) -> ();
# List available names
list @3 () -> (entries :List(NamespaceEntry));
# Get a restricted sub-namespace
sub @4 (prefix :Text) -> (ns :Namespace);
}
enum UnionPosition {
before @0; # searched first (like Plan 9 MBEFORE)
after @1; # searched last (like Plan 9 MAFTER)
replace @2; # replaces existing (like Plan 9 MREPL)
}
struct NamespaceEntry {
name @0 :Text;
interfaceId @1 :UInt64;
label @2 :Text;
}
The key insight from Plan 9 is union composition – multiple capabilities can be bound at the same name, searched in order. This is useful for overlay patterns: a local cache capability layered before a remote store capability, or a per-user config namespace layered before a system-wide default.
Differences from Plan 9:
Plan 9 namespaces map names to file servers. capOS namespaces map names to typed capabilities. The advantage: capOS can verify at bind time that the capability matches the expected interface. Plan 9 cannot – you mount a file server and discover at runtime whether it exports the files you expect.
6.2 Cap’n Proto RPC vs 9P
Protocol comparison:
| Aspect | 9P2000 | Cap’n Proto RPC |
|---|---|---|
| Message format | Fixed binary fields, counted strings/data | Capnp wire format (pointer-based, zero-copy decode) |
| Operations | Fixed set (walk, open, read, write, stat, …) | Arbitrary per-interface (schema-defined methods) |
| Typing | Untyped bytes | Strongly typed (schema-checked) |
| Multiplexing | Tag-based (16-bit tags) | Question ID-based (32-bit) |
| Pipelining | Not supported (each op is independent) | Promise pipelining (call method on not-yet-returned result) |
| Authentication | Pluggable via auth fid | Application-level (not protocol-specified) |
| Capabilities | No (file fids are unforgeable handles, but no transfer/attenuation) | Native capability passing and attenuation |
| Maximum message | Negotiated msize | No inherent limit (segmented messages) |
| Schema evolution | N/A (fixed protocol) | Forward/backward compatible schema changes |
| Network transparency | Native design goal | Native design goal |
Key differences for capOS:
-
Promise pipelining – This is capnp RPC’s strongest advantage over 9P. In 9P, opening a TCP connection requires: walk to
/net/tcp-> walk toclone-> open clone -> read (get connection number) -> walk toctl-> open ctl -> write “connect …” -> walk todata-> open data. Eight round-trips minimum. With capnp pipelining:net.createTcpSocket("10.0.0.1", 80)returns a promise, and you can immediately call.write(data)on the promise – the runtime chains the calls without waiting for the first to complete. One logical round-trip. -
Typed interfaces – 9P’s strength is that
catworks on any file. Capnp’s strength is that the compiler catchesconsole.allocFrame()at compile time. capOS should not try to make everything a “file” – typed interfaces are the right abstraction for a capability system. But aFileServercapability interface could provide Plan 9-like flexibility where needed (see below). -
Capability passing – 9P has no way to pass a fid through a file server to a third party. (The
srvdevice is a workaround, not a protocol feature.) Capnp RPC natively supports passing capability references in messages. This is fundamental to capOS’s model.
6.3 File Server Pattern as a Capability
Plan 9’s file server pattern is useful and should not be discarded just
because capOS is capability-based. Instead, define a generic FileServer
capability interface:
interface FileServer {
walk @0 (names :List(Text)) -> (fid :FileFid);
list @1 (fid :FileFid) -> (entries :List(DirEntry));
}
interface FileFid {
open @0 (mode :OpenMode) -> (iounit :UInt32);
read @1 (offset :UInt64, count :UInt32) -> (data :Data);
write @2 (offset :UInt64, data :Data) -> (written :UInt32);
stat @3 () -> (info :FileInfo);
close @4 () -> ();
}
A FileServer capability enables:
/proc-like introspection – A debugging service exports process state as a file tree. Tools read files to inspect state.- Config storage – A configuration namespace can be exposed as files for tools that work with text.
- POSIX compatibility – The POSIX shim layer maps
open()/read()/write()toFileServercapability calls. - Shell scripting – A capability-aware shell could mount
FileServercaps and usecat/echo-style tools on them.
The point: FileServer is one capability interface among many. It is not
the universal abstraction (as in Plan 9), but it is available where the
file metaphor is natural.
6.4 IPC Lessons
Plan 9 lesson: 9P works as universal IPC because the protocol is simple and the kernel handles the plumbing (mount, pipe, network). The cost is per-message overhead (copies, context switches).
capOS implications:
-
Minimize copies. 9P’s two-copies-per-direction (user -> kernel pipe buffer -> server) is acceptable for networks but expensive for local IPC. capOS should investigate shared-memory regions for bulk data transfer between co-located processes, with capnp messages as the control plane. The roadmap’s io_uring-inspired submission/completion rings already point in this direction.
-
Direct context switch. The L4/seL4 IPC fast-path (direct switch from caller to callee without choosing an unrelated runnable process) now exists as a baseline for blocked Endpoint receivers. Plan 9 does not do this – every 9P round-trip goes through the kernel’s pipe/network layer. capOS can tune this further because capability calls have a known target process.
-
Batching. Plan 9 mitigates round-trip costs through large reads/ writes (the iounit mechanism). Capnp’s promise pipelining is the typed equivalent – batch multiple logical operations into a dependency chain that executes without intermediate round-trips.
6.5 Inferno Lessons
Dis VM / type safety: Inferno’s bet on a managed runtime (Dis + Limbo) gives it type safety as a security boundary. capOS, being written in Rust for kernel code and targeting native binaries, does not have this luxury for arbitrary user-space code. However:
- WASI support (on the roadmap) provides a sandboxed execution environment with type-checked interfaces, similar in spirit to Dis.
- Cap’n Proto schemas provide interface-level type safety even for native code. The schema is the contract, enforced at message boundaries.
Channel-based concurrency: Limbo’s chan of T type is a local IPC
mechanism within a process. capOS does not currently have this (it relies
on kernel-mediated capability calls for all IPC). For in-process threading
(on the roadmap), typed channels between threads could be useful –
implemented as a library on top of shared memory + futex, without kernel
involvement.
Portable execution: Inferno’s ability to run the same bytecode everywhere is appealing but orthogonal to capOS’s goals. The WASI runtime item on the roadmap serves this purpose for capOS.
6.6 Concrete Recommendations
Based on this research, the following items are most relevant to capOS development:
-
Add a
Namespacecapability with union semantics. Extend the existing Namespace design (from the storage proposal) with Plan 9-style union composition (before/after/replace). This enables overlay patterns for configuration, caching, and modularity. -
Implement a
FileServercapability interface. Not as the universal abstraction, but as one interface for resources that are naturally file-like (config trees, debug introspection, POSIX compatibility). AFileServercap is just another capability – no special kernel support needed. -
Prioritize promise pipelining. This is capnp’s killer feature over 9P and the biggest performance advantage for IPC-heavy workloads. Multiple logical operations collapse into one network/IPC round-trip. Async rings are in place; the remaining work is the Stage 6 pipeline dependency/result-cap mapping rule.
-
Plan 9-style namespace construction in init. The boot manifest already describes which capabilities each service receives. Consider adding namespace-level composition to the manifest: “this service sees capability X as
data/primaryand capability Y asdata/cache, with cache searched first” – union directory semantics expressed in capability terms. -
Study 9P’s
exportfspattern for network transparency. Plan 9’sexportfsre-exports a namespace subtree over the network. The capOS equivalent would be a proxy service that takes a set of local capabilities and makes them available as capnp RPC endpoints on the network. This is the “network transparency” roadmap item – 9P’s design proves it is achievable, and capnp’s richer type system makes it more robust. -
Do not replicate 9P’s weaknesses. The untyped byte-stream interface, the lack of structured errors, and the fixed operation set are 9P’s costs for universality. capOS pays none of these costs with Cap’n Proto. The temptation to “make everything a file for simplicity” should be resisted – typed capabilities are strictly more powerful, and the
FileServerinterface provides the file metaphor where needed without compromising the rest of the system.
Summary
| Plan 9 / Inferno Concept | capOS Equivalent | Gap / Action |
|---|---|---|
| Per-process namespace (bind/mount) | Per-process capability table | Add Namespace cap with union semantics |
| 9P protocol (file operations) | Cap’n Proto RPC (typed method calls) | capnp is strictly superior for typed IPC; FileServer cap provides file semantics where needed |
| Union directories | No current equivalent | Add union composition to Namespace interface |
| File servers as services | Capability-implementing processes | Already the model; manifest-driven service graph is close to Plan 9’s boot namespace construction |
| Network transparency via 9P | Network transparency via capnp RPC | Same goal, capnp adds promise pipelining and typed interfaces |
exportfs (namespace re-export) | Capability proxy service | Not yet designed; high-value future work |
| Styx/9P as universal IPC | Capnp messages as universal IPC | Already the model; prioritize fast-path and pipelining |
| Dis VM (portable, type-safe execution) | WASI runtime (roadmap) | Same goal, different mechanism |
| Limbo channels (typed local IPC) | Not yet present | Consider for in-process threading |
| Authentication via auth fid | Not yet designed | Cap’n Proto RPC has no built-in auth; needs design |
References
- Rob Pike, Dave Presotto, Sean Dorward, Bob Flandrena, Ken Thompson, Howard Trickey, Phil Winterbottom. “Plan 9 from Bell Labs.” Computing Systems, Vol. 8, No. 3, Summer 1995, pp. 221-254.
- Rob Pike, Dave Presotto, Ken Thompson, Howard Trickey, Phil Winterbottom. “The Use of Name Spaces in Plan 9.” Operating Systems Review, Vol. 27, No. 2, April 1993, pp. 72-76.
- Plan 9 Manual: intro(1), bind(1), mount(1), intro(5) (the 9P manual section).
- Russ Cox, Eric Grosse, Rob Pike, Dave Presotto, Sean Quinlan. “Security in Plan 9.” USENIX Security 2002.
- Sean Dorward, Rob Pike, Dave Presotto, Dennis Ritchie, Howard Trickey, Phil Winterbottom. “The Inferno Operating System.” Bell Labs Technical Journal, Vol. 2, No. 1, Winter 1997.
- Phil Winterbottom, Rob Pike. “The Design of the Inferno Virtual Machine.” Bell Labs, 1997.
- Vita Nuova. “The Dis Virtual Machine Specification.” 2003.
- Vita Nuova. “The Limbo Programming Language.” 2003.
- Sape Mullender (editor). “The 9P2000 Protocol.” Plan 9 manual, section 5 (intro(5)).
- Kenichi Okada. “9P Resource Sharing Protocol.” IETF Internet-Draft, 2010.
Research: EROS, CapROS, and Coyotos
Deep analysis of persistent capability operating systems and their relevance to capOS.
1. EROS (Extremely Reliable Operating System)
1.1 Overview
EROS was designed and implemented by Jonathan Shapiro and collaborators at the University of Pennsylvania, starting in the late 1990s. It is a pure capability system descended from KeyKOS (developed at Key Logic in the 1980s). EROS’s defining feature is orthogonal persistence: the entire system state – processes, memory, capabilities – is transparently persistent. There is no distinction between “in memory” and “on disk.”
Key papers:
- Shapiro, J. S., Smith, J. M., & Farber, D. J. “EROS: A Fast Capability System” (SOSP 1999)
- Shapiro, J. S. “EROS: A Capability System” (PhD dissertation, 1999)
- Shapiro, J. S. & Weber, S. “Verifying the EROS Confinement Mechanism” (IEEE S&P 2000)
1.2 The Single-Level Store
In a conventional OS, memory and storage are separate address spaces with different APIs (read/write vs mmap/file I/O). The programmer is responsible for explicitly loading data from disk into memory, modifying it, and writing it back. This creates an impedance mismatch that is the source of enormous complexity (serialization, caching, crash consistency, etc.).
EROS eliminates this distinction with a single-level store:
- All objects (processes, memory pages, capability nodes) exist in a unified persistent object space.
- There is no “file system” and no “load/save.” Objects simply exist.
- The system periodically checkpoints the entire state to disk. Between checkpoints, modified pages are held in memory. After a crash, the system restores to the last consistent checkpoint.
- From the application’s perspective, memory IS storage. There is no API for persistence – it happens automatically.
The single-level store in EROS operates on two primitive object types:
- Pages – 4KB data pages (the equivalent of both memory pages and file blocks).
- Nodes – 32-slot capability containers (the equivalent of both process state and directory entries).
Every page and node has a persistent identity (an Object ID, or OID). The kernel maintains an in-memory object cache and demand-pages objects from disk as needed. Modified objects are written back during checkpoints.
1.3 Checkpoint/Restart
EROS uses a consistent checkpoint mechanism inspired by KeyKOS:
How it works:
- The kernel periodically initiates a checkpoint (KeyKOS used a 5-minute interval; EROS used a configurable interval, typically seconds to minutes).
- All processes are momentarily frozen.
- The kernel snapshots the current state:
- All dirty pages are marked for write-back.
- All node state (capability tables, process descriptors) is serialized.
- A consistent snapshot of the entire system is captured.
- Processes resume immediately – they continue modifying their own copies of pages (copy-on-write semantics ensure the checkpoint image is stable while new modifications accumulate).
- The snapshot is written to disk asynchronously while processes continue running.
- Once the write completes, the checkpoint is atomically committed (a checkpoint header on disk is updated).
What state is captured:
- All memory pages (dirty pages since last checkpoint).
- All nodes (capability slots, process registers, scheduling state).
- The kernel’s object table (mapping OIDs to disk locations).
- The capability graph (which process holds which capabilities).
Recovery after crash:
- On boot, the kernel reads the last committed checkpoint header.
- The system resumes from that exact state. All processes continue as if nothing happened (they may have lost a few seconds of work since the last checkpoint).
- No fsck, no journal replay, no application-level recovery logic.
Performance characteristics:
- Checkpoint cost is proportional to the number of dirty pages since the last checkpoint, not total system size.
- Copy-on-write minimizes pause time – processes are frozen only long enough to mark pages, not to write them.
- EROS achieved checkpoint times of a few milliseconds for the freeze phase, with asynchronous write-back taking longer depending on dirty set size.
- The 1999 SOSP paper reported IPC performance within 2x of L4 (the fastest microkernel at the time) despite the persistence overhead.
1.4 Capabilities: Keys, Nodes, and Domains
EROS (following KeyKOS) uses a specific capability model with three fundamental concepts:
Keys (capabilities):
A key is an unforgeable reference to an object. Keys are the ONLY way to access anything in the system. There are several types:
- Page keys – reference a persistent page. Can be read-only or read-write.
- Node keys – reference a node (a 32-slot capability container). Can be read-only.
- Process keys (called “domain keys” in KeyKOS) – reference a process, allowing control operations (start, stop, set registers).
- Number keys – encode a 96-bit value directly in the key (no indirection). Used for passing constants through the capability mechanism.
- Device keys – reference hardware device registers.
- Forwarder keys – indirection keys used for revocation (see below).
- Void keys – null/invalid keys, used as placeholders.
Nodes:
A node is a persistent container of exactly 32 key slots (in KeyKOS; EROS varied this slightly). Nodes serve multiple purposes:
- Address space description: A tree of nodes with page keys at the leaves defines a process’s virtual address space. The kernel walks this tree to resolve virtual addresses to physical pages (analogous to page tables, but persistent and capability-based).
- Capability storage: A process’s “capability table” is a node tree.
- General-purpose data structure: Any capability-based data structure (directories, lists, etc.) is built from nodes.
Domains (processes):
A domain is EROS’s equivalent of a process. It consists of:
- A domain root node with specific slots for:
- Slot 0-15: general-purpose key registers (the process’s capability table)
- Address space key (points to the root of the address space node tree)
- Schedule key (determines CPU time allocation)
- Brand key (identity for authentication)
- Other control keys
- The domain’s register state (general-purpose registers, IP, SP, flags)
- A state (running, waiting, available)
The entire domain state is captured during checkpoint because it’s all stored in persistent nodes and pages.
1.5 The Keeper Mechanism
Each domain has a keeper key – a capability to another domain that acts as its fault handler. When a domain faults (page fault, capability fault, exception), the kernel invokes the keeper:
- The faulting domain is suspended.
- The kernel sends a message to the keeper describing the fault.
- The keeper can inspect and modify the faulting domain’s state (via the domain key), fix the fault (e.g., map a page, supply a capability), and restart it.
This is EROS’s equivalent of signal handlers or exception ports, but capability-mediated and fully general. Keepers enable:
- Demand paging (the space bank keeper maps pages on fault)
- Capability interposition (a keeper can wrap/restrict capabilities)
- Process supervision (restart on crash)
1.6 Capability Revocation
Capability revocation – the ability to invalidate all copies of a capability – is one of the hardest problems in capability systems. EROS solves it with forwarder keys (called “sensory keys” in some descriptions):
How forwarders work:
- Instead of giving a client a direct key to a resource, the server creates a forwarder node.
- The forwarder contains a key to the real resource in one of its slots.
- The client receives a key to the forwarder, not the resource.
- When the client invokes the forwarder key, the kernel transparently redirects to the real resource.
- To revoke: the server rescinds the forwarder (sets a bit on the forwarder node). All outstanding forwarder keys become void keys. Invocations fail immediately.
Properties:
- Revocation is O(1) – flip a bit on the forwarder node. No need to scan all processes for copies.
- Revocation is transitive – if the revoked key was used to derive other keys (via further forwarders), those are also invalidated.
- The client cannot distinguish a forwarder key from a direct key (the kernel handles the indirection transparently).
- Revocation is immediate and irrevocable.
Space banks and revocation:
EROS uses space banks (inspired by KeyKOS) to manage resource allocation. A space bank is a capability that allocates pages and nodes. When a space bank is destroyed, ALL objects allocated from it are reclaimed. This provides bulk revocation of an entire subsystem.
1.7 Confinement
EROS provides a formally verified confinement mechanism. A confined subsystem cannot leak information to the outside world except through channels explicitly provided to it. Shapiro and Weber (IEEE S&P 2000) proved that EROS can construct a confined subsystem using:
- A constructor creates the confined process.
- The confined process receives ONLY the capabilities explicitly granted to it. It has no ambient authority, no access to timers (to prevent timing channels), and no access to storage (to prevent storage channels).
- The constructor verifies that no covert channels exist in the granted capability set.
This is relevant to capOS’s capability model: the same structural properties that make EROS confinement possible (no ambient authority, capabilities as the only access mechanism) are present in capOS’s design.
2. CapROS
2.1 Relationship to EROS
CapROS (Capability-based Reliable Operating System) is the direct successor to EROS. It was started by Charles Landau (who also worked on KeyKOS) and continues development based on the EROS codebase. CapROS is essentially “EROS in production” – the same architecture with engineering improvements.
2.2 Improvements Over EROS
Practical engineering focus:
- EROS was a research system; CapROS aims to be deployable.
- CapROS added support for modern hardware (PCI, USB, networking).
- Improved build system and development toolchain.
Persistence improvements:
- CapROS refined the checkpoint mechanism for better performance with modern disk characteristics (SSDs change the cost model significantly – random writes are cheap, so the checkpoint layout can be optimized differently than for spinning disks).
- Added support for larger persistent object spaces.
- Improved crash recovery speed.
Device driver model:
- CapROS runs device drivers as user-space processes (like EROS), each receiving only the device capabilities they need.
- A device driver receives: device register keys (MMIO access), interrupt keys (to receive interrupts), and DMA buffer keys.
- The driver CANNOT access other devices, other processes’ memory, or arbitrary I/O ports. It is confined to its specific device.
- This is directly analogous to capOS’s planned device capability model (see the networking and cloud deployment proposals).
Linux compatibility layer:
- CapROS includes a partial Linux kernel compatibility layer that allows some Linux device drivers to be compiled and run as CapROS user-space drivers. This pragmatically addresses the “driver availability” problem without compromising the capability model.
2.3 Current Status
CapROS development continued into the 2010s but has been relatively quiet. The codebase exists and runs on real x86 hardware. It is not widely deployed and remains primarily a research/demonstration system. The key contribution is demonstrating that the EROS/KeyKOS persistent capability model is viable on modern hardware and can support real device drivers and applications.
2.4 Device Drivers and Hardware Access
CapROS’s device driver isolation is worth examining in detail because capOS faces the same design decisions:
Device capability model:
Kernel
│
├── DeviceManager capability
│ │
│ ├── grants DeviceMMIO(base, size) to driver
│ ├── grants InterruptCap(irq_number) to driver
│ └── grants DMAPool(phys_range) to driver
│
└── Driver process
│
├── uses DeviceMMIO to read/write registers
├── uses InterruptCap to wait for interrupts
├── uses DMAPool to allocate DMA-safe buffers
└── exports higher-level capability (e.g., NIC, Block)
The driver has no way to access memory outside its granted ranges. A buggy NIC driver cannot corrupt disk I/O or access other processes’ pages.
3. Coyotos
3.1 Design Philosophy
Coyotos was Jonathan Shapiro’s next-generation project after EROS, started around 2004. Where EROS was an implementation of the KeyKOS model in C, Coyotos aimed to be a formally verifiable capability OS from the ground up.
Key differences from EROS:
- Verification-oriented design: Every kernel mechanism was designed to be amenable to formal verification. If a feature couldn’t be verified, it was redesigned or removed.
- BitC language: A new programming language (BitC) was designed specifically for writing verified systems software.
- Simplified object model: Coyotos reduced the number of primitive object types compared to EROS, making the verification target smaller.
- No inline assembly in the verified core: The verified kernel core was to be written entirely in BitC, with a thin hardware abstraction layer underneath.
3.2 BitC Language
BitC was an ambitious attempt to create a language suitable for both systems programming and formal verification:
Design goals:
- Type safety: Sound type system that prevents memory errors at compile time.
- Low-level control: Direct memory layout control, no garbage collector, suitable for kernel code.
- Formal reasoning: Type system designed so that proofs about programs could be mechanically checked.
- Mutability control: Explicit distinction between mutable and immutable references (predating Rust’s borrow checker by several years).
Relationship to capability verification:
The key insight was that if the kernel is written in a language with a sound type system, and capabilities are represented as typed references in that language, then many capability safety properties (no forgery, no amplification) follow from type safety rather than requiring separate proofs.
Specifically:
- Capabilities are opaque typed references – the type system prevents construction of capabilities from raw integers.
- The lack of arbitrary pointer arithmetic prevents capability forgery.
- Type-based access control means a read-only capability reference cannot be cast to a read-write one.
Outcome:
BitC was never completed. The language design proved extremely difficult – combining low-level systems programming with formal verification requirements created unsolvable tensions in the type system. Shapiro eventually acknowledged that the BitC approach was overambitious and shelved the project. (Rust, which appeared later, solved many of the same problems with a different approach – borrowing and lifetimes rather than full dependent types.)
3.3 Formal Verification Approach
Coyotos aimed to verify several key properties:
- Capability safety: No process can forge, modify, or amplify a capability. This was to be proved as a consequence of BitC’s type safety.
- Confinement: A confined subsystem cannot leak information except through authorized channels. EROS proved this informally; Coyotos aimed for machine-checked proofs.
- Authority propagation: Formal model of how authority flows through the capability graph, allowing static analysis of security policies.
- Memory safety: The kernel never accesses memory it shouldn’t, never double-frees, never uses after free. Type safety + linear types in BitC were intended to guarantee this.
The verification approach influenced later work on seL4, which successfully achieved formal verification of a capability microkernel (though in C with Isabelle/HOL proofs, not in a verification-oriented language).
3.4 Coyotos Memory Model
Coyotos simplified the EROS memory model while retaining persistence:
Objects:
- Pages: 4KB data pages (same as EROS).
- CapPages: Pages that hold capabilities instead of data. This replaced EROS’s fixed-size nodes with variable-size capability containers.
- GPTs (Guarded Page Tables): A unified abstraction for address space construction. Instead of EROS’s separate node trees for address spaces, Coyotos uses GPTs that combine guard bits (for sparse address space construction, similar to Patricia trees) with page table semantics.
- Processes: Similar to EROS domains but with a cleaner structure.
- Endpoints: IPC communication endpoints (similar to L4 endpoints, replacing EROS’s direct domain-to-domain calls).
GPTs (Guarded Page Tables):
This was Coyotos’s most innovative memory model contribution. A GPT node has:
- A guard value and guard length (for address space compression).
- Multiple capability slots pointing to sub-GPTs or pages.
- Hardware-independent address space description that the kernel translates to actual page tables on TLB miss.
The guard mechanism allows sparse address spaces without allocating intermediate page table levels. For example, a process that uses only two memory regions at addresses 0x1000 and 0x7FFF_F000 needs only a few GPT nodes, not a full 4-level page table tree.
Persistence:
Coyotos retained EROS’s checkpoint-based persistence but with a cleaner separation between the persistent object store and the in-memory cache. The simpler object model (fewer object types) made the checkpoint logic easier to verify.
3.5 Current Status
Coyotos was never completed. The BitC language proved too difficult, and Shapiro moved on to other work. However, Coyotos’s design documents and specifications remain valuable as a carefully reasoned evolution of the EROS model. The key ideas (GPTs, endpoint-based IPC, verification-oriented design) influenced other systems work.
4. Single-Level Store: Deep Dive
4.1 The Core Concept
The single-level store unifies two traditionally separate abstractions:
| Traditional OS | Single-Level Store |
|---|---|
| Virtual memory (RAM, volatile) | Unified persistent object space |
| File system (disk, persistent) | Same unified space |
| mmap (bridge between the two) | No bridge needed |
| Serialization (convert objects to bytes for storage) | Objects are always in storable form |
| Crash recovery (fsck, journal replay) | Checkpoint restore |
In a single-level store, the programmer never thinks about persistence. Objects are created, modified, and eventually garbage collected. The system ensures they survive power failure without any explicit save operation.
4.2 Implementation in EROS
EROS’s single-level store works as follows:
Object storage on disk:
- The disk is divided into two regions: the object store and the checkpoint log.
- The object store holds the canonical copy of all objects (pages and nodes), indexed by OID.
- The checkpoint log holds the most recently checkpointed versions of modified objects.
Object lifecycle:
- An object is created (allocated from a space bank). It receives a unique OID.
- The object exists in the in-memory object cache. It may be modified arbitrarily.
- During checkpoint, if the object is dirty, its current state is written to the checkpoint log.
- After the checkpoint commits, the logged version may be migrated to the object store (or left in the log until the next checkpoint).
- If the object is evicted from memory (memory pressure), it can be demand-paged back from disk.
Demand paging:
When a process accesses a virtual address that isn’t currently in physical memory:
- Page fault occurs.
- The kernel looks up the OID for that virtual page (by walking the address space capability tree).
- If the object is on disk, the kernel reads it into the object cache.
- The page is mapped into the process’s address space.
- The process continues, unaware that I/O occurred.
This is similar to demand paging in a conventional OS, but with a critical difference: the “backing store” is the persistent object store, not a swap partition. There is no separate swap space.
4.3 Performance Implications
Advantages:
- No serialization overhead for persistence. Objects are stored in their in-memory format.
- No double-buffering. A conventional OS may have a page in both the page cache and a file buffer; EROS has one copy.
- Checkpoint cost is proportional to mutation rate, not data size.
- Recovery is instantaneous – resume from last checkpoint, no log replay.
Disadvantages:
- Checkpoint pause: Even with copy-on-write, there is a brief pause to snapshot the system state. KeyKOS/EROS measured this at milliseconds, but it can grow with the number of dirty pages.
- Write amplification: Every modified page must be written to the checkpoint log, even if only one byte changed. This is worse than a log-structured filesystem that can coalesce small writes.
- Memory pressure: The object cache competes with application working sets. Under heavy memory pressure, the system may thrash between paging objects in and checkpointing them out.
- Large object stores: The OID-to-disk-location mapping must be kept in memory (or itself paged, adding complexity). For very large stores, this overhead grows.
- No partial persistence: You can’t choose to make some objects transient and others persistent. Everything is persistent. This wastes disk bandwidth on objects that don’t need persistence (temporary buffers, caches, etc.).
4.4 Relationship to Persistent Memory (PMEM/Optane)
Intel Optane (3D XPoint, now discontinued but conceptually important) and other persistent memory technologies provide byte-addressable storage that survives power loss. This is remarkably close to what EROS simulates in software:
| EROS Single-Level Store | PMEM Hardware |
|---|---|
| Software checkpoint to disk | Hardware persistence on every write |
| Object cache in DRAM | Data in persistent memory |
| Demand paging from disk | Direct load/store to persistent media |
| Crash = lose since last checkpoint | Crash = lose in-flight stores (cache lines) |
PMEM makes the single-level store cheaper:
- No checkpoint writes needed for objects stored in PMEM – they’re already persistent.
- No demand paging from disk – PMEM is directly addressable.
- Consistency requires cache line flush + fence (much cheaper than disk I/O).
But PMEM doesn’t eliminate the need for the store abstraction:
- PMEM capacity is limited (compared to SSDs/HDDs). The object store may still need to tier between PMEM and block storage.
- PMEM has higher latency than DRAM. The object cache still has value as a fast-path.
- Crash consistency with PMEM requires careful ordering of writes (cache line flushes). The checkpoint model actually simplifies this – you don’t need per-object crash consistency, just per-checkpoint consistency.
Relevance to capOS:
Even without PMEM hardware, understanding the single-level store model informs how capOS can design its persistence layer. The key insight is that separating “in-memory format” from “on-disk format” creates unnecessary complexity. Cap’n Proto’s zero-copy serialization already blurs this line – a capnp message in memory has the same byte layout as on disk.
5. Persistent Capabilities
5.1 How Persistent Capabilities Survive Restarts
In EROS/KeyKOS, capabilities survive restarts because they are part of the checkpointed state:
- A capability is stored as a key in a node slot.
- The key contains: (object type, OID, permissions, other metadata).
- During checkpoint, all nodes (including their key slots) are written to disk.
- On restart, nodes are restored. Keys reference objects by OID. Since objects are also restored, the key resolves to the same object.
The critical property: capabilities are named by the persistent identity of their target, not by a volatile address. A key says “page #47293” not “memory address 0x12345.” Since page #47293 is persistent, the key is meaningful across restarts.
5.2 Consistency Model
EROS guarantees checkpoint consistency: the entire system is restored to the state at the last committed checkpoint. This means:
- If process A sent a message to process B, and both the send and receive completed before the checkpoint, both see the message after restart.
- If the send completed but the receive didn’t (checkpoint happened between them), both are rolled back to before the send. The message is lost, but the system is consistent.
- There is no scenario where A thinks it sent a message but B never received it (or vice versa). The checkpoint captures a consistent global snapshot.
This is analogous to database transaction atomicity but applied to the entire system state.
5.3 Volatile State and Capabilities
Some capabilities reference inherently volatile state. EROS handles this through the object re-creation pattern:
Hardware devices:
- Device keys reference hardware registers that don’t survive reboot.
- On restart, the kernel re-initializes device state and re-creates device keys.
- Processes that held device keys get valid keys again (pointing to the re-initialized device), but the device state itself is reset.
- The process’s device driver is responsible for re-initializing the device to the desired state (this is application logic, not kernel logic).
Network connections:
- EROS doesn’t have a native networking stack in the kernel, so this is handled at the application level.
- A network service process re-establishes connections on restart.
- Clients that held capabilities to network endpoints would invoke them, and the network service would transparently reconnect.
- The capability abstraction hides the reconnection – the client’s code doesn’t change.
General pattern:
When a capability references state that can’t survive restart:
- The capability itself persists (it’s in a node slot, checkpointed).
- On restart, invoking the capability may trigger re-initialization.
- The keeper mechanism handles this: the target object’s keeper detects the stale state and re-initializes before completing the call.
- The client is unaware of the restart (or sees a transient error if re-initialization fails).
5.4 The Space Bank Model
Persistent capabilities create a garbage collection problem: when is it safe to reclaim a persistent object? EROS solves this with space banks:
- A space bank is a capability that allocates objects (pages and nodes).
- Every object is allocated from exactly one space bank.
- Space banks can be hierarchical (a bank allocates from a parent bank).
- Destroying a space bank reclaims ALL objects allocated from it.
This provides:
- Bulk deallocation: Terminate a subsystem by destroying its bank.
- Resource accounting: Each bank tracks how much space it has consumed.
- Revocation: Destroying a bank revokes all capabilities to objects allocated from it (the objects cease to exist).
The space bank model avoids the need for a global garbage collector scanning the capability graph. Instead, resource lifetimes are explicitly managed through the bank hierarchy.
6. Relevance to capOS
6.1 Cap’n Proto as Persistent Capability Format
EROS stores capabilities as (type, OID, permissions) tuples in fixed-size node slots. capOS can do something analogous but more naturally, because Cap’n Proto already provides a serialization format for structured data:
A persistent capability in capOS could be a capnp struct:
struct PersistentCapRef {
interfaceId @0 :UInt64; # which capability interface
objectId @1 :UInt64; # persistent object identity
permissions @2 :UInt32; # bitmask of allowed methods
epoch @3 :UInt64; # revocation epoch (see below)
}
Why this works well with Cap’n Proto:
- Zero-copy persistence: A capnp message in memory has the same byte layout as on disk. No serialization/deserialization step for persistence. This is the closest a modern system can get to EROS’s single-level store without hardware support.
- Schema evolution: Cap’n Proto’s backwards-compatible schema evolution means persistent capability formats can evolve without breaking existing stored capabilities.
- Cross-machine references: The same
PersistentCapRefcan reference a local or remote object. TheobjectIdcan include a machine/node identifier for distributed capabilities. - Type safety: The
interfaceIdfield provides runtime type checking that EROS’s keys lacked (EROS keys are untyped references; the type is determined by the target object, not the key).
Difference from EROS:
EROS capabilities are kernel objects – the kernel knows about every key and
mediates every invocation. In capOS, PersistentCapRef could be a
user-space construct – a serialized reference that is resolved by the
kernel (or a userspace capability manager) when invoked. This is a
deliberate trade-off: less kernel complexity, more flexibility, but the
kernel must validate references on use rather than at creation time.
6.2 Checkpoint/Restart Patterns for capOS
EROS’s checkpoint model provides several patterns capOS could adopt:
Pattern 1: Application-Level Checkpointing (Recommended as Phase 1)
This is what capOS’s storage proposal already describes: services serialize their own state to the Store capability. This is simpler than EROS’s transparent persistence but requires application cooperation.
Service state → capnp serialize → Store.put(data) → persistent hash
On restart: Store.get(hash) → capnp deserialize → restore state
Advantages over EROS transparent persistence:
- No kernel complexity for checkpointing.
- Services control what is persistent and what is transient.
- No “checkpoint pause” – services choose when to persist.
- Natural fit with Cap’n Proto (state is already capnp).
Disadvantages:
- Every service must implement save/restore logic.
- No automatic consistency across services (each saves independently).
- Programmer error can lead to inconsistent state after restart.
Pattern 2: Kernel-Assisted Checkpointing (Phase 2)
Add a Checkpoint capability that captures process state:
interface Checkpoint {
# Save the calling process's state (registers, memory, cap table)
save @0 () -> (handle :Data);
# Restore a previously saved state
restore @1 (handle :Data) -> ();
}
This is analogous to CRIU (Checkpoint/Restore in Userspace) on Linux but capability-mediated:
- The kernel captures the process’s address space, register state, and capability table.
- State is serialized as capnp messages and stored via the Store capability.
- Restore creates a new process from the saved state.
Advantages:
- Transparent to the application (no save/restore logic needed).
- Can capture the full capability graph of a process.
- Enables process migration between machines.
Disadvantages:
- Kernel complexity for state capture.
- Must handle capabilities that reference volatile state (open network connections, device handles).
- Memory overhead for copy-on-write snapshots.
Pattern 3: Consistent Multi-Process Checkpointing (Phase 3)
EROS’s global checkpoint extended to capOS:
- A
CheckpointCoordinatorservice initiates a distributed snapshot. - All participating services freeze, checkpoint their state, then resume.
- The coordinator records a consistent cut across all services.
- Recovery restores all services to the same consistent point.
This requires:
- A coordination protocol (similar to distributed database commit).
- Services must participate in the protocol (register with the coordinator, respond to freeze/checkpoint/resume signals).
- The coordinator must handle failures during the checkpoint itself.
This is the most complex option but provides the strongest consistency guarantees. It’s appropriate for capOS’s later stages when multi-service reliability matters.
6.3 Capability-Native Filesystem Design
EROS’s model and capOS’s Store proposal can be synthesized into a capability-native filesystem design:
Hybrid approach: Content-Addressed Store + Capability Metadata
capOS’s current Store proposal uses content-addressed storage (hash-based). This is good for immutable data but awkward for capability references (a capability’s target may change without the capability itself changing).
A better model, informed by EROS:
Persistent Object = (ObjectId, Version, CapnpData, CapSlots[])
Where:
ObjectIdis a persistent identity (like EROS’s OID).Versionis a monotonic counter (for optimistic concurrency).CapnpDatais the object’s data payload as a capnp message.CapSlots[]is a list of capability references embedded in the object (like EROS’s node slots).
This separates data from capability references, which is important because:
- Data can be content-addressed (deduplicated by hash).
- Capability references must be identity-addressed (two identical-looking references to different objects are different).
- Revocation operates on capability references, not data.
The Namespace as Directory
capOS’s Namespace capability is the capability-native equivalent of a
directory:
| Unix | EROS | capOS |
|---|---|---|
| Directory (inode + dentries) | Node with keys in slots | Namespace capability |
| Path traversal | Node tree walk | Namespace.resolve() chain |
| Permission bits | Key type + slot permissions | Capability attenuation |
| Hard links | Multiple keys to same object | Multiple refs to same hash |
| Symbolic links | Forwarder keys | Redirect capabilities |
Journaling and Crash Consistency
EROS avoids journaling by using checkpoint-based consistency. capOS’s Store service needs its own consistency story:
Option A: Checkpoint-based (EROS-style)
- Store service maintains an in-memory cache of recent modifications.
- Periodically flushes a consistent snapshot to disk.
- On crash, recovers to last flush point.
- Simple but may lose recent writes.
Option B: Log-structured (modern)
- All writes go to an append-only log.
- A background compaction process builds indexed snapshots from the log.
- On crash, replay the log from the last snapshot.
- More complex but no data loss window.
Option C: Hybrid
- Capability metadata (the namespace bindings) uses a write-ahead log for crash consistency.
- Object data (capnp blobs in the content-addressed store) uses checkpoint-based consistency (losing a few blobs is tolerable; losing a namespace binding is not).
Option C is recommended for capOS: it provides strong consistency for the critical metadata while keeping the data path simple.
6.4 Transparent vs Explicit Persistence: Tradeoffs
| Aspect | EROS Transparent | capOS Explicit | Hybrid |
|---|---|---|---|
| Application complexity | None (automatic) | High (must implement save/restore) | Medium (opt-in transparency) |
| Kernel complexity | Very high (checkpoint, COW, object store) | Low (just IPC and memory) | Medium (checkpoint capability) |
| Consistency | Strong (global checkpoint) | Weak (per-service) | Medium (coordinator) |
| Control | None (everything persists) | Full (choose what to save) | Selective |
| Performance | Checkpoint pauses | No pauses, explicit I/O cost | Configurable |
| Volatile state | Keeper mechanism handles | Service handles reconnection | Annotated capabilities |
| Debuggability | Hard (system is a black box) | Easy (state is explicit capnp) | Medium |
| Cap’n Proto fit | Neutral | Excellent (state = capnp) | Good |
Recommendation for capOS:
Start with explicit persistence (Phase 1 in the storage proposal) because:
- It’s dramatically simpler to implement.
- Cap’n Proto makes serialization nearly free anyway.
- It gives services control over what is persistent.
- It aligns with capOS’s existing Store/Namespace design.
- The kernel stays simple.
Then add opt-in kernel-assisted checkpointing (like the Checkpoint capability described above) for services that want transparent persistence. This gives the benefits of EROS’s model without forcing it on everything.
Never implement EROS’s fully transparent global persistence – the kernel complexity is enormous, the debugging experience is poor, and modern systems (with fast SSDs and capnp zero-copy serialization) don’t need it. The explicit model with good tooling is strictly better for a research OS.
6.5 Capability Revocation in capOS
EROS’s forwarder key model translates directly to capOS:
Epoch-based revocation:
Each capability includes a revocation epoch. The kernel (or capability manager) maintains a per-object epoch counter. When a capability is invoked:
- Check that the capability’s epoch matches the object’s current epoch.
- If it doesn’t match, the capability has been revoked – return an error.
- To revoke all capabilities to an object, increment the object’s epoch.
This is O(1) revocation (increment a counter) with O(1) check per invocation (compare two integers). It’s simpler than EROS’s forwarder mechanism and fits naturally into a capnp-serialized capability reference:
struct CapRef {
objectId @0 :UInt64;
epoch @1 :UInt64; # revocation epoch
permissions @2 :UInt32; # method bitmask
interfaceId @3 :UInt64; # type of the capability
}
Space bank analog:
capOS can implement EROS’s space bank pattern using the Store:
- Each “bank” is a Namespace prefix in the Store.
- Objects allocated by a service are stored under its namespace.
- Destroying the service’s namespace revokes access to all its objects.
- Resource accounting is done by the Store service (track bytes per namespace).
6.6 Summary of Recommendations
| EROS/CapROS/Coyotos Concept | capOS Recommendation |
|---|---|
| Single-level store | Don’t implement (too complex for research OS). Use Cap’n Proto zero-copy as a lightweight equivalent. |
| Checkpoint/restart | Phase 1: application-level (explicit capnp save/restore). Phase 2: Checkpoint capability for opt-in transparent persistence. |
| Persistent capabilities | Use capnp PersistentCapRef struct with objectId + epoch. Store capability graph in the Store service. |
| Capability revocation | Epoch-based revocation (increment counter, check on invocation). Simpler than EROS forwarders, same O(1) cost. |
| Space banks | Map to Store namespaces. Destroying a namespace reclaims all objects. |
| Keeper/fault handler | Map to capOS’s supervisor mechanism (service-architecture proposal). Supervisor receives fault notifications and can restart/repair. |
| GPTs (Coyotos) | Not needed – capOS uses hardware page tables directly. The sparse address-space idea remains relevant for future SharedBuffer/AddressRegion work beyond the current VirtualMemory cap. |
| Confinement | capOS already has the structural prerequisites (no ambient authority). Formal confinement proofs are a future research direction. |
| Device isolation | Already planned in capOS (device capabilities with MMIO/interrupt/DMA grants). CapROS validates this approach works in practice. |
Key References
- Shapiro, J. S., Smith, J. M., Farber, D. J. “EROS: A Fast Capability System.” Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP), 1999.
- Shapiro, J. S. “EROS: A Capability System.” PhD dissertation, University of Pennsylvania, 1999.
- Shapiro, J. S. & Weber, S. “Verifying the EROS Confinement Mechanism.” IEEE Symposium on Security and Privacy, 2000.
- Hardy, N. “The Confused Deputy.” ACM SIGOPS Operating Systems Review, 1988. (Motivates capability-based access control.)
- Hardy, N. “KeyKOS Architecture.” Operating Systems Review, 1985.
- Landau, C. R. “The Checkpoint Mechanism in KeyKOS.” Proceedings of the Second International Workshop on Object Orientation in Operating Systems, 1992.
- Shapiro, J. S. et al. “Coyotos Microkernel Specification.” Technical report, Johns Hopkins University, 2004-2008.
- Shapiro, J. S. et al. “BitC Language Specification.” Technical report, Johns Hopkins University, 2004-2008.
- Dennis, J. B. & Van Horn, E. C. “Programming Semantics for Multiprogrammed Computations.” Communications of the ACM, 1966. (Original capability concept.)
- Levy, H. M. “Capability-Based Computer Systems.” Digital Press, 1984. (Comprehensive survey of capability systems including CAP, Hydra, iAPX 432, StarOS.)
LLVM Target Customization for capOS
Deep research report on creating custom LLVM/Rust/Go targets for a capability-based OS.
Status 2026-04-30 00:41 UTC: capOS keeps the kernel on
x86_64-unknown-none, while userspace builds through the checked-in
x86_64-unknown-capos target plus the runtime linker-script path. Since this
report was first written, PT_TLS parsing, userspace TLS block setup, FS-base
save/restore, the VirtualMemory capability, a #[thread_local] QEMU smoke,
Timer now/sleep, current-execution-context ThreadControl FS-base updates,
the single-thread runtime checkpoint, process-local thread lifecycle, and
private ParkSpace wait/wake have landed. Anonymous VirtualMemory
unmap/decommit and explicit MemoryObject.unmap now drain private park waiters
before address reuse. Runtime park clients, Go futexsleep/futexwake glue,
per-thread TLS ownership for full multi-thread runtime use, shared park words,
address-space generation cleanup, and a Go port remain future work.
Table of Contents
- Custom OS Target Triple
- Calling Conventions
- Relocations
- TLS (Thread-Local Storage) Models
- Rust Target Specification
- Go Runtime Requirements
- Relevance to capOS
1. Custom OS Target Triple
Target Triple Format
LLVM target triples follow the format <arch>-<vendor>-<os> or
<arch>-<vendor>-<os>-<env>:
- arch:
x86_64,aarch64,riscv64gc, etc. - vendor:
unknown,apple,pc, etc. (oftenunknownfor custom OSes) - os:
linux,none,redox,hermit,fuchsia, etc. - env (optional):
gnu,musl,eabi, etc.
For capOS, the eventual userspace target triple should be
x86_64-unknown-capos. The kernel should keep using a freestanding target
(x86_64-unknown-none) unless a kernel-specific target file becomes useful
for build hygiene.
What LLVM Needs
LLVM’s target description consists of:
- Target machine: Architecture (instruction set, register file, calling conventions). x86_64 already exists in LLVM.
- Object format: ELF, COFF, Mach-O. capOS uses ELF.
- Relocation model: static, PIC, PIE, dynamic-no-pic.
- Code model: small, kernel, medium, large.
- OS-specific ABI details: Stack alignment, calling convention defaults, TLS model, exception handling mechanism.
LLVM does NOT need kernel-level knowledge of your OS. It needs to know how to generate correct object code for the target environment. The OS name in the triple primarily affects:
- Default calling convention selection
- Default relocation model
- TLS model selection
- Object file format and flags
- C library assumptions (relevant for C compilation, less for Rust no_std)
Creating a New OS in LLVM (Upstream Path)
To add capos as a recognized OS in LLVM itself:
- Add the OS to
llvm/include/llvm/TargetParser/Triple.h(theOSTypeenum) - Add string parsing in
llvm/lib/TargetParser/Triple.cpp - Define ABI defaults in the relevant target (
llvm/lib/Target/X86/) - Update Clang’s driver for the new OS
(
clang/lib/Driver/ToolChains/,clang/lib/Basic/Targets/)
This is significant upstream work and not necessary initially. The pragmatic path is using Rust’s custom target JSON mechanism (see Section 5).
What Other OSes Do
| OS | LLVM status | Approach |
|---|---|---|
| Redox | Upstream in Rust; no dedicated LLVM OS enum in current LLVM | Full triple x86_64-unknown-redox, Tier 2 in Rust |
| Hermit | Upstream in LLVM and Rust | x86_64-unknown-hermit, Tier 3, unikernel |
| Fuchsia | Upstream in LLVM and Rust | x86_64-unknown-fuchsia, Tier 2 |
| Theseus | Custom target JSON | Uses x86_64-unknown-theseus JSON spec, not upstream |
| Blog OS (phil-opp) | Custom target JSON | Uses JSON target spec, targets x86_64-unknown-none base |
| seL4/Robigalia | Custom target JSON | Modified from x86_64-unknown-none |
Recommendation for capOS: keep the kernel on x86_64-unknown-none.
Introduce a userspace-only custom target JSON when cfg(target_os = "capos")
or toolchain packaging becomes valuable. Do not upstream a capos OS triple
until the userspace ABI is stable.
Treat the userspace target as build hygiene and runtime scaffolding for now. It
does not promise a stable language ABI, Rust std, Go, C runtime, or upstream
target contract beyond the current static no_std userspace model.
2. Calling Conventions
LLVM Calling Conventions
LLVM supports numerous calling conventions. The ones relevant to capOS:
| CC | LLVM ID | Description | Relevance |
|---|---|---|---|
| C | 0 | Default C calling convention (System V AMD64 ABI on x86_64) | Primary for interop |
| Fast | 8 | Optimized for internal use, passes in registers | Rust internal use |
| Cold | 9 | Rarely-called functions, callee-save heavy | Error paths |
| GHC | 10 | Glasgow Haskell Compiler, everything in registers | Not relevant |
| HiPE | 11 | Erlang HiPE, similar to GHC | Not relevant |
| WebKit JS | 12 | JavaScript JIT | Not relevant |
| AnyReg | 13 | Dynamic register allocation | JIT compilers |
| PreserveMost | 14 | Caller saves almost nothing | Interrupt handlers |
| PreserveAll | 15 | Caller saves nothing | Context switches |
| Swift | 16 | Swift self/error registers | Not relevant |
| CXX_FAST_TLS | 17 | C++ TLS access optimization | TLS wrappers |
| X86_StdCall | 64 | Windows stdcall | Not relevant |
| X86_FastCall | 65 | Windows fastcall | Not relevant |
| X86_RegCall | 95 | Register-based calling | Performance-critical code |
| X86_INTR | 83 | x86 interrupt handler | IDT handlers |
| Win64 | 79 | Windows x64 calling convention | Not relevant |
System V AMD64 ABI (The Default for capOS)
On x86_64, the System V AMD64 ABI (CC 0, “C”) is the standard:
- Integer args: RDI, RSI, RDX, RCX, R8, R9
- Float args: XMM0-XMM7
- Return: RAX (integer), XMM0 (float)
- Caller-saved: RAX, RCX, RDX, RSI, RDI, R8-R11, XMM0-XMM15
- Callee-saved: RBX, RBP, R12-R15
- Stack alignment: 16-byte at call site
- Red zone: 128 bytes below RSP (unavailable in kernel mode)
capOS already uses this convention – the syscall handler in
kernel/src/arch/x86_64/syscall.rs maps syscall registers to System V
registers before calling syscall_handler.
Customizing for a New OS Target
For a custom OS, calling convention customization is usually minimal:
-
Kernel code: Disable the red zone (capOS already does this via
x86_64-unknown-nonewhich sets"disable-redzone": true). The red zone is unsafe in interrupt/syscall contexts. -
Userspace code: Standard System V ABI is fine. The red zone is safe in userspace.
-
Syscall convention: This is an OS design choice, not an LLVM CC. capOS uses: RAX=syscall number, RDI-R9=args (matching System V for easy dispatch). Linux uses a slightly different register mapping (R10 instead of RCX for arg4, because SYSCALL clobbers RCX).
-
Interrupt handlers: Use
X86_INTR(CC 83) or manual save/restore. capOS currently uses manual asm stubs.
Cross-Language Interop Implications
| Languages | Convention | Notes |
|---|---|---|
| Rust <-> Rust | Rust ABI (unstable) | Internal to a crate, not stable across crates |
| Rust <-> C | extern "C" (System V) | Stable, well-defined. Used for libcapos API |
| Rust <-> Go | Complex (see Section 6) | Go has its own internal ABI (ABIInternal) |
| C <-> Go | extern "C" via cgo | Go’s cgo bridge, heavy overhead |
| Any <-> Kernel | Syscall convention | Register-based, OS-defined, not a CC |
Key point: The System V AMD64 ABI is the lingua franca. All languages
can produce extern "C" functions. capOS should standardize on System V
for all cross-language boundaries and capability invocations.
Go’s internal ABI (ABIInternal, using R14 as the g register) is different
from System V. Go functions called from outside Go must go through a
trampoline. This is handled by the Go runtime, not something capOS needs
to solve at the LLVM level.
3. Relocations
LLVM Relocation Models
| Model | Flag | Description |
|---|---|---|
| static | -relocation-model=static | All addresses resolved at link time. No GOT/PLT. |
| pic | -relocation-model=pic | Position-independent code. Uses GOT for globals, PLT for calls. |
| dynamic-no-pic | -relocation-model=dynamic-no-pic | Like static but with dynamic linking support (macOS legacy). |
| ropi | -relocation-model=ropi | Read-only position-independent (ARM embedded). |
| rwpi | -relocation-model=rwpi | Read-write position-independent (ARM embedded). |
| ropi-rwpi | -relocation-model=ropi-rwpi | Both ROPI and RWPI (ARM embedded). |
Code Models (x86_64)
| Model | Flag | Address Range | Use Case |
|---|---|---|---|
| small | -code-model=small | 0 to 2GB | Userspace default |
| kernel | -code-model=kernel | Top 2GB (negative 32-bit) | Higher-half kernel |
| medium | -code-model=medium | Code in low 2GB, data anywhere | Large data sets |
| large | -code-model=large | No assumptions | Maximum flexibility, worst performance |
What capOS Currently Uses
From .cargo/config.toml:
[target.x86_64-unknown-none]
rustflags = ["-C", "link-arg=-Tkernel/linker-x86_64.ld", "-C", "code-model=kernel", "-C", "relocation-model=static"]
-
Kernel:
code-model=kernel+relocation-model=static. Correct for a higher-half kernel at0xffffffff80000000. All kernel symbols are in the top 2GB of virtual address space, so 32-bit sign-extended addressing works. -
Init/demos/capos-rt/shell/libcapos/libcapos-posix/capos-wasm userspace: All standalone userspace crates build against
targets/x86_64-unknown-capos.json(checked in at that path) via thebuild-*-caposCargo aliases in.cargo/config.toml. The target setscode-model = "small",relocation-model = "static",os = "capos",has-thread-local = true, andtls-model = "local-exec". The pinned nightly toolchain isnightly-2026-04-20; verify the effective LLVM version withrustc --version --verboseagainst that toolchain date.
Kernel vs. Userspace Requirements
Kernel:
- Static relocations, kernel code model.
- No PIC overhead needed – the kernel is loaded at a known address.
- The linker script places everything in the higher half.
- This is the correct and standard approach (Linux kernel does the same).
Userspace (current – static binaries):
- Static relocations. A future custom userspace target should choose the small code model explicitly.
- Simple, no runtime relocator needed.
- Binary is loaded at a fixed address (
0x200000). - Works perfectly for single-binary-per-address-space.
Userspace (future – if shared libraries or ASLR desired):
- PIE (Position-Independent Executable) = PIC + static linking.
- Requires a dynamic loader or kernel-side relocator.
- Enables ASLR (Address Space Layout Randomization) for security.
- Adds GOT indirection overhead (typically < 5% performance impact).
Position-Independent Code in a Capability Context
PIC/PIE is relevant to capOS for several reasons:
-
ASLR: PIE enables loading binaries at random addresses, making ROP attacks harder. Even in a capability system, defense-in-depth matters.
-
Shared libraries: If capOS ever supports shared objects (e.g., a shared
libcapos.so), PIC is required for the shared library. -
WASI/Wasm: Not relevant – Wasm has its own memory model.
-
Multiple instances: With static linking, two instances of the same binary can share read-only pages (text, rodata) if loaded at the same address. PIC/PIE allows sharing even at different addresses (copy-on-write for the GOT).
Recommendation for capOS: Keep static relocation for now. Consider PIE for userspace when implementing ASLR (after threading and IPC are stable). The kernel should remain static forever.
4. TLS (Thread-Local Storage) Models
LLVM TLS Models
LLVM supports four TLS models, in order from most dynamic to most constrained:
| Model | Description | Runtime Requirement | Performance |
|---|---|---|---|
| general-dynamic | Any module, any time | Full __tls_get_addr via dynamic linker | Slowest (function call per access) |
| local-dynamic | Same module, any time | __tls_get_addr for module base, then offset | Slow (one call per module per thread) |
| initial-exec | Only modules loaded at startup | GOT slot populated by dynamic linker | Fast (one memory load) |
| local-exec | Main executable only | Direct FS/GS offset, known at link time | Fastest (single instruction) |
How TLS Works on x86_64
On x86_64, TLS is accessed via the FS segment register:
- The OS sets the FS base address for each thread (via
MSR_FS_BASEorarch_prctl(ARCH_SET_FS)). - TLS variables are accessed as offsets from FS base:
local-exec:mov %fs:OFFSET, %rax(offset known at link time)initial-exec:mov %fs:0, %rax; mov GOT_OFFSET(%rax), %rcx; mov %fs:(%rcx), %rdxgeneral-dynamic:call __tls_get_addr(returns pointer to TLS block)
Which Model for capOS?
Kernel:
- The kernel does not use compiler TLS. Current TLS support is for loaded userspace ELF images only.
- For SMP: per-CPU data via GS segment register (the standard approach).
Set
MSR_GS_BASEon each CPU to point to aPerCpustruct.swapgson kernel entry switches between user and kernel GS base. - Kernel TLS model: Not applicable (per-CPU data is accessed via GS, not the compiler’s TLS mechanism).
Userspace (static binaries, no dynamic linker):
- local-exec is the only correct choice. There’s no dynamic linker to resolve TLS relocations, so general-dynamic and initial-exec won’t work.
- Implemented for the current single-threaded process model: the ELF parser
records
PT_TLS, the loader maps a Variant II TLS block plus TCB self pointer, and the scheduler saves/restores FS base on context switch. - Implemented for the current execution context:
ThreadControl.setFsBasegives a runtime a capability-authorized equivalent toarch_prctl(ARCH_SET_FS). ThreadControl.setFsBaseaffects only the current thread or execution context. There is no process-global FS-base mutation.- Still missing for future threading and full Go: per-thread TLS state and independently settable FS bases for each user thread.
- Future thread creation must allocate or receive a distinct TLS block and FS
base per
ThreadRef; treating TLS as process-global would break Rust#[thread_local], Gogstate, and any C runtime that assumes per-thread TLS. - Current-process/current-thread FS-base operations are useful for the single-thread runtime checkpoint, but they are not the final threading ABI. True multi-threaded Go or C/POSIX-like runtime support requires per-ThreadRef TLS allocation, per-thread FS-base ownership, and context switches that save/restore FS base as thread state.
Userspace (with dynamic linker, future):
- initial-exec for the main executable and preloaded libraries.
- general-dynamic for
dlopen()-loaded libraries. - Requires implementing
__tls_get_addrin the dynamic linker.
TLS Initialization Sequence
For a statically-linked userspace binary with local-exec TLS:
1. Kernel creates thread
2. Kernel allocates TLS block (size from ELF TLS program header)
3. Kernel copies .tdata (initialized TLS) into TLS block
4. Kernel zeros .tbss (uninitialized TLS) in TLS block
5. Kernel sets FS base = TLS block address (writes MSR_FS_BASE)
6. Thread starts executing; %fs:OFFSET accesses TLS directly
The ELF file contains two TLS sections:
.tdata(PT_TLS segment, initialized thread-local data).tbss(zero-initialized thread-local data, like.bssbut per-thread)
The PT_TLS program header tells the loader:
- Virtual address and file offset of
.tdata p_memsz= total TLS size (including.tbss)p_filesz= size of.tdataonlyp_align= required alignment
FS/GS Base Register Usage Plan
| Register | Used By | Purpose |
|---|---|---|
| FS | Userspace threads | Thread-local storage (set per-thread by kernel) |
| GS | Kernel (via swapgs) | Per-CPU data (set per-CPU during boot) |
This is the standard Linux convention and what Go expects (Go uses
arch_prctl(ARCH_SET_FS) to set the FS base for each OS thread).
What capOS Has and Still Needs
- Implemented: parse
PT_TLSincapos-lib/src/elf.rs. - Implemented: allocate/map a TLS block during process image load in
kernel/src/spawn.rs. - Implemented: copy
.tdata, zero.tbss, and write the TCB self pointer for the current Variant II static TLS layout. - Implemented: save/restore FS base through
kernel/src/sched.rsandkernel/src/arch/x86_64/tls.rs. - Implemented for the current process execution context:
ThreadControl.getFsBaseandThreadControl.setFsBase. - Still needed: per-thread FS-base state for future multi-threaded userspace.
5. Rust Target Specification
How Custom Targets Work
Rust supports custom targets via JSON specification files. The workflow:
- Create a
<target-name>.jsonfile - Pass it to rustc:
--target path/to/x86_64-unknown-capos.json - Use with cargo via
-Zbuild-stdto build core/alloc/std from source
Target lookup priority:
- Built-in target names
- File path (if the target string contains
/or.json) RUST_TARGET_PATHenvironment variable directories
The Rust target JSON schema is explicitly unstable. Generate examples from the
pinned compiler with rustc -Z unstable-options --print target-spec-json and
validate against that same compiler’s target-spec-json-schema before checking
in a target file.
Viewing Existing Specs
# Print the JSON spec for a built-in target:
rustc +nightly -Z unstable-options --target=x86_64-unknown-none --print target-spec-json
# Print the JSON schema for all available fields:
rustc +nightly -Z unstable-options --print target-spec-json-schema
Example: x86_64-unknown-capos Kernel Target
Based on the current x86_64-unknown-none target, with capOS-specific
adjustments. This is a sketch; regenerate from the pinned rustc schema before
using it.
{
"llvm-target": "x86_64-unknown-none-elf",
"metadata": {
"description": "capOS kernel (x86_64)",
"tier": 3,
"host_tools": false,
"std": false
},
"data-layout": "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128",
"arch": "x86_64",
"cpu": "x86-64",
"target-endian": "little",
"target-pointer-width": 64,
"target-c-int-width": 32,
"os": "none",
"env": "",
"vendor": "unknown",
"linker-flavor": "gnu-lld",
"linker": "rust-lld",
"pre-link-args": {
"gnu-lld": ["-Tkernel/linker-x86_64.ld"]
},
"features": "-mmx,-sse,-sse2,-sse3,-ssse3,-sse4.1,-sse4.2,-avx,-avx2,+soft-float",
"disable-redzone": true,
"panic-strategy": "abort",
"code-model": "kernel",
"relocation-model": "static",
"rustc-abi": "softfloat",
"executables": true,
"exe-suffix": "",
"has-thread-local": false,
"position-independent-executables": false,
"static-position-independent-executables": false,
"plt-by-default": false,
"max-atomic-width": 64,
"stack-probes": { "kind": "inline" }
}
Example: x86_64-unknown-capos Userspace Target
{
"llvm-target": "x86_64-unknown-none-elf",
"metadata": {
"description": "capOS userspace (x86_64)",
"tier": 3,
"host_tools": false,
"std": false
},
"data-layout": "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128",
"arch": "x86_64",
"cpu": "x86-64",
"target-endian": "little",
"target-pointer-width": 64,
"target-c-int-width": 32,
"os": "capos",
"env": "",
"vendor": "unknown",
"linker-flavor": "gnu-lld",
"linker": "rust-lld",
"pre-link-args": {
"gnu-lld": ["-Tinit/linker.ld"]
},
"features": "-mmx,-sse,-sse2,-sse3,-ssse3,-sse4.1,-sse4.2,-avx,-avx2,+soft-float",
"disable-redzone": false,
"panic-strategy": "abort",
"code-model": "small",
"relocation-model": "static",
"rustc-abi": "softfloat",
"executables": true,
"exe-suffix": "",
"has-thread-local": true,
"position-independent-executables": false,
"static-position-independent-executables": false,
"max-atomic-width": 64,
"plt-by-default": false,
"stack-probes": { "kind": "inline" },
"tls-model": "local-exec"
}
Key JSON Fields
| Field | Purpose | Typical Values |
|---|---|---|
llvm-target | LLVM triple for code generation | x86_64-unknown-none-elf (reuse existing backend) |
os | OS name (affects cfg(target_os = "...")) | "none", "capos", "linux" |
arch | Architecture name | "x86_64", "aarch64" |
data-layout | LLVM data layout string | Copy from same-arch target |
linker-flavor | Which linker to use | "gnu-lld", "gcc", "msvc" |
linker | Linker binary | "rust-lld", "ld.lld" |
features | CPU features to enable/disable | Disable SIMD/FPU until context switching saves that state |
disable-redzone | Disable System V red zone | true for kernel, false for userspace |
code-model | LLVM code model | "kernel", "small" |
relocation-model | LLVM relocation model | "static", "pic" |
panic-strategy | How to handle panics | "abort", "unwind" |
has-thread-local | Enable #[thread_local] | true for userspace now that PT_TLS/FS base works |
tls-model | Default TLS model | "local-exec" for static binaries |
max-atomic-width | Largest atomic type (bits) | 64 for x86_64 |
pre-link-args | Arguments passed to linker before user args | Linker script path |
position-independent-executables | Generate PIE by default | false for now |
exe-suffix | Executable file extension | "" for ELF |
stack-probes | Stack overflow detection mechanism | {"kind": "inline"} in the current freestanding x86_64 spec |
The SIMD/FPU-disabled userspace target is a temporary runtime constraint, not a
long-term property of x86_64-unknown-capos. It is acceptable only while the
kernel lacks full FPU/SIMD context switching and language runtimes are confined
to the current static no_std subset. Before Go, C, or full Rust std support,
validate the target against each runtime’s amd64 codegen assumptions; mainstream
amd64 runtimes may assume SSE2/FPU state even when application code does not
explicitly use vector types.
Do not let the custom userspace target accidentally ossify a weaker ABI solely because early kernel context switching does not yet save full FPU/SIMD state. The final language-runtime target must be selected after the kernel’s amd64 context-switch state and the runtime’s codegen assumptions are both reviewed.
no_std vs std Support Path
Current state: capOS uses no_std + alloc. This works with any
target, including x86_64-unknown-none.
Path to std support (what Redox, Hermit, and Fuchsia did):
-
Phase 1: Custom target with
os: "capos"(current report). Use-Zbuild-std=core,allocto build core and alloc. No std. -
Phase 2: Add capOS to Rust’s
stdlibrary. This requires:- Adding
mod caposunderlibrary/std/src/sys/with OS-specific implementations of: filesystem, networking, threads, time, stdio, process spawning, etc. - Each of these maps to capOS capabilities
- Use
cfg(target_os = "capos")throughout std - Build with
-Zbuild-std=std
- Adding
-
Phase 3: Upstream the target (optional). Submit the target spec and std implementations to the Rust project. Requires sustained maintenance.
What Redox did: Redox implemented a full POSIX-like userspace (relibc)
and added std support by implementing the sys module in terms of relibc
syscalls. This made Redox a Tier 2 target with pre-built std artifacts.
What Hermit did: Hermit is a unikernel, so std is implemented directly in terms of Hermit’s kernel-level APIs. Tier 3, community maintained.
What Fuchsia did: Fuchsia implemented std using Fuchsia’s native
zircon syscalls (handles, channels, VMOs – similar in spirit to
capabilities). Tier 2.
Recommendation for capOS: Stay on no_std + alloc with the custom
target JSON. std support is a large effort that should wait until the
syscall surface is stable and threading works. When the time comes, Fuchsia’s
approach (std over native capability syscalls) is the best model, since
Fuchsia’s handle-based API is conceptually close to capOS’s capabilities.
Other OS Projects Reference
| OS | Target | Tier | std | Approach |
|---|---|---|---|---|
| Redox | x86_64-unknown-redox | 2 | Yes | relibc (custom libc) over Redox syscalls |
| Hermit | x86_64-unknown-hermit | 3 | Yes | std directly over kernel API |
| Fuchsia | x86_64-unknown-fuchsia | 2 | Yes | std over zircon handles (capability-like) |
| Theseus | x86_64-unknown-theseus | N/A | No | Custom JSON, no_std, research OS |
| Blog OS | Custom JSON | N/A | No | Based on x86_64-unknown-none |
| MOROS | Custom JSON | N/A | No | Simple hobby OS |
6. Go Runtime Requirements
Go’s Runtime Architecture
Go’s runtime is essentially a userspace operating system. It manages goroutine scheduling, garbage collection, memory allocation, and I/O multiplexing. The runtime interfaces with the actual OS through a narrow set of functions that each GOOS must implement.
Minimum OS Interface for a Go Port
Based on analysis of runtime/os_linux.go, runtime/os_plan9.go, and
runtime/os_js.go, here is the minimum interface:
Tier 1: Absolute Minimum (single-threaded, like GOOS=js)
These functions are needed for “Hello, World!”:
func osinit() // OS initialization
func write1(fd uintptr, p unsafe.Pointer, n int32) int32 // stdout/stderr output
func exit(code int32) // process termination
func usleep(usec uint32) // sleep (can be no-op initially)
func readRandom(r []byte) int // random data (for maps, etc.)
func goenvs() // environment variables
func mpreinit(mp *m) // pre-init new M on parent thread
func minit() // init new M on its own thread
func unminit() // undo minit
func mdestroy(mp *m) // destroy M resources
Plus memory management (in runtime/mem_*.go):
func sysAllocOS(n uintptr) unsafe.Pointer // allocate memory (mmap)
func sysFreeOS(v unsafe.Pointer, n uintptr) // free memory (munmap)
func sysReserveOS(v unsafe.Pointer, n uintptr) unsafe.Pointer // reserve VA range
func sysMapOS(v unsafe.Pointer, n uintptr) // commit reserved pages
func sysUsedOS(v unsafe.Pointer, n uintptr) // mark as used
func sysUnusedOS(v unsafe.Pointer, n uintptr) // mark as unused (madvise)
func sysFaultOS(v unsafe.Pointer, n uintptr) // remove access
func sysHugePageOS(v unsafe.Pointer, n uintptr) // hint: use huge pages
Tier 2: Multi-threaded (real goroutines)
func newosproc(mp *m) // create OS thread (clone)
func exitThread(wait *atomic.Uint32) // exit current thread
func futexsleep(addr *uint32, val uint32, ns int64) // futex wait
func futexwakeup(addr *uint32, cnt uint32) // futex wake
func settls() // set FS base for TLS
func nanotime1() int64 // monotonic nanosecond clock
func walltime() (sec int64, nsec int32) // wall clock time
func osyield() // sched_yield
Tier 3: Full Runtime (signals, profiling, network poller)
func sigaction(sig uint32, new *sigactiont, old *sigactiont)
func signalM(mp *m, sig int) // send signal to thread
func setitimer(mode int32, new *itimerval, old *itimerval)
func netpollopen(fd uintptr, pd *pollDesc) uintptr
func netpoll(delta int64) (gList, int32)
func netpollBreak()
Linux Syscalls Used by Go Runtime (Complete List)
From runtime/sys_linux_amd64.s:
| Syscall | # | Go Wrapper | capOS Equivalent |
|---|---|---|---|
read | 0 | runtime.read | Store cap |
write | 1 | runtime.write1 | Console cap |
close | 3 | runtime.closefd | Cap drop |
mmap | 9 | runtime.sysMmap | VirtualMemory cap |
munmap | 11 | runtime.sysMunmap | VirtualMemory.unmap |
brk | 12 | runtime.sbrk0 | VirtualMemory cap |
rt_sigaction | 13 | runtime.rt_sigaction | Signal cap (future) |
rt_sigprocmask | 14 | runtime.rtsigprocmask | Signal cap (future) |
sched_yield | 24 | runtime.osyield | sys_yield |
mincore | 27 | runtime.mincore | VirtualMemory.query |
madvise | 28 | runtime.madvise | Future VirtualMemory decommit/query semantics, or unmap/remap policy |
nanosleep | 35 | runtime.usleep | Timer cap |
setitimer | 38 | runtime.setitimer | Timer cap |
getpid | 39 | runtime.getpid | Process info |
clone | 56 | runtime.clone | Thread cap |
exit | 60 | runtime.exit | sys_exit |
sigaltstack | 131 | runtime.sigaltstack | Not needed initially |
arch_prctl | 158 | runtime.settls | ThreadControl.setFsBase |
gettid | 186 | runtime.gettid | Thread info |
futex | 202 | runtime.futex | ParkSpace compact CAP_OP_PARK / CAP_OP_UNPARK |
sched_getaffinity | 204 | runtime.sched_getaffinity | CPU info |
timer_create | 222 | runtime.timer_create | Timer cap |
timer_settime | 223 | runtime.timer_settime | Timer cap |
timer_delete | 226 | runtime.timer_delete | Timer cap |
clock_gettime | 228 | runtime.nanotime1 | Timer cap |
exit_group | 231 | runtime.exit | sys_exit |
tgkill | 234 | runtime.tgkill | Thread signal (future) |
openat | 257 | runtime.open | Namespace cap |
pipe2 | 293 | runtime.pipe2 | IPC cap |
Go’s TLS Model
Go uses arch_prctl(ARCH_SET_FS, addr) to set the FS segment base for
each OS thread. The convention:
- FS base points to the thread’s
m.tlsarray - Goroutine pointer
gis stored at-8(FS)(ELF TLS convention) - In Go’s ABIInternal, R14 is cached as the
gregister for performance - On signal entry or thread start,
gis loaded from TLS into R14
Go does NOT use the compiler’s TLS mechanisms (no __thread or
thread_local!). It manages TLS entirely in its own runtime via the FS
register.
For capOS, this means the kernel needs:
arch_prctl(ARCH_SET_FS)equivalent capability method- The kernel must save/restore FS base on context switch
- Each thread’s FS base must be independently settable
Adding GOOS=capos to Go
Files that need to be created/modified in a Go fork:
src/runtime/
os_capos.go // osinit, newosproc, futexsleep, etc.
os_capos_amd64.go // arch-specific OS functions
sys_capos_amd64.s // syscall wrappers in assembly
mem_capos.go // sysAlloc/sysFree/etc. over VirtualMemory cap
signal_capos.go // signal stubs (no real signals initially)
stubs_capos.go // misc stubs
netpoll_capos.go // network poller (stub initially)
defs_capos.go // OS-level constants
vdso_capos.go // VDSO stubs (no VDSO)
src/syscall/
syscall_capos.go // Go's syscall package
zsyscall_capos_amd64.go
src/internal/platform/
(modifications to supported.go, zosarch.go)
src/cmd/dist/
(modifications to add capOS to known OS list)
Estimated: ~2000-3000 lines for Phase 1 (single-threaded).
Feasibility Assessment
| Feature | Difficulty | Blocked On |
|---|---|---|
| Hello World (write + exit) | Easy | Console capability plus exit syscall |
| Memory allocator (mmap) | Medium | VirtualMemory capability exists; Go glue and any missing query/decommit semantics remain |
| Single-threaded goroutines (M=1) | Medium | VirtualMemory and Timer capabilities exist; Go runtime glue remains |
| Multi-threaded (real threads) | Hard | capos-rt thread/park clients, Go newosproc and futexsleep/futexwake glue, per-ThreadRef TLS ownership, GC/runtime coordination |
| Network poller | Hard | Async cap invocation, networking stack |
| Signal-based preemption | Hard | Signal delivery mechanism |
| Full stdlib | Very Hard | POSIX layer or native cap wrappers |
7. Relevance to capOS
Practical Scope of Work
Phase 1: Custom Target JSON (done)
What: A targets/x86_64-unknown-capos.json target spec is checked into
the repo. All userspace crates (init, demos, shell, capos-rt, libcapos,
libcapos-posix, capos-wasm) build against it via Cargo aliases in
.cargo/config.toml. The kernel stays on x86_64-unknown-none.
Why: Enables cfg(target_os = "capos"), sets code-model = "small" and
tls-model = "local-exec" explicitly, and removes the dependency on
per-crate rustflag overrides.
Recurring maintenance: Rust target JSON fields are not stable; validate
the checked-in file against rustc -Z unstable-options --print target-spec-json-schema when upgrading the pinned nightly.
Phase 2: TLS Support (mostly landed, required for Go)
What: Parse PT_TLS from ELF, allocate per-thread TLS blocks, set FS base
on context switch, add arch_prctl-equivalent syscall.
Why: Required for Go runtime (Go’s settls() sets FS base), for Rust
#[thread_local] in userspace, and for C’s __thread.
Current state: PT_TLS parsing, static TLS mapping, FS-base context-switch
state, runtime-controlled current FS-base updates, and Rust #[thread_local]
smokes are implemented. Process-local thread lifecycle also exists. Remaining
work is allocating and owning distinct TLS blocks and FS-base state per
ThreadRef for Go’s multi-thread runtime path.
Blockers: per-ThreadRef TLS ownership rules and Go newosproc integration
for the multi-threaded case.
Phase 3: VirtualMemory Capability (implemented baseline, required for Go)
What: Implement the VirtualMemory capability interface. The current schema has map, unmap, and protect; Go may need decommit/query semantics later.
Why: Go’s memory allocator (sysAlloc, sysReserve, sysMap, etc.)
needs mmap-like functionality. This is the single biggest kernel-side
requirement for Go.
Current state: VirtualMemoryCap implements map/unmap/protect over the
existing page-table code with ownership tracking and quota checks. Go-specific
work still has to map runtime sysAlloc/sysReserve/sysMap expectations
onto that interface.
Blockers: None for the baseline capability. Useful Go still needs runtime glue for VirtualMemory/Timer, capos-rt park clients, Go futex glue, Go thread integration, and address-space generation cleanup for reusable private park words outside the landed explicit unmap/decommit paths.
Phase 4: ParkSpace Go Futex Glue (Low-medium effort, required for Go threading)
What: map Go’s futex(WAIT) and futex(WAKE) runtime hooks onto the
implemented ParkSpace compact wait/wake operations.
Why: Go’s runtime synchronization (lock_futex.go) is built on futexes.
The entire goroutine scheduler depends on futex-based sleeping.
Effort: the compact park ABI already exists as CAP_OP_PARK and
CAP_OP_UNPARK; Go futex glue should target that ParkSpace contract instead
of inventing a parallel wait namespace.
Private futex authority and keying rules: use ParkSpace as the normative design. Private futex keys are generation-bearing address-space keys:
#![allow(unused)]
fn main() {
ParkKey::Private {
address_space_id,
address_space_generation,
uaddr,
}
}
WAITvalidates that the address is mapped readable in the caller’s current address space and that the expected value still matches under the same page-table stability rules used for process-buffer validation.- The value check and waiter insertion are one atomic kernel operation with
respect to
WAKE, unmap, process exit, and address-space teardown. WAKEfor a private futex can only wake waiters with the sameaddress_space_idandaddress_space_generation; a raw virtual address is never a cross-process sync key.- Unmap, revoke, or address-space teardown drains or fails waiters for the old key before the virtual address can be reused as unrelated state.
- A future shared-futex design must use
ParkKey::Sharedwithmemory_object_id,memory_object_generation, and aligned object offset, not raw user virtual address.
The authority boundary stays the caller’s ParkSpace capability for private
parks and a future SharedParkSpace for MemoryObject-derived shared parks. Do
not introduce a global futex namespace or a generation-less duplicate key shape.
Blockers: capos-rt park clients, Go futexsleep/futexwake glue, and full
multi-thread runtime integration.
Phase 5: Go Thread Runtime Integration (High effort, required for Go GOMAXPROCS>1)
What: connect Go’s newosproc, TLS ownership, futex glue, and GC
coordination to the implemented process-local thread lifecycle and private
ParkSpace wait/wake substrate.
Why: Go’s newosproc() creates OS threads via clone(). Without real
threads, Go is limited to GOMAXPROCS=1.
Effort: still high, but the kernel substrate is no longer a blank scheduler extension. The remaining work is capos-rt clients, Go runtime glue, per-ThreadRef TLS ownership, and validation under Go’s scheduler.
Blockers: capos-rt thread and park clients, newosproc glue,
futexsleep/futexwake glue, per-ThreadRef TLS ownership rules, GC
coordination across kernel threads, address-space generation cleanup for
reusable private park-word memory outside explicit unmap/decommit paths, and
shared park words for future cross-process futexes. Per-CPU data and SMP are
later blockers for multi-core scaling, not for the first single-CPU Go thread
integration.
Biggest Blockers for Go
In priority order after the 2026-04-24 TLS, VirtualMemory, Timer, ThreadControl, single-thread runtime-checkpoint, process-local thread lifecycle, and private ParkSpace work:
-
Go park/futex glue – Go’s M:N scheduler depends on futex-shaped sleeping/waking. The kernel has private ParkSpace wait/wake; the Go port still needs capos-rt clients and
futexsleep/futexwakeintegration. -
Go thread integration – Required for
GOMAXPROCS > 1. The kernel has process-local thread lifecycle; the Go port still needsnewosproc, per-ThreadRef TLS ownership, and GC coordination across those threads. -
Go runtime port glue – the capOS capability side now has a single-thread checkpoint for VirtualMemory and Timer, but a real Go fork still needs to map
sysAlloc/write1/exit/random/env/time to capOS runtime and capabilities.
Biggest Blockers for C
C is much simpler than Go:
- Linker and toolchain setup – Need a cross-compilation toolchain targeting capOS (Clang with the custom target, or GCC cross-compiler).
libcapos.awith C headers – Rust library withextern "C"API.- musl integration (optional) – For full libc, replace musl’s
__syscall()with capability invocations.
Recommended Implementation Order
1. Custom userspace target JSON [done: targets/x86_64-unknown-capos.json]
|
2. VirtualMemory capability [done: baseline map/unmap/protect]
|
3. TLS support (PT_TLS, FS base) [done: static ELF + ThreadControl]
|
4. ParkSpace compact wait/wake [done: private path; clients open]
|
5. Timer capability (monotonic clock) [done: monotonic now/sleep]
|
6. Go Phase 1: minimal GOOS=capos [checkpoint done; Go fork remains]
|
7. Kernel threading for Go runtime [partial thread lifecycle; Go integration open]
|
8. Go Phase 2: multi-threaded [GOMAXPROCS>1, concurrent GC]
|
9. C toolchain + libcapos [parallel with Go work]
|
10. Go Phase 3: network poller [depends on networking stack]
Steps 1-5 are kernel prerequisites. Step 6 is the Go fork. Steps 7-10 are incremental improvements that can proceed in parallel.
Key Architectural Decisions for capOS
-
Keep
x86_64-unknown-nonefor kernel,x86_64-unknown-caposfor userspace. The kernel does not benefit from a custom OS target (it’s freestanding). Userspace benefits fromcfg(target_os = "capos"). -
Use local-exec TLS model for static binaries. No dynamic linker means no general-dynamic or initial-exec TLS. local-exec is zero-overhead.
-
Implement FS base save/restore early. Both Go and Rust
#[thread_local]need it. It’s a small addition to context switch code. -
VirtualMemory cap stays on the Go critical path. The baseline exists; the Go port still needs exact runtime allocator semantics and any missing query/decommit behavior.
-
Futex is the synchronization primitive. Both Go and any future pthreads implementation need futex-shaped wait/wake. The capOS authority surface is
ParkSpace, using compactCAP_OP_PARK/CAP_OP_UNPARKtransport rather than generic Cap’n Proto method dispatch on the hot path. -
Signals can be deferred. Go can start with cooperative-only preemption (no
SIGURG). Signal delivery is complex and can come much later.
Used By
- Go Runtime for the native
GOOS=caposruntime plan. - Go VirtualMemory Contract for
the
sysReserve/sysMap/sysUnusedallocator contract. - Userspace Runtime for the
capos-rthooks a language runtime calls.
Research: Linux Sandboxes And Virtualization For Workloads
capOS needs a credible way to run Linux-native software before every useful application, language runtime, package manager, development workflow, and desktop or server tool has a native capOS port. Users may want a familiar Linux environment. Agents may need a bounded place to run build systems, interpreters, package managers, browsers, command-line tools, scientific software, or model-generated code. Operators may need a compatibility bridge while capOS-native services are still emerging.
This note separates the available Linux isolation choices and records how they should map to generic capOS capability services. Scientific tooling is one important consumer of this substrate, but the substrate itself should be a general Linux workload sandbox.
The important distinction is between compatibility wrappers and isolation boundaries. Namespaces, cgroups, seccomp, Landlock, User-Mode Linux, containers, gVisor, and KVM microVMs all run “Linux things”, but they do not provide the same boundary, timing behavior, device model, or operational cost.
Source Baseline
External sources checked:
- Linux kernel documentation, Namespaces
- Linux kernel documentation, Control Group v2
- Linux kernel documentation, Seccomp BPF
- Linux kernel documentation, Landlock unprivileged access control
- Linux kernel documentation, User Mode Linux HOWTO
- Linux kernel documentation, CPU Isolation
- Linux kernel documentation, Housekeeping
- Linux kernel documentation, KVM halt polling
- Linux kernel documentation, Guest halt polling
- Open Container Initiative, Runtime Specification
- Open Container Initiative, Image Specification
- bubblewrap: https://github.com/containers/bubblewrap
- nsjail: https://github.com/google/nsjail
- systemd-nspawn: https://www.freedesktop.org/software/systemd/man/systemd-nspawn.html
- gVisor: https://gvisor.dev/docs/
- QEMU: https://qemu-project.gitlab.io/qemu/about/index.html
- Firecracker: https://github.com/firecracker-microvm/firecracker
- Cloud Hypervisor: https://www.cloudhypervisor.org/docs/prologue/introduction/
- Kata Containers virtualization design: https://github.com/kata-containers/kata-containers/blob/main/docs/design/virtualization.md
Local grounding:
- Userspace Binaries
- Storage and Naming
- Browser Capability and Agent Web Sessions
- Scientific Agent-Lab Software Stack
- NO_HZ, SQPOLL, and Realtime Scheduling
- Tickless and Realtime Scheduling
- System Performance Benchmarks
- HPC Parallel Processing Patterns
Isolation Layers
Namespaces, cgroups, seccomp, And Landlock
The basic Linux sandbox stack is:
- namespaces for separate views of process ids, mounts, users, networks, IPC, UTS names, time, and related global resources;
- cgroup v2 for resource accounting, placement, and limits;
- seccomp-BPF for syscall filtering;
- Landlock for unprivileged filesystem access restriction;
- rlimits and ordinary Unix credentials for process-local bounds.
This stack is useful for trusted or semi-trusted tools that need quick
startup and native Linux performance. It is not a hard boundary against all
kernel attack surface: a namespaced process still talks to the host Linux
kernel through syscalls, page faults, filesystem code, networking, and device
interfaces. For capOS, a namespace/cgroup/seccomp/Landlock sandbox is a good
early backend for trusted batch tools, shell commands, build steps,
formatters, language package commands, and scientific-base tools such as
PARI/GP, Z3, cvc5, HiGHS, or Lean when the tools and inputs are trusted by the
same operator.
The capOS wrapper should generate the sandbox policy from capability grants: read-only input directories, a scratch/output directory, optional loopback or egress network, CPU/memory/pids/io quotas, and a syscall profile. The policy is an implementation detail; the capOS-visible object is still a typed command, job, shell, build, solver, proof, CAS, notebook, or application capability.
bubblewrap, nsjail, systemd-nspawn, And OCI Runtimes
bubblewrap is a low-level unprivileged sandboxing tool used by Flatpak-style systems. It is appropriate for single-process or small interactive tools where the desired policy is mostly mount and namespace shaping.
nsjail combines namespaces, cgroups, rlimits, and seccomp-BPF policies with a compact configuration format. It is a strong fit for early batch jobs, command-wrapper services, solver/proof-checker tasks, package commands, and agent tool calls because it already models the same inputs capOS cares about: uid/gid, chroot/root, mounts, network mode, time limits, memory limits, cgroups, and syscall policy.
systemd-nspawn is better for booting or debugging a full Linux userspace tree than for narrow per-tool sandboxing. It is useful for stateful development images and package-build roots, but it should not be the default tool executor because its shape encourages broad OS-in-container authority.
OCI runtimes and images are valuable for supply-chain compatibility. capOS should be able to import OCI image metadata and run image contents through a chosen sandbox backend, but it should not treat “OCI container” as a security claim. The security claim depends on the runtime and host policy.
User-Mode Linux
User-Mode Linux is a Linux kernel port that runs as a normal Linux process and talks to the host kernel instead of hardware. It is useful as a compatibility, debugging, and low-privilege Linux-kernel experiment path. It can contain a guest Linux userspace without requiring hardware virtualization.
UML is not the same category as a hardware-backed Linux guest. It does not give
the same boundary as KVM/microVM execution because the UML kernel and guest
work ultimately run as host Linux processes and depend heavily on the host
kernel surface. For capOS Linux workload execution, UML can be a convenient
developer backend when /dev/kvm is unavailable, but it should not be the
default answer for untrusted multi-tenant sessions, model-generated code,
networked tools, or package-build execution.
gVisor
gVisor moves many host-kernel-facing interfaces into a per-sandbox application
kernel and exposes an OCI runtime, runsc. This is an attractive middle tier:
it keeps container-like resource behavior and tooling while reducing direct
host kernel exposure for many syscalls.
The tradeoff is compatibility and performance. General Linux workloads can exercise native runtimes, dynamic loaders, filesystems, signals, threading, shared memory, networking, debuggers, browser sandboxes, package managers, and sometimes GPU/device paths. gVisor should be treated as a backend to test per workload class, not assumed compatible with every developer tool, package manager, browser, desktop app, scientific stack, proof assistant, or solver.
Hardware-Backed Linux Guests
For stronger isolation, use a Linux guest under hardware virtualization: QEMU/KVM, Firecracker, Cloud Hypervisor, or Kata Containers.
QEMU/KVM is the broadest compatibility target. It can run a full Linux guest with familiar device models, disks, networking, and debugging hooks. It is the right default for compatibility breadth, reproducibility, and complex package systems that expect a normal Linux distribution.
Firecracker is a narrow microVM monitor designed for serverless-style workloads. Its reduced device model and operational focus are attractive for batch jobs, command execution workers, stateless build/test workers, solver workers, and proof-check workers where the rootfs, network, block devices, and API surface can be kept small.
Kata Containers runs container workloads inside lightweight VMs and integrates
with container orchestration. It is a good reference for mapping container
workload semantics onto VM isolation. capOS does not need to import the full
Kubernetes/Kata stack, but the pod-as-VM-sandbox idea maps well to an
LinuxWorkloadVm, AgentJobVm, or other specialized Linux workload service.
Hardware-backed Linux guests should be the default for:
- untrusted interactive Linux shells or familiar Linux workspaces;
- untrusted notebook execution;
- model-generated code that may exploit native extensions;
- package builds from untrusted recipes;
- network-enabled data processing;
- multi-tenant hosted agent jobs;
- browser, GUI, or desktop-like Linux application sessions;
- workflows that need a full Linux distribution but should not share the host kernel attack surface.
Dedicated Host Isolation
VM and microVM boundaries reduce direct host-kernel sharing, but they do not remove every shared-hardware or operator-domain risk. Dedicated hosts, single-tenant nodes, or separately owned external hardware are appropriate when the workload has unusually high tenant risk, handles sensitive data, requires GPU or device passthrough, runs long-lived browser/GUI sessions with large attack surface, or must limit the blast radius of a VMM, firmware, driver, or VM-escape failure.
Dedicated hardware should be modeled as a deployment and tenancy property,
not as a different Linux API. A QemuKvmVm or FirecrackerMicroVm running on
a single-tenant host still exposes the same guest workload interface, but its
security and scheduler evidence should record that the host was not shared
with unrelated tenants. Conversely, a hardware-backed guest on a shared host is
still a VM boundary, but it is not the strongest isolation class capOS can
offer.
Virtualized Workloads And capOS Auto Full-NOHZ
For capOS scheduling design, Linux sandboxes are modeled as host-visible workloads when making native Tickless and Realtime Scheduling decisions. VMs, microVMs, UML processes, gVisor sandboxes, external sidecars, and VMM helper threads affect capOS through the host-visible set of runnable work, timers, IRQs, polling loops, and housekeeping obligations.
For capOS-native auto full-nohz scheduling:
- capOS policy applies to the outer capOS-scheduled entity: VMM processes, vCPU threads, I/O helper threads, proxy processes, and native capOS services.
- Guest Linux scheduler state is opaque. Guest
CONFIG_NO_HZ_IDLE,nohz_full, cpuidle, and halt-poll settings may be recorded for diagnostics or benchmark interpretation, but they do not grant capOS CPU-isolation authority. - Ordinary Linux sandboxes should run as ordinary scheduled workloads unless the capOS-visible outer backend receives an explicit low-noise placement lease.
- A sandbox descriptor must not set capOS auto full-nohz, CPU isolation, or exclusive CPU placement by itself. Those are scheduler-authority decisions with global cost.
Idle behavior still needs backend research because it determines whether an
“idle” guest is actually idle from the host scheduler’s perspective. Linux
CONFIG_NO_HZ_IDLE stops the guest scheduling-clock tick when a guest CPU is
idle, which reduces guest-generated timer interrupts and vCPU wakeups. That
does not enable capOS host tick suppression by itself. It only helps by making
the VMM’s host-visible vCPU thread block more often and wake less often.
KVM prior art shows the boundary clearly. When a guest vCPU halts, the host may block the vCPU thread or poll briefly for a wakeup. Host-side KVM halt polling trades latency for CPU use, and large polling intervals can turn idle guest time into host kernel time. Guest-side halt polling makes the guest vCPU poll before halting and can run even when other host tasks are runnable. A capOS backend intended for low-noise placement therefore needs explicit accounting for VMM/vCPU polling, helper threads, virtio event loops, host timers, and IRQ placement.
The validation target is backend quietness, not Linux nohz integration:
- idle vCPUs should block or halt instead of forcing periodic outer work;
- one-shot guest timer deadlines should wake the vCPU correctly without a host periodic tick dependency;
- VMM helper threads, block/network event loops, and virtio queues should be visible to capOS placement and accounting;
- halt-polling or busy guest kernel threads should make the outer workload ineligible for low-noise placement rather than silently degrading a capOS scheduler claim;
- benchmark reports should distinguish guest Linux tickless state from capOS outer scheduler state.
capOS Linux Workload Service Model
The capOS-visible service should hide the backend without hiding the security claim:
LinuxWorkloadSandbox {
backend: NamespaceSandbox | GVisor | UserModeLinux | QemuKvmVm |
FirecrackerMicroVm | KataVm | NativeCapos;
isolationClass: Compatibility | ProcessSandbox | SyscallSandbox |
ApplicationKernel | HardwareVm | DedicatedHost;
deployment: ExternalLinuxHost | CaposScheduledProxy |
CaposScheduledVmm | DedicatedExternalHost | NativeCapos;
workloadClass: InteractiveShell | BatchCommand | BuildJob |
PackageInstall | BrowserBackend | Notebook |
ScientificJob | AgentTool | ServiceDaemon;
trustClass: SameOperator | UntrustedCode | MultiTenant | FamiliarWorkspace;
placement: Ordinary | AutoNoHzEligible | CpuIsolationLease;
packageClosure: PackageClosureId;
inputCaps: ArtifactId[] | NamespaceGrant[];
outputCaps: ArtifactSinkId[] | NamespaceGrant[];
networkPolicy: None | Loopback | BrokeredEgress;
resourceEnvelope: CpuMemoryIoPidGpuLimits;
auditPolicy: ProvenanceRequired;
}
The wrapper should record:
- backend and version;
- kernel, rootfs, image, and package closure hashes;
- seccomp/Landlock/cgroup/namespace policy or VM device model;
- deployment location, distinguishing external Linux-host policy from capOS-scheduled proxy/VMM/native state;
- CPU affinity, cgroup CPU quota or VM vCPU placement, capOS
NoHzEligibility/NoHzActivationstate, and outer housekeeping CPU set when the workload is capOS-scheduled; - external host CPU/isolation/nohz metadata when the workload runs outside capOS, recorded as host evidence rather than capOS scheduler proof;
- guest tickless/nohz state when a Linux guest is used, recorded separately from the capOS outer scheduler state;
- network and block-device grants;
- input and output artifact ids;
- exit status, signal, timeout, OOM, or backend failure.
Recommendation
Use a tiered sidecar strategy:
- Namespace sandbox tier. Use nsjail or bubblewrap for trusted
commands, package steps, build/test tools, and
scientific-basebatch tools, with cgroup v2 quotas, seccomp, Landlock where available, read-only inputs, and immutable output capture. - gVisor tier. Test high-risk but container-compatible Linux workloads where syscall mediation is useful and full VM overhead is not justified.
- Hardware VM tier. Use QEMU/KVM for broad compatibility and Firecracker or Kata-style microVMs for repeated batch jobs. This is the default for untrusted familiar Linux workspaces, notebooks, model-generated code, package builds, networked tools, and multi-tenant agent work.
- Dedicated host tier. Use single-tenant nodes or separately owned external hosts for high-risk tenants, sensitive data, GPU/device passthrough, long-lived browser/GUI workloads, side-channel-sensitive jobs, and cases where VM escape or VMM compromise must have a smaller blast radius.
- UML tier. Keep User-Mode Linux as a developer/debug/compatibility fallback when KVM is unavailable, not as the primary strong-isolation backend.
- Native capOS tier. Migrate stable, small, well-understood services into native capOS userspace after the capability interfaces are proven.
The first serious hardware-backed proof should run a Linux guest workload under QEMU/KVM, expose a narrow Cap’n Proto capability proxy to capOS, and execute a mix of familiar Linux commands plus one or two specialized workloads with artifact capture. Good first cases are a shell/build job, a package-manager or compiler invocation, and a scientific batch job such as PARI/GP, Z3/cvc5, HiGHS, or Lean. A later Firecracker proof can optimize startup and attack surface for stateless command, solver, proof-check, and agent-tool workers.
For browser use, this service is only a possible backend behind the BrowserSession capability. It must not expose a parallel browser authority model: origins, profiles, downloads, uploads, automation, and audit still belong to the browser capability surface, even if the actual browser engine runs in a Linux sandbox or hardware-backed Linux guest.
Research: Out-of-Kernel Scheduling
Survey of whether capOS can move CPU scheduler implementation out of the kernel, which parts are normally kept privileged, and which policy has been moved to user-space services or loadable policy modules in prior systems.
Scope
“User-space scheduler” is an overloaded term. The question here is narrower than language/runtime scheduling: can the OS CPU scheduler itself be moved out of the kernel?
This report separates the relevant models:
| Model | Schedules | Kernel sees | Examples |
|---|---|---|---|
| User-controlled kernel scheduling | Kernel threads / scheduling contexts | Privileged mechanism plus user policy inputs | L4 user-level scheduling, seL4 MCS, ARINC partition schedulers on seL4 |
| Dynamic in-kernel policy | Kernel threads | Policy loaded from user space but executed in kernel | Linux sched_ext, Ekiben, Bossa |
| Whole-machine core arbitration | Cores granted to applications/runtimes | Kernel threads pinned, parked, or revoked | Arachne, Shenango, Caladan |
| In-process M:N runtime | Goroutines, virtual threads, fibers, async tasks | A smaller set of OS threads | Go, Java virtual threads, Erlang, Tokio |
| User-level thread package | User-level threads or tasklets | One or more kernel execution contexts | Capriccio, Argobots |
| Kernel-assisted two-level runtime scheduling | User threads plus kernel events | Virtual processors / activations | Scheduler activations, Windows UMS |
The common boundary in prior systems is: the kernel allocates protected execution resources, handles blocking and preemption, and enforces isolation. User space supplies domain policy: which goroutine, actor, task, request, or coroutine runs next.
Feasibility Assessment
Moving the entire scheduler out of the kernel is not viable in a protected, preemptive system if “scheduler” means the code that runs on timer interrupts, chooses an immediately runnable kernel thread, saves/restores CPU state, changes page tables, updates per-CPU state, and enforces CPU-time isolation. That mechanism is part of the CPU protection boundary.
Moving scheduler policy out of the kernel is viable. A capOS-like kernel can act as a small CPU driver that enforces runnable-state invariants, capability-authorized scheduling contexts, budgets, priorities, CPU affinity, timeout faults, and IPC donation. A privileged user-space scheduler service can own admission control, budgets, priorities, placement, CPU partitioning, and service-specific policy.
The design point supported by the surveyed systems is not “no scheduler in kernel.” It is “minimal kernel dispatch and enforcement, user-space policy.”
Executive Conclusions
- The next-thread dispatch path is normally kept in kernel mode. It runs when the current user process may be untrusted, blocked, faulting, or out of budget.
- User space can own policy if the kernel exposes scheduling contexts as capability-controlled CPU-time objects. Thread creation and thread handles should follow the same capability-first model.
- Consulting a user-space scheduler server on every timer tick adds context switches to the hottest path and creates a bootstrap problem when the scheduler server itself is not runnable.
- seL4 MCS is the most directly comparable model: scheduling contexts are explicit objects, budgets are enforced by the kernel, and passive servers can run on caller-donated scheduling contexts.
- L4 user-level scheduling experiments show that user-directed scheduling is possible, with reported overhead from 0 to 10 percent compared with a pure in-kernel scheduler for their workload. That is plausible for policy changes, not for every dispatch decision.
- seL4 user-mode partition schedulers show the downside: a prototype partitioned scheduler measured substantial overhead because each scheduling event crosses the user/kernel boundary.
- sched_ext and Ekiben are useful evidence for pluggable scheduler policy, but they still execute policy in or near the kernel. They do not prove that the dispatch mechanism can be a normal user process.
- Whole-machine core arbiters such as Arachne, Shenango, and Caladan support a different split: the kernel still schedules threads, while a privileged control plane grants, revokes, and places cores at coarser granularity.
- Direct-switch IPC and scheduling-context donation reduce the priority inversion and dispatch-overhead risks that appear when capability servers are scheduled only by per-process priorities.
- Pure M:1 user-level threads are insufficient for capOS as the only threading story. They are fast, but one blocking syscall, page fault wait, or long CPU loop can stall unrelated user threads unless every blocking operation is converted to async form.
- M:N runtimes need a small OS contract: capability-created kernel threads, TLS/FS-base state, capability-authorized futex-style wait/wake, monotonic timers, async I/O/event notification, and a way to detect or avoid kernel blocking.
- Scheduler activations solved the right conceptual problem but exposed a complicated upcall contract. A capability OS can get most of the benefit with simpler primitives: async capability rings, notification objects, futexes, and explicit thread objects.
- Work-stealing with per-worker local queues is the dominant general-purpose runtime design. It gives locality and scale, but it needs explicit fairness guards and I/O polling integration.
- SQPOLL-style polling is a scheduling decision. It trades a core for lower submission latency and depends on SMP plus explicit CPU ownership. Full-nohz for that poller should be treated as a CPU-isolation lease with housekeeping and accounting constraints, not as an automatic timer optimization; see NO_HZ, SQPOLL, and realtime scheduling.
- A generic language scheduler in the kernel is a separate design from out-of-kernel CPU policy. Go, Rust async, actor runtimes, and POSIX layers need kernel mechanisms that let them implement their own policy.
Privileged Mechanisms
The following responsibilities are mechanism, not policy. Moving them to a normal user process either breaks protection or puts a user/kernel round trip on the critical path:
- Save and restore CPU register context.
- Switch page tables / address spaces.
- Update per-CPU current-thread state, kernel stack, TSS/RSP0, and syscall stack state.
- Handle timer interrupts and IPIs.
- Maintain a safe runnable/blocked/exited state machine.
- Enforce CPU budgets and preempt a thread that exceeds its budget.
- Choose an emergency runnable thread when the policy owner is dead, blocked, or malicious.
- Run idle and halt safely when no runnable work exists.
- Integrate scheduling with blocking syscalls, page faults, futex waits, and IPC wakeups.
- Preserve invariants under SMP races.
These are exactly the parts currently concentrated in
kernel/src/sched.rs and the x86 context-switch path. They can be simplified and made more generic,
but they remain required somewhere privileged.
Policy Surface
The following are policy examples that can be owned by a privileged user-space service once scheduling contexts exist:
- Admission control: which process/thread is allowed to consume CPU time.
- Priority assignment and dynamic priority changes.
- Budget/period selection for temporal isolation.
- CPU affinity and CPU partitioning decisions.
- Core grants for SQPOLL, device polling, network stacks, and latency-sensitive services.
- Overload handling policy.
- Per-service or per-tenant fair-share policy.
- Instrumentation-driven tuning.
- Runtime-specific hints, such as “latency-sensitive”, “batch”, “driver”, or “poller”.
This split gives a capOS-like system policy freedom while preserving a small, auditable kernel CPU mechanism.
Viable Architectures
1. Minimal Kernel Scheduler Plus User Policy Service
This is one capOS-compatible design point.
The kernel implements:
- Thread states and per-CPU run queues.
- Priority/budget-aware dispatch.
- Scheduling-context objects.
- Timer-driven budget accounting.
- Timeout faults or notifications.
- Capability-checked operations to bind/unbind scheduling contexts to threads.
- Emergency fallback policy.
A user-space sched service implements:
- System policy loaded from the boot manifest.
- Resource partitioning between services.
- Priority/budget updates.
- CPU pinning and SQPOLL grants.
- Diagnostics and policy reload.
The policy service is invoked on configuration changes and timeout faults, not on every context switch.
2. seL4-MCS-Style Scheduling Contexts
seL4 MCS makes CPU time a first-class kernel object. A thread needs a scheduling context to run. A scheduling context carries budget, period, and priority. The kernel enforces the budget with a sporadic-server model. Passive servers can block without their own scheduling context; callers donate their scheduling context through synchronous IPC, and the context returns on reply.
This maps directly to capOS:
SchedContext {
budget_ns
period_ns
priority
cpu_mask
remaining_budget
timeout_endpoint
}
Kernel responsibilities:
- Enforce budget and period.
- Dispatch runnable threads with eligible scheduling contexts.
- Donate and return contexts across direct-switch IPC.
- Notify user space on timeout or depletion.
User-space responsibilities:
- Create and distribute scheduling-context capabilities.
- Decide budgets and priorities.
- Build passive service topologies.
- React to timeout faults.
This moves scheduling policy out without moving the hot dispatch mechanism out.
3. Hierarchical User-Level Scheduler
L4 research evaluated exporting scheduling to user level through a hierarchical user-level scheduling architecture. The reported application overhead was 0 to 10 percent compared with a pure in-kernel scheduler in their evaluation, and the design enabled user-directed scheduling.
This is possible, but the cost model is sensitive:
- Every policy decision that requires a scheduler-server round trip is expensive.
- The scheduler server needs guaranteed CPU time, or the system can deadlock.
- Faults and interrupts still need kernel fallback.
- SMP multiplies races around run queues, CPU ownership, and migration.
This architecture is viable for coarse-grained partition scheduling, VM scheduling, or policy control. As a first general dispatch path, it has higher latency and bootstrap risk than an in-kernel dispatcher.
4. Dynamic In-Kernel Policy
Linux sched_ext lets user space load BPF scheduler programs, but the policy runs inside the kernel scheduler framework. The kernel preserves integrity by falling back to the fair scheduler if the BPF scheduler errors or stalls runnable tasks. Ekiben similarly targets high-velocity Linux scheduler development with safe Rust policies, live upgrade, and userspace debugging.
This model is a later-stage option for dynamic scheduler experiments, but it is not “scheduler in user space.” It also adds verifier/runtime complexity.
5. Core Arbiter / Resource Manager
Arachne, Shenango, and Caladan move high-level core allocation decisions out of the ordinary kernel scheduler path. Applications or runtimes know which cores they own, while an arbiter grants and revokes cores based on load or interference.
This model is useful for capOS after SMP:
- grant cores to NIC drivers, network stacks, or SQPOLL workers;
- revoke poller cores under CPU pressure;
- isolate latency-sensitive services from batch work;
- expose CPU ownership through capabilities.
It does not remove the kernel dispatcher. It changes the granularity of policy from “which thread next” to “which service owns this CPU budget.”
Classic Problem: Kernel Threads vs User Threads
The scheduler activations paper is still the cleanest statement of the core problem: kernel threads have integration with blocking and preemption, while user-level threads have cheaper context switching and better policy control. The failure mode of user-level threads layered naively on kernel threads is that kernel events are hidden from the runtime. A kernel thread can block in the kernel while runnable user threads exist, and the kernel can preempt a kernel thread without telling the runtime which user thread was stopped.
Scheduler activations address this by giving each address space a “virtual multiprocessor.” The kernel allocates processors to address spaces and vectors events to the user scheduler when processors are added, preempted, blocked, or unblocked. The activation is both an execution context and a notification vehicle.
The lesson for capOS is not to copy the full activation API. The durable idea is the split:
- Kernel owns physical CPU allocation, protection, preemption, and blocking.
- Runtime owns which application-level work item runs on a granted execution context.
- Kernel-visible blocking must create a runtime-visible event, or it must be avoided by making the operation async.
For capOS, async capability rings already avoid many blocking syscalls. The remaining hard cases are futex waits, page faults that require I/O, synchronous IPC, and preemption of long-running runtime tasks.
Runtime Schedulers in Practice
Go
Go uses an M:N scheduler with three central concepts:
- G: goroutine.
- M: worker thread.
- P: processor token required to execute Go code.
The Go runtime distributes runnable goroutines over worker threads, keeps per-P queues for scalability, uses global queues and netpoller integration for fairness and I/O, and parks/unparks OS threads conservatively to avoid wasting CPU. Its own source comments call out why centralized state and direct handoff were rejected: centralization hurts scalability, while eager handoff hurts locality and causes thread churn.
Preemption is mixed. Go has synchronous safe points and asynchronous preemption using OS mechanisms such as signals. The runtime can only safely stop a goroutine at points where stack and register state can be scanned.
Implications for capOS:
- Initial
GOOS=caposcan run withGOMAXPROCS=1and cooperative preemption, but useful Go requires kernel threads, futexes, FS-base/TLS, a monotonic timer, and an async network poller. - A signal clone is not strictly required if capOS provides a runtime-visible timer/preemption notification and the Go port accepts cooperative-first behavior.
- The kernel must schedule threads, not processes, before Go can use multiple cores.
Java Virtual Threads
JDK virtual threads use M:N scheduling: many virtual threads are mounted on a
smaller number of platform threads. The default scheduler is a FIFO-mode
work-stealing ForkJoinPool; the platform thread currently carrying a virtual
thread is called its carrier.
The design is intentionally not pure cooperative scheduling from the application’s perspective: most JDK blocking operations unmount the virtual thread, freeing the carrier. But some operations pin the virtual thread to the carrier, notably native calls and some synchronized regions. The JEP also notes that the scheduler does not currently implement CPU time-sharing for virtual threads.
Implications for capOS:
- “Blocking” compatibility requires library/runtime cooperation, not just a scheduler. The runtime needs blocking operations to yield carriers.
- Native calls and pinned regions remain a general M:N hazard. capOS cannot make that disappear in the kernel.
Tokio and Rust Async Executors
Tokio represents the async executor model rather than stackful green threads.
Tasks run until they return Poll::Pending, so fairness depends on cooperative
yield points and wakeups. Tokio’s multi-thread scheduler uses one global queue,
per-worker local queues, work stealing, an event interval for I/O/timer checks,
and a LIFO slot optimization for locality.
Implications for capOS:
- A
capos-rtasync executor can integrate capability-ring completions, notification objects, and timers as wake sources. - A cooperative budget is mandatory. A future that never awaits can monopolize a worker until kernel preemption takes the whole OS thread away.
- A single global CQ per process can become an executor bottleneck if many worker threads consume completions. Per-thread or sharded wake queues are likely needed after SMP.
Erlang/BEAM
BEAM schedulers run lightweight Erlang processes on scheduler threads. The runtime exposes scheduler count and binding controls, and Erlang processes are preempted by reductions rather than OS timer slices. This shows a different point in the design space: the language VM owns fairness because it controls execution of bytecode.
Implications for capOS:
- Managed runtimes can implement stronger fairness than native async libraries because they control instruction dispatch or compiler-inserted safe points.
- Native Rust/C userspace cannot rely on that unless the compiler/runtime inserts yield or safe-point checks.
Capriccio and Argobots
Capriccio showed that a user-level thread package can scale to very high concurrency by combining cooperative user-level threads, asynchronous I/O, O(1) thread operations, linked stacks, and resource-aware scheduling. The important lesson is that the thread abstraction can survive high concurrency when the runtime controls stacks and blocking.
Argobots generalizes lightweight execution units into user-level threads and tasklets over execution streams. It is designed as a substrate for higher-level systems such as OpenMP and MPI, with customizable schedulers. This is directly relevant to capOS because it argues for low-level runtime mechanisms, not one global scheduling policy.
Lithe
Lithe targets composition of parallel libraries. Its thesis is that a universal task abstraction or one global scheduler does not compose well when multiple parallel libraries are nested. Instead, physical hardware threads are shared through an explicit resource interface, while each library keeps its own task representation and scheduling policy.
Implications for capOS:
- Avoid oversubscription by making CPU grants visible to user space.
- A future
CpuSetor scheduling-context capability could let runtimes know how much parallelism they are actually allowed to use. - Nested runtimes benefit from the ability to donate or yield execution resources without going through a process-global policy singleton.
Kernel Interfaces That Matter
Futexes
Futexes are the standard split-lock design: user space does the uncontended fast path with atomics, and the kernel only participates to sleep or wake threads. Linux also has priority-inheritance futex operations for cases where the kernel must manage lock-owner priority propagation.
For capOS:
- Implement futex as a capability-authorized primitive. Do not assume generic Cap’n Proto method encoding is acceptable for the hot path; measure it against a compact operation before fixing the ABI.
- Key futex wait queues by
(address_space, user_virtual_address)for private futexes. Shared-memory futexes eventually need a memory-object identity plus offset. - Support timeout against monotonic time first. Requeue and PI futexes can wait.
Restartable Sequences
Linux rseq lets user space maintain per-CPU data without heavyweight atomics and lets a thread cheaply read its current CPU/node. The current kernel docs also describe scheduler time-slice extensions for short critical sections.
For capOS:
- rseq-style current-CPU access becomes useful after SMP and per-CPU run queues.
- It is not a first threading prerequisite. Futex, TLS, and kernel threads come first.
- If added, expose a small per-thread ABI page with
cpu_id,node_id, and an abort-on-migration critical-section protocol.
io_uring SQPOLL
SQPOLL moves submission from syscall-driven to polling-driven. A kernel thread polls the submission queue and submits work as soon as userspace publishes SQEs. This reduces submission latency and syscall overhead for sustained I/O, but it burns CPU and needs careful affinity.
capOS already has an io_uring-inspired capability ring, so the analogy is direct:
- Current tick-driven ring processing is correct for a toy system but couples invocation latency to timer frequency.
- A kernel-side SQ polling thread interacts badly with single-CPU systems. On a single CPU it competes with the application it is supposed to accelerate.
- Make SQPOLL a scheduling/capability decision: the process donates or is granted a CPU budget for the poller.
- Completion handling remains a separate problem. A runtime still needs to poll CQs or block on notifications.
sched_ext
Linux sched_ext is not a normal user-level thread scheduler. It is a scheduler class whose behavior is defined by BPF programs loaded from user space. The kernel docs emphasize that sched_ext can be enabled and disabled dynamically, can group CPUs freely, and falls back to the default scheduler if the BPF scheduler misbehaves. The docs also warn that the scheduler API has no stability guarantee.
For capOS:
- The relevant idea is safe, dynamically replaceable policy with kernel integrity fallback.
- Copying the BPF ABI is not required. capOS can get a smaller version through privileged scheduler-policy capabilities later.
- Keep early scheduling policy in kernel Rust until the invariants are clear.
Whole-Machine User-Space/Core Schedulers
Arachne
Arachne is a user-level thread system for very short-lived threads. It is core-aware: applications know which cores they own and control placement of work on those cores. A central arbiter reallocates cores among applications. The published results report strong memcached and RAMCloud improvements, and the implementation requires no Linux kernel modifications.
Takeaway: user-level scheduling gets much better when the runtime has explicit core ownership. Blindly creating more kernel threads and hoping the OS scheduler does the right thing is a weaker contract.
Shenango
Shenango targets datacenter services with microsecond-scale tail-latency goals. It uses kernel-bypass networking and an IOKernel on a dedicated core to steer packets and reallocate cores across applications every 5 microseconds. The key policy is rapid core reallocation based on whether queued work is waiting long enough to imply congestion.
Takeaway: a dedicated scheduling/control core can be worthwhile when latency SLOs are tighter than normal kernel scheduling reaction times. It is expensive and only justified for sustained latency-sensitive workloads.
Caladan
Caladan extends the idea from load to interference. It uses a centralized scheduler core and kernel module to monitor and react to memory hierarchy and hyperthread interference at microsecond scale. Its main claim is that static partitioning of cores, caches, and memory bandwidth is neither necessary nor sufficient for rapidly changing workloads.
Takeaway: CPU scheduling is not only “which runnable thread next.” On modern machines it is also placement relative to caches, sibling SMT threads, memory bandwidth, and bursty workload phase changes.
Design Axes
| Axis | Options | Practical conclusion |
|---|---|---|
| Stack model | Stackless tasks, segmented/growing stacks, fixed stacks | Rust async uses stackless futures; Go/Java need runtime-managed stacks; POSIX threads need fixed or growable user stacks |
| Preemption | Cooperative, safe-point, signal/upcall, timer-forced OS preemption | Kernel preemption alone protects the system; runtime fairness needs safe points or cooperative budgets |
| Blocking | Convert all operations to async, add carriers, kernel upcalls | Async caps reduce blocking; Go/POSIX still need kernel threads and futexes |
| Queueing | Global queue, per-worker queues, work stealing, priority queues | Per-worker queues plus stealing are the default; add global fairness escape hatches |
| CPU ownership | Invisible OS scheduling, affinity hints, explicit CPU grants | Explicit grants matter for high-performance runtimes and SQPOLL |
| Cross-process calls | Queue through scheduler, direct switch, scheduling donation | Direct switch and scheduling-context donation reduce sync IPC overhead and inversion |
| Isolation | Best-effort fairness, priorities, budget/period contexts | Cloud-oriented capOS eventually needs budget/period scheduling contexts |
capOS Design Options
Option: Minimal Kernel Mechanism Plus User Policy
This option keeps dispatch and enforcement in the kernel, replaces the current round-robin process scheduler with a minimal kernel CPU mechanism, and moves policy to user space through scheduling-context capabilities.
The kernel side covers:
- dispatching the next runnable thread on each CPU;
- enforcing budget/period/priority invariants;
- handling interrupts, blocking, wakeups, and exits;
- direct-switch IPC and scheduling-context donation;
- an emergency fallback policy.
The user-space scheduler service covers:
- policy configuration from the manifest;
- per-service budgets, periods, priorities, and CPU masks;
- admission control for new processes and threads;
- SQPOLL/core grants;
- response to timeout faults and overload telemetry.
This gives a capOS-like system the exokernel/microkernel benefit of policy freedom without putting a user-space server on the context-switch fast path.
Possible Implementation Sequence
- Thread scheduler in kernel. Convert from process scheduling to thread scheduling, with per-thread kernel stack, saved registers, FS base, and shared process address space/cap table.
- Scheduling contexts. Add kernel objects that carry budget, period, priority, CPU mask, and timeout endpoint. Initially assign one default context per thread.
- ThreadSpawner and ThreadHandle capabilities. Expose thread creation and
lifecycle through capabilities from the start. Bootstrap grants
initthe initial authority;initor a scheduler service delegates it under quota. - Scheduling-context donation for IPC. Baseline direct-switch IPC handoff exists for blocked Endpoint receivers; add budget/priority donation and return once scheduling contexts exist.
- User-space policy service. Let init or a
schedservice create and update scheduling contexts via capabilities. - SMP core ownership. After per-CPU run queues and TLB shootdown exist, allow the scheduler service to manage CPU masks and SQPOLL/poller grants.
- Optional dynamic policy. Much later, consider sched_ext-like policy modules if Rust/verifier infrastructure exists. This is not a prerequisite.
Minimal Kernel API Sketch
interface SchedulerControl {
createContext @0 (budgetNs :UInt64, periodNs :UInt64, priority :UInt16)
-> (context :SchedulingContext);
setCpuMask @1 (context :SchedulingContext, mask :Data) -> ();
bind @2 (thread :ThreadHandle, context :SchedulingContext) -> ();
unbind @3 (thread :ThreadHandle) -> ();
setTimeoutEndpoint @4 (context :SchedulingContext, endpoint :Endpoint) -> ();
stats @5 (context :SchedulingContext) -> (consumedNs :UInt64, throttled :Bool);
}
interface SchedulingContext {
yieldTo @0 (thread :ThreadHandle) -> ();
consumed @1 () -> (consumedNs :UInt64);
}
interface ThreadSpawner {
create @0 (
entry :UInt64,
stackTop :UInt64,
arg :UInt64,
context :SchedulingContext,
flags :UInt64
) -> (thread :ThreadHandle);
}
interface ThreadHandle {
join @0 (timeoutNs :UInt64) -> (status :Int64);
exitCode @1 () -> (exited :Bool, status :Int64);
bind @2 (context :SchedulingContext) -> ();
}
The hot path does not invoke these methods; they are control-plane operations.
Dependency: In-Process Threading
Kernel threads inside a process are a dependency for sophisticated user-level thread support:
Threadobject with saved registers, per-thread kernel stack, user stack pointer, FS base, state, and parent process reference.- Scheduler runs threads, not processes.
- Process owns address space and cap table; threads share both.
- Process context switch saves/restores FS base today; thread scheduling must make that state per-thread.
- Thread creation is exposed first as a
ThreadSpawnercapability; bootstrap grants initial authority toinit, and later policy delegates it through the capability graph. - Thread exit reclaims the thread stack and wakes joiners if join exists.
This directly unblocks Go phase 2, POSIX pthread compatibility, native
thread-local storage, and any multi-worker Rust async executor.
Dependency: Park (Linux futex analogue) and Timer
A minimal capability-authorized park primitive has this shape:
park(park_space, uaddr, expected, timeout_ns) -> Result
unpark(park_space, uaddr, max_count) -> usize
Required semantics:
parkchecks that*uaddr == expectedwhile holding the park wait-lock equivalent, then blocks the current thread.unparkmakes up tomax_countwaiters runnable.- Timeouts use monotonic ticks or a timer wheel/min-heap.
- Return values must distinguish woken, timed out, interrupted, and value mismatch.
The authority should be capability-based from the start, for example through a
ParkSpace, SharedParkSpace, or memory-object-derived capability. Pre-thread
measurement with the benchmark-only ParkBench cap favors a compact
capability-authorized operation over generic Cap’n Proto methods for failed
wait and empty wake. The blocked/resume path still needs measurement after
threads exist because the primitive sits on the runtime parking path.
Measure this before fixing the ABI:
CAP_OP_NOP: ring validation plus CQE post, with no cap lookup or capnp.- Empty and small
NullCapcalls through normal cap lookup, method dispatch, capnp param decode, and capnp result encode. - Futex-shaped compact operation carrying
cap_id,uaddr,expected, andtimeout/max_count, initially returning without blocking. - Generic
ParkBench.wait/ParkBench.wakeCap’n Proto methods for the same pre-thread failed-wait and empty-wake cases. - Later, real blocking paths: failed wait, wake with no waiters, wait-to-block, wake-to-runnable, and wake-to-resume.
The useful decision is not “capability or syscall”; it is “generic capnp method or compact capability-authorized scheduler primitive.” Authority remains in the capability model either way.
Near Term: Runtime Event Integration
For capos-rt, design the executor around kernel completion sources:
- Capability-ring CQ entries wake tasks waiting on cap invocations.
- Notification objects wake tasks waiting on interrupts, timers, or service events.
- Futex wakes resume parked worker threads.
- Timers can be integrated as wakeups instead of periodic polling.
The executor policy can start simple:
- One worker per kernel thread.
- Local FIFO queue per worker.
- One global injection queue.
- Work stealing when local and global queues are empty.
- Cooperative operation budget, then requeue.
Stage 6: IPC Scheduling
For synchronous IPC, direct switch has been introduced before priority scheduling:
- If client A calls server B and B is blocked in receive, switch A -> B directly without picking an unrelated runnable thread. This is implemented for the current single-CPU Endpoint path.
- Mark A blocked on reply.
- Future fastpath work can transfer a small message inline; use shared buffers for large data.
Scheduling-context donation then adds the budget/priority transfer:
- The server runs the request using the caller’s scheduling context.
- The caller’s budget covers client + server work.
- Passive servers can exist without independent CPU budget and only run when a caller donates one.
This avoids priority inversion through the capability graph and matches the service architecture better than per-process priorities alone.
Stage 7: SMP and Core Ownership
Once per-CPU scheduler queues exist, these become policy surfaces:
- CPU affinity depends on correct migration and TLB shootdown.
- A
CpuSetorSchedulingContextcapability can describe allowed CPUs, budget, period, and priority. - Cheap current-CPU exposure depends on a stable per-thread ABI page.
- SQPOLL can be gated on available CPU budget to avoid unlimited poller creation.
Risks and Failure Modes
- M:1 green threads do not provide Go or POSIX compatibility by themselves.
- A normal user-space process choosing the next thread on every timer tick puts a context-switch round trip on the hot path.
- Recovery from scheduler-service failure cannot depend solely on the scheduler service being runnable.
- A Go-like G/M/P scheduler in the kernel couples language runtime policy to the kernel.
- Generic Cap’n Proto capability calls may be too heavy for every synchronization primitive. Measure generic calls against compact capability-authorized operations before fixing the futex ABI.
- sched_ext-like dynamic policy loading depends on mature scheduler invariants and verifier/runtime machinery.
- SQPOLL on a single-core system can compete with the application it is meant to accelerate.
Open Questions
- Does capOS need scheduler-activation-style upcalls? Async caps and notification objects cover many of the same cases with less machinery.
- How can runtime preemption work without Unix signals? Options are cooperative-only, timer notification to a runtime handler, or a kernel forced safe-point ABI. Cooperative-only is one first-support option for Go.
- How are shared-memory futex keys represented? Private futexes can key on address space and virtual address. Shared futexes need memory-object identity and offset.
- How large is the blocked/resume overhead once threads exist? The pre-thread failed-wait and empty-wake measurement already favors compact operations, but 4.5.5 still needs the contended path before freezing the final ABI.
- How much policy belongs in the boot manifest versus a long-running
schedservice? Static embedded systems can use manifest policy. Cloud or developer systems need runtime policy updates. - What is the emergency fallback if the scheduler service exits? Options are a tiny kernel round-robin fallback for privileged recovery threads, a pinned immortal scheduler thread, or panic. The first is the only robust development choice.
Source Notes
- Anderson et al., “Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism” (SOSP 1991): https://polaris.imag.fr/vincent.danjean/papers/anderson.pdf
- “Towards Effective User-Controlled Scheduling for Microkernel-Based Systems” (L4 user-level scheduling): https://os.itec.kit.edu/21_738.php
- Asberg and Nolte, “Towards a User-Mode Approach to Partitioned Scheduling in the seL4 Microkernel”: https://www.es.mdh.se/pdf_publications/2641.pdf
- Kang et al., “A User-Mode Scheduling Mechanism for ARINC653 Partitioning in seL4”: https://link.springer.com/chapter/10.1007/978-981-10-3770-2_10
- L4Re overview: https://l4re.org/doc/l4re_intro.html
- Liedtke, “On micro-kernel construction”: https://elf.cs.pub.ro/soa/res/lectures/papers/lietdke-1.pdf
- seL4 MCS tutorial: https://docs.sel4.systems/Tutorials/mcs.html
- seL4 design principles: https://microkerneldude.org/2020/03/11/sel4-design-principles/
- Linux kernel sched_ext documentation: https://www.kernel.org/doc/html/next/scheduler/sched-ext.html
- Arun et al., “Agile Development of Linux Schedulers with Ekiben”: https://arxiv.org/abs/2306.15076
- Williams, “An Implementation of Scheduler Activations on the NetBSD Operating System” (USENIX 2002): https://web.mit.edu/nathanw/www/usenix/freenix-sa/freenix-sa.html
- Microsoft, “User-Mode Scheduling”: https://learn.microsoft.com/en-us/windows/win32/procthread/user-mode-scheduling
- Go runtime scheduler source: https://go.dev/src/runtime/proc.go
- Go preemption source: https://go.dev/src/runtime/preempt.go
- OpenJDK JEP 444, “Virtual Threads”: https://openjdk.org/jeps/444
- Tokio runtime scheduling documentation: https://docs.rs/tokio/latest/tokio/runtime/
- von Behren et al., “Capriccio: Scalable Threads for Internet Services” (SOSP 2003): https://web.stanford.edu/class/archive/cs/cs240/cs240.1046/readings/capriccio-sosp-2003.pdf
- Argobots paper page: https://www.anl.gov/argonne-scientific-publications/pub/137165
- Argobots project: https://www.argobots.org/
- Pan et al., “Lithe: Enabling Efficient Composition of Parallel Libraries” (HotPar 2009): https://www.usenix.org/legacy/event/hotpar09/tech/full_papers/pan/pan_html/
- Linux
futex(2)manual: https://man7.org/linux/man-pages/man2/futex.2.html - Linux kernel restartable sequences documentation: https://docs.kernel.org/userspace-api/rseq.html
io_uring_sqpoll(7)manual: https://manpages.debian.org/testing/liburing-dev/io_uring_sqpoll.7.en.html- Qin et al., “Arachne: Core-Aware Thread Management” (OSDI 2018): https://www.usenix.org/conference/osdi18/presentation/qin
- Ousterhout et al., “Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads” (NSDI 2019): https://www.usenix.org/conference/nsdi19/presentation/ousterhout
- Fried et al., “Caladan: Mitigating Interference at Microsecond Timescales” (OSDI 2020): https://www.usenix.org/conference/osdi20/presentation/fried
Completion Rings And Threaded Runtimes
This note grounds the capOS ring/threading roadmap in existing completion I/O and futex designs. The question is not whether a shared CQ can be made to work with many waiting threads; it can. The question is which ownership model keeps the kernel ABI stable once capOS runs multiple process threads on multiple CPUs.
Sources Checked
- Linux
io_uring_enter(2)documents the aggregate wait shape: withIORING_ENTER_GETEVENTS, the syscall waits untilmin_completecompletion events are available. - Linux
io_uring_setup(2)documents SQPOLL, CQ sizing, and single-issuer-oriented task-run modes. - Linux
io_uring_register(2)documents registered wait regions. - Jens Axboe’s
io_uringpaper explains the core ring design as a pair of shared rings with single producer/single consumer ownership on each side anduser_datacopied from request to completion for matching. - Linux
futex(2)andfutex(7)document futexes as a kernel-assisted blocking path for synchronization objects whose uncontended state lives in user memory. - Microsoft I/O completion ports document the port model: threads wait on a completion port and dequeue completion packets, rather than each thread waiting directly on one specific operation’s storage slot.
Consequences For capOS
The current process-wide capOS ring matches the early io_uring shape: one SQ,
one CQ, and user_data for completion matching. That shape is efficient when
userspace serializes submission and completion consumption through one runtime
owner. It becomes the wrong primitive for full SMP if multiple kernel-scheduled
threads in the same process concurrently enter the kernel, because the ring
turns into a multi-producer/multi-consumer coordination problem.
Waiting for a raw CQ slot is not a good abstraction. CQ slots are circular
buffer storage and are reused. Stable wait identities are request user_data,
kernel answer ids, completion packets, or a completion queue/lane chosen at
submission time.
The clean full-SMP target is per-thread completion ownership. Each thread gets
its own capability ring endpoint: a complete SQ/CQ pair, even if multiple
endpoints are packed into one larger mapping. The existing
cap_enter(min_complete, timeout_ns) semantics can then remain aggregate:
min_complete counts completions available on the current thread’s CQ. Runtime
code still matches individual operations by user_data, but two sibling
threads no longer race to consume the same process CQ.
The Windows IOCP model is a useful counterpoint: a shared completion port works when the abstraction is explicitly a packet queue consumed by a worker pool. That is a runtime/service scheduling model, not the same thing as multiple threads blocking on one raw process CQ while each expects a private answer.
Current Implementation State
The kernel dispatches six SQE opcodes today: CALL, RECV, RETURN, RELEASE,
CANCEL, and NOP. FINISH is reserved for the future system capnp transport and
completes with an unsupported-opcode error. PARK and UNPARK (capability-
authorized futex-style thread-park operations) are also dispatched. Only
CALL opcodes are gated to syscall context (via
call_requires_syscall_dispatch); the other dispatched opcodes, PARK and
UNPARK included, are processed in both syscall and timer-interrupt contexts.
PARK_BENCH is measurement-only and dispatched only when the kernel is built
with the measure feature.
Per-process resource limits are enforced via ResourceProfile, a quota
struct carried on each Process and resolved at spawn time. Two fields
directly bound the ring’s resource use: ring_scratch_limit_bytes caps the
input and output buffer capacity of the per-process ring scratch allocator
(narrowing the kernel-side ceilings MAX_PARAMS and MAX_RESULT);
in_flight_call_limit and endpoint_queue_limit cap the per-Endpoint
in-flight CALL count and the queued (parked) CALL queue depth respectively,
each clamped by a kernel structural maximum of 32.
SQPOLL on the per-process ring has landed: a process can hold a
kernelSqpoll lease whose bound ring transitions into SQPOLL mode, with
the kernel acting as sole SQ consumer for that ring. This is the SQPOLL
foundation for the full-SMP per-thread ring target described below, not the
target itself. Generic full-nohz for explicitly budgeted compute leases and
SQPOLL nohz for explicitly leased caller-thread rings have landed; broader
userspace-poller/device-queue issuance remains future work.
Recommended Direction
- Keep the current process ring as the bootstrap and compatibility surface.
- Add runtime reactor/demux support as an interim path for multithreaded runtimes that still use one process ring.
- Make the full SMP ABI a per-thread ring model:
- each
Threadowns one ring endpoint with a complete SQ/CQ pair; cap_enteroperates on the current thread’s ring;- SQPOLL, when enabled, is the sole kernel SQ consumer for that ring;
- result-cap transfers still mutate the process cap table;
- endpoint, timer, process-wait, thread-join, and futex completions post to
the waiting
ThreadRef’s ring.
- each
- Consider shared completion ports only as a userspace runtime/service abstraction above per-thread rings, not as the kernel’s first full-SMP ring ABI.
References
- Linux
io_uring_enter(2): https://man.archlinux.org/man/io_uring_enter.2.en - Linux
io_uring_setup(2): https://man7.org/linux/man-pages/man2/io_uring_setup.2.html - Linux
io_uring_register(2)registered wait regions: https://www.man7.org/linux/man-pages/man2/io_uring_register.2.html - Jens Axboe, “Efficient IO with io_uring”: https://www.kernel.dk/io_uring.pdf
- Linux
futex(2): https://man7.org/linux/man-pages/man2/futex.2.html - Linux
futex(7): https://man7.org/linux/man-pages/man7/futex.7.html - Microsoft I/O completion ports: https://learn.microsoft.com/en-us/windows/win32/fileio/i-o-completion-ports
x2APIC and APIC Virtualization
Research note for the SMP Phase C LAPIC/IPI decision. The goal is to decide how x2APIC should fit after the current LAPIC/IPI implementation work and to record which virtualization facts affect that choice.
Status note (2026-06-06): The x2APIC backend has landed in
kernel/src/arch/x86_64/lapic.rs: the BSP checksCPUID.01H:ECX.x2APICat boot and prefers x2APIC MSR access when available, falling back to xAPIC MMIO. AP initialization follows the BSP-selected mode. The selected-mode QEMU proof ismake run-interrupt-grant-x2apic, which forces+x2apic, assertsLapicMode::X2Apic, and reuses the routedInterrupt.wait/Interrupt.acknowledgepath. The proof is a bounded QEMU backend-selection proof, not high-core hardware readiness.
Existing Local Research
Before adding this note, docs/research/ contained:
capnp-error-handling.mdcompletion-ring-threading.mderos-capros-coyotos.mdgenode.mdix-on-capos-hosting.mdllvm-target.mdos-error-handling.mdout-of-kernel-scheduling.mdpingora.mdplan9-inferno.mdsel4.mdsmall-llm-survey.mdzircon.md
None of those files directly cover APIC/x2APIC or KVM APIC virtualization.
Sources Checked
- Intel, Intel 64 and IA-32 Architectures Software Developer’s Manuals, Vol. 3 APIC and APIC virtualization chapters.
- Intel, xAPIC Deprecation Plan, updated 2025-09-18.
- Intel, CPUID Enumeration and Architectural MSRs, updated 2025-05-12.
- QEMU, QEMU / KVM CPU model configuration.
- QEMU, Paravirtualized KVM features.
- Linux kernel documentation, KVM API.
Local verification:
- Host command
qemu-system-x86_64 --versionreported QEMU 8.2.2. - Host command
qemu-system-x86_64 -cpu helplistedx2apicas a recognized CPUID feature. - The current capOS LAPIC implementation has both xAPIC MMIO and x2APIC MSR backends. The BSP selects x2APIC when CPUID or firmware state makes it available and otherwise falls back to xAPIC MMIO.
make run-interrupt-grant-x2apicuses-cpu qemu64,+smep,+smap,+rdrand,+x2apic, asserts the selectedLapicMode::X2Apicbackend, and proves the routed interrupt waiter / deferred-EOI acknowledgement path still works in that mode.
x2APIC Findings
x2APIC is still the forward-looking LAPIC backend for later hardware and VM coverage:
- It avoids mapping the local APIC MMIO page and uses architectural MSRs for local APIC register access.
- It supports wider APIC IDs than xAPIC’s 8-bit destination model, which keeps the CPU-id/LAPIC-id split introduced by the SMP proposal relevant on larger systems and VMs.
- Intel’s current public guidance says x2APIC is required above 255 cores, newer Intel client families default to x2APIC, and legacy xAPIC can become unavailable or locked out after firmware or system software enters x2APIC.
- The local capOS dependency set already has
x86_64MSR access, and the implemented x2APIC backend covers EOI, ICR/IPI, spurious vector, LVT timer, timer initial count, divide config, and current APIC ID without adding another architecture crate.
The implementation shape is:
- Keep the xAPIC MMIO LAPIC timer/IPI foundation as the fallback for older hardware and VM configurations that only expose xAPIC.
- Select x2APIC when
CPUID.01H:ECX.x2APICis available or when firmware has already enabled/locked x2APIC. - Keep TLB shootdown, timer, EOI, and device-vector paths on the architectural LAPIC interface rather than on KVM paravirtual APIC helpers.
- Treat larger-APIC-ID and high-core hardware validation as future hardware evidence; the current selected-mode QEMU proof covers backend selection and the routed waiter/ack path only.
Virtualization Findings
Virtualization is relevant to validation and future performance, not to the guest-visible correctness contract:
- QEMU/KVM can expose x2APIC through CPU model feature selection. capOS tests
should make that explicit by extending the current QEMU model to
-cpu qemu64,+smep,+smap,+rdrand,+x2apic, or by using another named CPU model with+x2apic, instead of relying on the host or accelerator default. - KVM exposes APIC state through its own API and has x2APIC-specific handling for 32-bit APIC IDs. That matters to the VMM, but a capOS guest should use the architectural x2APIC interface.
- QEMU/KVM paravirtual features such as
kvm-pv-eoi,kvm-pv-ipi, andkvm-pv-tlb-flushare optional accelerations. They should not be part of the first LAPIC/IPI or TLB-shootdown proof because they would make correctness depend on a Linux/KVM-specific host contract. - APIC virtualization features such as APICv or AMD AVIC are VMM-side acceleration mechanisms. capOS should not require or detect them before it has a stable architectural x2APIC path.
The practical QEMU proof targets are therefore:
- Boot the current xAPIC MMIO LAPIC implementation with
-smp 2. - Prove LAPIC timer ticks on vector 48 and IPI delivery on vector 49.
- Keep KVM paravirtual APIC/TLB/IPI features disabled or ignored for the first correctness proof.
- Run
make run-interrupt-grant-x2apicas the selected-mode x2APIC proof, using-cpu qemu64,+smep,+smap,+rdrand,+x2apicand asserting the selected backend plus the routed interrupt wait/ack path.
capOS Recommendation
Keep x2APIC as the preferred backend when CPUID or firmware state exposes it, with xAPIC MMIO as the fallback. Keep correctness on the architectural LAPIC timer, IPI, EOI, and device-vector paths; KVM paravirtual APIC/TLB/IPI features remain optional accelerations rather than proof dependencies. Do not treat the selected-mode QEMU proof as high-core hardware readiness.
IOMMU Remapping Grounding
This note records primary-source facts for IOMMU/remapping work. The Intel
VT-d path has landed under #[cfg(feature = "qemu")] in kernel/src/iommu.rs
as a QEMU q35 smoke (make run-iommu-remapping); AMD-Vi table programming
remains future work. DMAPool has manager-owned domain identity and
mapping-lifecycle preflight records. For the QEMU Intel IOMMU path, real VT-d
table programming, hardware-DMA translation proof, two-phase
invalidation/IOTLB-flush revocation, and IOMMU-backed hostile stale-DMA smokes
have all landed (see
ddf-iommu-qemu-intel-remapping-smoke).
For QEMU shapes without intel-iommu, the kernel-owned bounce-buffer fallback
remains active (remapping_tables=not-programmed,
hostile_hardware_isolation=not-claimed). AMD-Vi table programming and a
bounce-buffer policy for non-IOMMU devices remain open.
Sources
- Intel, Intel Virtualization Technology for Directed I/O Architecture
Specification,
content ID 671081. Intel page metadata on 2026-05-12 listed Date
2022-06-02and Version5.1 (Latest). Sections used: 6.2.2 “Context-Cache”, 6.2.4 “IOTLB”, 6.5.1 “Register-based Invalidation Interface”, 6.5.2 “Queued Invalidation Interface”, 6.5.3 “IOTLB Invalidation Considerations”, 6.6 “Set Root Table Pointer Operation”, 6.8 “Write Buffer Flushing”, 7.10 “Software Steps to Drain Page Requests & Responses”, 8.3 “DMA Remapping Hardware Unit Definition Structure”, 8.3.1 “Device Scope Structure”, 9.1 “Root Entry”, 9.3 “Context Entry”, 9.4 “Scalable-Mode Context-Entry”, and 11.4.5-11.4.9 covering the root-table-address, invalidation, fault, protected-memory-range, and invalidation-queue registers. - AMD, AMD I/O Virtualization Technology (IOMMU) Specification 48882, 48882-PUB Rev 3.10, February 2025. Sections used: 2.2 device table, device-table entry, I/O page table, and interrupt-remapping material; 2.4 “Commands”; 2.5 “Event Logging”; 3.4 “IOMMU MMIO Registers”; IVRS/device-table/page-table, command-buffer, completion-wait, invalidation, and event-log material.
- QEMU, qemu-manpage
entries for
-device intel-iommu,-device amd-iommu, and-device virtio-iommu-pci; and QEMU PCI developer documentation for PCI IOMMU and IOTLB notifier APIs. These are current-master QEMU docs, not a frozen release manual; theqemu-manpageand PCI developer pages observed on 2026-05-12 were generated for QEMU version 11.0.50.
Intel VT-d Grounding
Intel VT-d identifies DMA request sources through PCI requester/source IDs and
resolves them through DMA remapping hardware units described by DMAR DRHD
structures. The table path is rooted at a root table and context tables. Root
entries select context tables, context entries bind a source to a translation
type, domain identifier, address width, and second-level page-table root, and
scalable-mode context entries extend that context format. The landed QEMU smoke
(kernel/src/iommu.rs, cfg(qemu)) uses exactly this path: DRHD unit,
PCI segment and BDF/source ID, domain ID, aw-bits=39 address width, and a
3-level second-level page-table root. Scalable-mode context entries, 48-bit
IOVA space, interrupt remapping, and multi-device domains remain out of scope
for the current slice.
Invalidation is part of the mapping lifetime, not a diagnostic detail. Intel’s
register-based and queued invalidation interfaces cover context-cache,
IOTLB, device-TLB, interrupt-entry-cache, and wait/completion descriptors. The
landed smoke uses register-based context-cache invalidation (CCMD.ICC global
granularity) and domain-selective IOTLB invalidation (IOTLB.IVT,
CAP.IRO-decoded offset), both with bounded completion-bit polling. Page reuse
is ordered strictly after invalidation completion; a poll exhausted without
observing completion fails closed and does not free the backing pages. Queued
invalidation (GCMD.QIE) is not set in the current slice. Fault-reporting
registers (FSTS.PPF, FRCD[0].F) are the minimum diagnostic surface for
translation failures and protection faults, and are exercised by the
unmapped-IOVA and stale-DMA hostile proofs.
QEMU’s intel-iommu documentation is useful for focused emulator smokes but
should not be treated as hardware coverage. It is q35-only in QEMU current
master. Relevant options include intremap, caching-mode, device-iotlb,
and aw-bits=39|48; QEMU documents 39-bit IOVA space for 3-level IOMMU page
tables and 48-bit IOVA space for 4-level tables.
AMD-Vi Grounding
AMD-Vi uses a different vocabulary and table root. Device requests are keyed by DeviceID and resolved through a Device Table Entry. A DTE carries validity, translation, interrupt-remapping, DomainID, mode/page-table-depth, and page-table-root information. Future shared capOS abstractions can name the logical domain and IOVA lifetime generically, but AMD-specific code should not pretend it is programming Intel root/context tables.
AMD invalidation and completion are command-buffer operations. The future mapping lifetime must include command-buffer invalidation commands, completion wait, and event-log handling. The event log is the basic hardware-facing diagnostic record for malformed requests, page faults, and table errors; the MMIO register set covers control/status, command and event pointers, event-log state, alternate event-log buffers, device-table segment bases, and extended features.
QEMU’s amd-iommu documentation is also q35-only in current master. The
documented options include dma-remap for DMA address translation and
permission checking and intremap for interrupt remapping. Treat these as
emulator smoke inputs until capOS has separate hardware or provider evidence.
QEMU Test Surface
QEMU provides the emulator-level test surface for IOMMU smokes:
intel-iommuon q35 withaw-bits=39(3-level second-level page tables) is the shape used by the landedmake run-iommu-remappingsmoke, pinned to QEMU 8.2.2. The smoke asserts table programming, hardware-DMA translation (mapped_iova_translated=hardware-dma), unmapped-IOVA fault observation (unmapped_iova_fault=observed), two-phase invalidation/IOTLB-flush, and IOMMU-backed hostile stale-DMA proofs.amd-iommuon q35 with DMA remapping enabled is grounded here for a future AMD-Vi table-programming slice.virtio-iommu-pcion q35 x86_64 orvirtARM covers a portable virtio-IOMMU frontend if selected later.- PCI IOMMU/IOTLB notifier APIs in QEMU developer docs describe how emulated devices observe translation changes; they are not guest architectural requirements.
QEMU citations in the Sources section are current-master documentation observed
on 2026-05-12. Tests pin the local qemu-system-x86_64 --version, machine
type, and full device option string in the smoke evidence.
Implementation Status and Future Slices
Intel VT-d QEMU smoke (landed, cfg(qemu)):
- DMAR/DRHD discovery, MMIO/fault-status diagnostics, and disabled IOVA ledger preflight records: landed as prerequisites.
kernel/src/iommu.rsreal VT-d legacy-mode entry programming, RTAR write,GCMD/GSTSSRTP-then-TEhandshake, hardware-DMA translation proof via virtio-rng, unmapped-IOVA fault observation viaFSTS/FRCD, two-phase invalidation/IOTLB-flush revocation, and IOMMU-backed hostile stale-DMA smokes: all landed as of 2026-05-14 (slices A1/A2/B/C). See ddf-iommu-qemu-intel-remapping-smoke.- IOVA export stays disabled for this slice (
iova_export=disabled-this-slice);hostile_hardware_isolation=not-claimedin all evidence.
Future slices (not yet started):
- AMD-Vi table programming: separate source grounding and evidence; AMD-specific DTE, DeviceID, command-buffer, and event-log names must not be conflated with Intel root/context tables.
- Source-grounding refresh for AMD or additional Intel features (48-bit IOVA, scalable-mode context entries, interrupt remapping, device-IOTLB) when a real branch selects them.
- Bounce-buffer policy for QEMU shapes without
intel-iommu: an explicit decision on IOMMU/remapping or an explicit bounce-buffer policy for non-IOMMU devices remains open. - Trusted multi-device sharing groups, production NIC or storage driver ownership, and moving the live virtio-net path off bounce buffers are not in scope for the current slice.
DMA User-Space Driver Isolation
This note records the DMA-addressing and isolation consequences capOS must use when planning user-space storage and NIC drivers. It is intentionally about authority boundaries, not about a particular NVMe or virtio implementation.
Address Spaces And Trust Boundaries
A DMA-capable device does not use a process virtual address. It consumes a device-visible address carried in descriptors, queue-base registers, PRP/SGL entries, or an equivalent protocol field.
On a bare host with an IOMMU:
user VA --CPU MMU--> host physical address
device IOVA --IOMMU--> host physical address
On a guest VM:
guest user VA --guest MMU--> guest physical address --EPT/NPT--> host physical address
With a virtual or assigned IOMMU, a guest can additionally reason about:
guest device IOVA --vIOMMU or paravirt grant layer--> guest physical address
The host still owns the real host IOMMU or equivalent hypervisor translation. A guest-programmable vIOMMU is useful because it gives the guest kernel a guest-internal DMA authority boundary; it is not direct control of the host IOMMU.
Host User-Space Driver Pattern
A safe host user-space driver resembles the VFIO/IOMMUFD split:
- The kernel owns PCI discovery, BAR assignment, PCI configuration mediation, IOMMU domain creation, DMA map/unmap, page pinning, interrupt or MSI-X routing, reset, hotplug, and revocation.
- The user-space driver owns protocol logic: queue formats, descriptor contents, device-specific register sequencing, doorbells, polling, completion handling, and command construction.
- The driver may receive a domain-scoped IOVA for a live buffer only when the kernel has installed and can revoke the IOMMU mapping for that device.
- The driver must not receive unrestricted host physical addresses.
UIO-style “map a BAR and deliver interrupts” is not a complete security model for a DMA-capable PCI device. If a user-space process can program a DMA engine through MMIO, then DMA isolation requires either an IOMMU domain or a stricter broker that prevents raw device-address publication.
Guest Microkernel Pattern
Host isolation and guest isolation are different claims.
For an assigned PCI device or SR-IOV VF without a guest-visible IOMMU, the host can still protect itself by mapping the device only to the VM’s memory. That does not protect the guest kernel from an untrusted guest user-space driver: from the guest’s perspective the device can still DMA to arbitrary guest physical pages.
Virtual devices have the same guest-internal issue in a different form. If an untrusted driver can put arbitrary guest physical addresses into virtqueue descriptors, the host backend can write into guest kernel memory while still staying inside the VM boundary. The host remains protected; the guest kernel is not.
A guest microkernel that wants untrusted user-space drivers therefore needs one of these guest-visible authorization layers:
- a vIOMMU or virtio-iommu path where the guest kernel controls guest IOVA to guest physical mappings;
- a paravirtual grant-table model where descriptors carry grant identifiers instead of raw guest physical addresses;
- a trusted mediation service that owns descriptor/device-address fields and lets the untrusted driver submit only typed commands, buffer capabilities, or opaque handles.
The invariant is:
Never let an untrusted guest driver provide a raw guest physical address to a
device or backend unless a guest-visible DMA authorization layer validates it.
BAR, MSI-X, And DMA Are Separate Authority Surfaces
BAR/MMIO controls CPU-to-device register access. DMA controls device-to-memory access. MSI/MSI-X controls device-to-interrupt-controller messages. A safe user-space driver interface needs all three mediated.
- Mapping a BAR is not enough; a BAR write can enable bus mastering or ring a doorbell that makes descriptors visible to the device.
- MSI-X tables often live inside a BAR. A driver must not get arbitrary write access to MSI-X message address/data entries unless the kernel or hypervisor can mediate interrupt remapping.
- IOMMU memory remapping does not by itself protect BAR register semantics or interrupt routing.
For capOS, DeviceMmio, DMAPool/DMABuffer, and Interrupt must remain
separate capabilities with a single device-manager ledger tying them to the
same owner generation and teardown state.
No-IOMMU Bounce-Buffer Consequences
On a shape without guest-programmable remapping, a real PCI device’s device-visible address is the host physical or bus address the controller uses for DMA. A bounce buffer can keep the data path manager-owned, but it does not magically create an untrusted-driver-safe IOVA namespace.
The no-IOMMU fallback can preserve no-host-physical-exposure only if userspace does not author raw device-address fields. The kernel or a trusted device manager must instead:
- allocate and pin the device-visible bounce pages;
- program queue-base registers and PRP/SGL or virtqueue address fields, or translate typed driver requests into those fields;
- copy between device-visible bounce pages and non-device memory when the selected backend requires it;
- quiesce outstanding DMA before revoke or page reuse;
- scrub bounce pages before reuse;
- keep
hostile_hardware_isolation=not-claimed.
The costs are direct: extra copies, higher latency, CPU/cache pressure, bounded pool exhaustion risk, more teardown bookkeeping, and no hostile-hardware memory isolation claim. These costs are the price of not exposing host physical addresses when no guest-programmable remapping exists.
GCP And QEMU Implications
The GCE probes in Cloud DMA Provider Evidence Inventory show no guest-programmable IOMMU on the sampled GCP shapes: no usable DMAR/IVRS/IORT tables or IOMMU groups, and SWIOTLB software bounce buffering in the Linux guest. Host-side or provider-side isolation may still exist, but capOS cannot program or validate it from inside the guest.
The practical split is:
- QEMU
run-iommu-remappingremains the right local proof lane for direct-remapping behavior: domain-scoped IOVA export, per-device domains, invalidation, faults, and stale-DMA behavior. - GCP storage and NIC driver planning must treat the probed shapes as no-IOMMU/bounce-buffer targets until a future runtime probe observes a guest-programmable remapping unit.
- A design that requires the provider to write device-visible queue-base or PRP/SGL addresses is valid only on a verified direct-remapping/vIOMMU path, or after capOS implements a separate synthetic address namespace that the kernel translates before hardware sees it.
- On the current GCP/no-IOMMU path, the recommended storage design is brokered: userspace owns protocol decisions and buffer capabilities, while the kernel or device manager materializes the device-visible DMA addresses.
Recommended capOS Backend Modes
Use three explicit modes in planning and task acceptance:
| Mode | When it applies | User-space device-address exposure |
|---|---|---|
direct-remapping | capOS discovers, programs, and validates a guest-visible IOMMU/vIOMMU domain. | Domain-scoped IOVA only, labeled as meaningless outside that domain. |
brokered-bounce | No usable guest IOMMU, but a manager-owned bounce path can safely support the device. | None: provider passes buffer caps, grant IDs, or typed commands; kernel writes device-visible addresses. |
unsupported | Observations are contradictory, unsafe, or no safe brokered path exists. | None: device stays unbound or disabled. |
For GCP today, brokered-bounce is the only credible storage/NIC driver target
on the probed shapes. direct-remapping remains a QEMU proof lane and a future
cloud/hardware lane only after runtime evidence shows guest-programmable
remapping.
Cloud DMA Provider Evidence Inventory
This note is the research substrate for the cloud DMA backend decision. It records official AWS, Azure, and Google Compute Engine device-surface facts, defines the evidence-matrix schema that the backend policy fills, specifies the live guest-probe checklist a later credentialed cloud-run task captures, and fixes the classification rules that separate a DMA-capable surface from guest-programmable remapping authority.
It makes no backend selection and no per-VM-shape safety claim. It does not
launch a cloud VM, require provider credentials, or assert that any instance
shape is safe for direct DMA. Selecting a backend and asserting bounce-buffer
safety or IOMMU coverage for a specific shape require attended sign-off and are
out of scope here; that work is cloud-dma-backend-selection. The model this
note feeds is docs/proposals/dma-assurance-model-proposal.md; the local
QEMU/IOMMU grounding it builds on is docs/research/iommu-remapping.md.
How These Facts Were Collected
Provider facts are from official provider documentation and API/CLI references only, retrieved on the dates recorded below. A “fact” here is a statement the provider document makes directly. Where a property is read from an API field rather than stated in prose, it is marked as an inference from API field. No statement in this note comes from running a cloud instance; the live-probe checklist exists precisely because a guest cannot prove provider-side isolation from documentation alone.
Provider Official Facts
AWS EC2
Source: ec2:DescribeInstanceTypes API reference
(InstanceTypeInfo,
NetworkInfo,
EbsInfo),
retrieved 2026-05-24. The matching CLI is
aws ec2 describe-instance-types --instance-types <type>.
- Network surface.
networkInfo.enaSupportreports Elastic Network Adapter (ENA) support with valuesunsupported | supported | required.networkInfo.efaSupported(boolean) andnetworkInfo.efaInforeport Elastic Fabric Adapter presence.networkInfo.enaSrdSupported(boolean) reports ENA Express (Scalable Reliable Datagram).networkInfo.encryptionInTransitSupported(boolean) reports automatic in-transit encryption between instances. - EBS/NVMe surface.
ebsInfo.nvmeSupportreports NVMe support for EBS with valuesunsupported | supported | required.ebsInfo.ebsOptimizedSupportreports EBS-optimized behavior (unsupported | supported | default). - Instance store.
instanceStorageSupported(boolean) andinstanceStorageInforeport local instance-store NVMe disks. - Accelerators.
gpuInfo,fpgaInfo,inferenceAcceleratorInfo,neuronInfo, andmediaAcceleratorInfodescribe GPU/FPGA/inference/Neuron/ media accelerator surfaces when present. - Hypervisor.
hypervisorreportsnitro | xen. Modern Nitro instances reportnitro; the Nitro system is where ENA and NVMe EBS exposure originate.
Inference from API field: an instance type with enaSupport=required and
ebsInfo.nvmeSupport=required exposes a DMA-capable NIC and NVMe block surface.
This identifies a DMA-capable surface; it is not evidence of guest-programmable
remapping authority.
Azure Virtual Machines
Source: Azure Accelerated Networking overview
(page ms.date 2026-02-05, last updated 2026-05-05) and
az vm list-skus,
retrieved 2026-05-24.
- Network surface. Accelerated Networking enables single-root I/O virtualization (SR-IOV) on supported VM sizes, providing a host-bypass data path. The underlying SR-IOV hardware is one of NVIDIA/Mellanox ConnectX-3, ConnectX-4 Lx, ConnectX-5, or the Microsoft Azure Network Adapter (MANA).
- Capability query. A VM size’s Accelerated Networking capability is read
from
az vm list-skusas theAcceleratedNetworkingEnabledcapability value. Most general-purpose and compute-optimized sizes with two or more vCPUs support it (four or more on hyperthreaded sizes); NC and NV sizes appear in output but do not support it. - VF dynamic binding and revocation. The document states the SR-IOV virtual
function (VF) is dynamically revoked and restored across host maintenance and
live migration. Guest images must bind to the synthetic
hv_netvscdevice, not the VF, to keep connectivity, and must markmana | mlx4_core | mlx5_coreSR-IOV devices unmanaged so the synthetic/VF bond is transparent. - Driver delivery. Azure does not update the Mellanox or MANA in-guest drivers; the guest kernel/distribution provides them.
Inference from API field: AcceleratedNetworkingEnabled=True identifies a
DMA-capable SR-IOV NIC surface whose VF can appear and disappear at runtime. The
documented VF revoke/restore behavior is a driver-lifecycle constraint, not
remapping evidence.
Google Compute Engine
Source: Use Google Virtual NIC (gVNIC) and About Local SSD disks, retrieved 2026-05-24.
- Network surface. Third-generation and later machine series (excluding bare
metal) support only gVNIC for the virtual network interface (no virtio-net).
First- and second-generation machines must use gVNIC when on Arm CPU
platforms, when configured as Confidential VM, or when requiring network
speeds between 50 and 100 Gbps, and otherwise still support VirtIO-Net. Custom images declare gVNIC support through
the
GVNICguest OS feature (--guest-os-features=GVNIC, orguestOsFeatures:[{type:"GVNIC"}]). - Local SSD surface. Local SSD is attached over either the NVMe or SCSI
interface; the NVMe interface is required for peak performance, and some
machine series support only one of the two interfaces. The interface is chosen
by the disk
interfacefield (NVMEorSCSI). - Storage transport. Persistent Disk attaches as virtio-scsi on machine families that expose it, while newer families expose NVMe; the exact transport is a per-machine-family property to be captured per shape rather than assumed.
Inference from API field: a third-generation-or-later GCE machine type exposes a gVNIC NIC surface and may expose NVMe Local SSD/Persistent Disk. This identifies DMA-capable NIC/storage surfaces; it is not remapping evidence.
Evidence-Matrix Schema
The backend policy fills one row per observed (provider, shape, image) tuple. Provider-fact columns come from documentation/API; observation columns come from the live-probe checklist; the last two columns are derived classifications, not provider claims.
| Column | Meaning |
|---|---|
| Provider | aws / azure / gcp. |
| Region/zone | The region or zone the observation was taken in. |
| Instance type | Provider instance type / VM size / machine type. |
| Image/kernel | Boot image identifier and guest kernel version. |
| Source command or URL | The exact API/CLI command or official doc URL. |
| Retrieval date | Date the source was read or the probe was captured. |
| Visible PCI/storage/network devices | Devices the guest enumerates (lspci, block/net inventory). |
| Visible IOMMU tables/groups | ACPI DMAR/IVRS/IORT presence and /sys/kernel/iommu_groups. |
| Provider-side isolation notes | Documented host-side isolation (support-policy assumption, not proof). |
| Guest-programmable remapping observations | Whether the guest can discover, program, and validate a remapping authority. |
| Runtime backend inferred by capOS | The backend capOS would select from observations (see classification rules). |
| Support-policy status | Coarse advertised-target roll-up: Direct-remapping / Labeled-bounce-buffer / Unsupported, pending attended sign-off. |
Seed Rows (docs/API-derived, no safety claim)
These rows are seeded from documentation and API fields only. Observation and backend columns are intentionally blank because no instance was probed; they are filled by a later credentialed cloud-run task. No row asserts that any shape is safe for direct DMA.
| Provider | Example shape | Documented NIC surface | Documented storage surface | Remapping observation | Backend |
|---|---|---|---|---|---|
| aws | Nitro instance, enaSupport=required, nvmeSupport=required | ENA (SR-IOV) | NVMe EBS + optional instance-store NVMe | not yet probed | not yet selected |
| azure | Size with AcceleratedNetworkingEnabled=True | SR-IOV VF (MANA/ConnectX) bonded to synthetic hv_netvsc | Managed disk (transport per shape) | not yet probed | not yet selected |
| gcp | 3rd-gen+ machine type (e.g. C3) | gVNIC only | NVMe Local SSD / PD per family | probed 2026-05-24: IOMMU disabled, SWIOTLB (see GCE Live Probe Results) | labeled bounce-buffer |
| gcp | 1st/2nd-gen, x86, non-Confidential, under 50 Gbps | VirtIO-Net or gVNIC | virtio-scsi PD / Local SSD (NVMe or SCSI) | probed 2026-05-24: IOMMU disabled, SWIOTLB (see GCE Live Probe Results) | labeled bounce-buffer |
GCE Live Probe Results (2026-05-24)
These rows replace the GCE “not yet probed” placeholders with live guest
observations. Four representative shapes were booted on Google Compute Engine
(stock Debian 12, kernel 6.1.0-47-cloud-amd64) in a dedicated test project,
each running a /sys- and /proc-only probe delivered through instance
metadata and read back over the serial console. Every instance booted with no
external IP, no service account, and was deleted immediately after its probe
output was captured.
| Machine type | Class | NIC driver | Storage | Guest IOMMU / DMAR | DMA path |
|---|---|---|---|---|---|
n1-standard-1 | 1st-gen | virtio_net | virtio-scsi (sda) | intel_iommu=off, DMAR: IOMMU disabled, no DMAR table, empty iommu_groups | SWIOTLB software bounce buffering |
e2-small | 2nd-gen | virtio_net | virtio-scsi (sda) | same: IOMMU disabled, no DMAR, no groups | SWIOTLB |
c3-standard-4 | 3rd-gen Intel | gvnic | nvme Local SSD (Google vendor 0x1ae0) | same | SWIOTLB |
n2d-standard-2 Confidential | AMD SEV | gvnic | nvme | same; additionally Memory Encryption Features active: AMD SEV | SWIOTLB forced (512 MB) |
Verbatim kernel evidence common to all four shapes:
- the boot command line carries
intel_iommu=off; DMAR: IOMMU disabled;PCI-DMA: Using software bounce buffering for IO (SWIOTLB);/sys/kernel/iommu_groupsis empty, and noDMAR,IVRS, orIORTtable is present under/sys/firmware/acpi/tables/.
The Confidential (SEV) shape additionally logs software IO TLB: Memory encryption is active and system is using DMA bounce buffers, confirming that
bounce buffering is enforced by memory encryption, not merely by configuration.
Classification. No probed GCE shape – neither the older virtio surface nor
the modern gVNIC/NVMe surface – exposes a guest-programmable IOMMU that capOS
could discover, program, and validate. By the
classification rules this rules out the direct-remapping
backend and selects the labeled bounce-buffer fallback for the cloud path on
these shapes. On the Confidential VM the bounce-buffer path is a hardware
invariant: the device cannot reach encrypted guest memory directly. This is a
fail-closed observation, not a hostile-hardware isolation claim; the binding
backend selection and any “supported shape” advertisement remain attended
sign-off work in cloud-dma-backend-selection.
Design implication for GCP storage/NIC drivers. A provider-side or
hypervisor-side IOMMU may still protect Google infrastructure, but that is not
guest-programmable remapping authority for capOS. On the probed GCE shapes a
capOS userspace storage or NIC provider must therefore be planned as a
no-IOMMU, brokered-bounce design: userspace receives buffer capabilities,
grant IDs, or typed commands, while the kernel or device manager materializes
the device-visible queue-base, descriptor, PRP/SGL, or virtqueue address fields.
The direct-remapping lane remains valid for QEMU run-iommu-remapping and for
future cloud/hardware shapes that expose a guest-programmable remapping unit;
it is not a GCP premise today. The generic design consequences are recorded in
DMA User-Space Driver Isolation.
Runtime Probe Protocol
A later credentialed cloud-run task captures the following from the guest, with the region/zone, image, kernel, and retrieval date recorded for each command. Capture the verbatim command output as evidence; do not summarize it.
lspci -nnk -D– PCI topology with full domain:bus:device.function, vendor/ device IDs, and bound kernel driver per function (NIC, storage controller, accelerator identity).ls /sys/kernel/iommu_groups(and per-groupdevices/) – whether the guest sees IOMMU groups at all, and how devices are grouped.- ACPI table presence: DMAR (Intel VT-d), IVRS (AMD-Vi), IORT (Arm SMMU)
under
/sys/firmware/acpi/tables/. Absence is itself evidence. - Kernel log IOMMU/SWIOTLB lines (
dmesg | grep -iE 'iommu|dmar|ivrs|iort|swiotlb') – whether the kernel enabled an IOMMU, fell back to software bounce (SWIOTLB), or found no remapping unit. - Network driver identity:
ethtool -i <iface>and the bound driver (ena,mana/mlx5_core,gve,virtio_net). - Block transport identity:
lsblk -o NAME,TRAN,MODELand controller driver (nvme,virtio_blk,virtio_scsi). - NVMe inventory:
nvme listandnvme id-ctrl <dev>for controller identity where NVMe is present.
A probe result is only usable evidence if capOS could perform the equivalent discovery from its own ACPI/PCI enumeration; the Linux commands above stand in for that discovery during the research phase.
Classification Rules
These rules are deliberately fail-closed and feed the
runtime backend inferred by capOS and support-policy status columns.
- SR-IOV, a virtual NIC (ENA, gVNIC, MANA, virtio-net), a GPU, an accelerator, or local NVMe identifies a DMA-capable or DMA-adjacent surface. This is the presence of a device that does or could bus-master; it is not a safety claim.
- A direct-remapping classification requires guest-programmable remapping
authority that capOS can discover, program, and validate – a usable Intel
VT-d, AMD-Vi, or Arm SMMU unit the guest controls, with translation, fault,
and invalidation behavior matching
docs/research/iommu-remapping.md. A DMA-capable surface alone never implies this. - Provider-side isolation facts (host-enforced VPC isolation, Nitro/host data- path bypass, hypervisor-side IOMMU) are support-policy assumptions, not proof that capOS can safely use direct DMA from inside the guest.
- Ambiguous, contradictory, or unvalidated observations select
Unsupported. This matches the assurance model: unknown or contradictory observations selectUnsupported, not an optimistic default.
These map onto the three backend candidates in the assurance model
(docs/proposals/dma-assurance-model-proposal.md): a direct remapping domain, a
labeled bounce-buffer fallback (direct_dma=blocked, all device-visible memory
manager-owned, no host physical address exposed, hostile-hardware isolation not
claimed), or Unsupported.
Relationship to Backend Selection
cloud-dma-backend-selection consumes this inventory: it maps each backend
candidate to the assurance-model invariants, fills the evidence matrix per cloud
VM shape, and drafts the downstream-contract scaffolding (which device-manager
policy fields a driver declares – direct_dma, trusted_domain,
bounce_buffer – and which stale-handle/stale-completion/teardown/
no-host-physical-exposure gates each candidate must satisfy). That task already
declares this inventory as a dependency. The binding backend selection and any
per-shape safety assertion remain attended-sign-off work and are not made here.
Relevant Research and Grounding
docs/research/iommu-remapping.md– primary-source Intel VT-d/AMD-Vi/QEMU remapping grounding the direct-DMA classification depends on.docs/proposals/dma-assurance-model-proposal.md– the model objects, invariants, and backend-candidate matrix this evidence feeds.docs/dma-isolation-design.md– the manager-owned DMA isolation contract and bounce-buffer fallback the labeled-fallback candidate must satisfy.docs/proposals/cloud-deployment-proposal.md– the cloud deployment context for the usable-instance milestone.docs/tasks/cloud-dma-backend-selection.md– the backend decision that consumes this inventory.
Research: Future Scheduler Architecture
This note records the prior art checked for future capOS scheduling work after the first SMP and per-thread ring milestones exposed that scheduler structure, not only timer programming, will decide whether capOS scales.
Local Grounding
Existing capOS documents already cover part of the answer:
- Scheduling: scheduler architecture including
Phase D WFQ policy, Phase E
SchedulingContext, Phase FCpuIsolationLeaseand CPL0 idle thread, LAPIC timer/IPI foundation, global run queue, per-thread rings, timer sleep waiters, park waiters, direct IPC handoff, and SMP state. - SMP Phase C: active multi-CPU execution and in-process thread-scaling proof work.
- SMP: accepted SMP direction.
- Ring v2 For Full SMP: per-thread completion ownership and SQPOLL preconditions.
- Tickless and Realtime Scheduling: tickless idle, SQPOLL nohz, deadlines, scheduling contexts, donation, and realtime islands.
- Stateful Task and Job Graphs: durable work graphs, assignment metadata, graph-run state, and the stop line that graph coordinators must not become authority-holding god objects.
- NO_HZ, SQPOLL, and Realtime Scheduling: Linux
NO_HZ, clocksource/clockevent, CPU isolation, SQPOLL,
SCHED_DEADLINE, PREEMPT_RT, and seL4 MCS grounding. - Out-of-kernel scheduling: kernel mechanism versus user-space policy split.
- Completion rings and threaded runtimes:
completion ownership,
io_uring, futex, and IOCP lessons. - Multimedia pipeline latency and Robotics realtime control: admitted realtime-island use cases.
External Sources Checked
- Linux kernel documentation, CFS Scheduler.
- Linux kernel documentation, EEVDF Scheduler.
- Linux kernel documentation, Deadline Task Scheduling.
- Linux kernel documentation, Extensible Scheduler Class.
- Linux kernel documentation, NO_HZ: Reducing Scheduling-Clock Ticks.
- Linux kernel documentation, CPU Isolation.
- Linux kernel documentation, Housekeeping.
- Linux kernel documentation, PREEMPT_RT theory of operation.
- FreeBSD manual, ULE scheduler.
- seL4 documentation, MCS Extensions tutorial.
- Anderson, Bershad, Lazowska, and Levy, Scheduler Activations.
- Ghosh et al., ghOSt: Fast & Flexible User-Space Delegation of Linux Scheduling.
- Ousterhout et al., Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads.
- Fried et al., Caladan: Mitigating Interference at Microsecond Timescales.
- Kaffes et al., Shinjuku: Preemptive Scheduling for microsecond-scale Tail Latency.
- Qin et al., Arachne: Core-Aware Thread Management.
Findings
Fair General-Purpose Scheduling
Linux CFS established the now-common model that ordinary tasks should be ordered by virtual runtime, not by a fixed time-slice list. Linux EEVDF keeps the fair-scheduler lineage but chooses the eligible task with the earliest virtual deadline, using request size and lag to improve latency and fairness.
The capOS consequence is not “import Linux CFS.” It is:
- ordinary best-effort work should use virtual-time accounting (Phase D WFQ is now the active policy; the earlier FIFO round-robin was the bootstrap);
- latency-sensitive best-effort work should have bounded, policy-visible request sizes or weights rather than hidden scheduler magic;
- per-CPU run queues are a prerequisite before any EEVDF-like policy matters at SMP scale.
EEVDF is the strongest candidate for the next capOS best-effort policy evolution after WFQ. It should follow WFQ rather than replace it immediately because it depends on accurate runtime charging, per-CPU runnable ownership, and migration accounting that are not yet in place.
Per-CPU Run Queues and Topology
Linux and FreeBSD both make per-CPU scheduler state the normal SMP unit. FreeBSD ULE additionally exposes CPU topology and affinity as first-class placement concerns. This matches the current capOS scaling evidence: one global scheduler lock and one global run queue make every CPU contend on the same state even after per-thread rings remove the process-wide CQ bottleneck.
The near-term capOS scheduling architecture should split:
- per-CPU current thread and run queue ownership;
- cross-CPU wakeup and migration paths;
- shared process/thread metadata protected by narrower locks;
- placement policy from dispatch mechanism;
- diagnostic counters for lock hold/spin time, migration, steals, and IPIs.
Realtime and Temporal Isolation
Linux SCHED_DEADLINE uses EDF plus Constant Bandwidth Server-style budget,
deadline, and period parameters. Its key lesson for capOS is admission:
deadline scheduling without bandwidth control is only a priority policy, not
a guarantee.
seL4 MCS is the more capability-native precedent. CPU time is represented by scheduling-context objects. Passive servers can run on caller-donated CPU time, avoiding priority inversion across synchronous IPC. This maps directly to capOS endpoint services and direct IPC handoff.
The capOS split should remain:
SQE.deadline_ns: request freshness and propagation metadata;SchedulingContext: spendable CPU-time authority;- donation: temporary transfer of CPU budget/deadline along a synchronous capability path;
RealtimeIsland: admitted bundle of scheduling contexts, memory/device reservations, communication paths, and overrun policy.
Tickless, Isolation, and Housekeeping
Linux NO_HZ and CPU isolation reinforce that tick suppression is not one feature. Idle tickless is a timer cleanup. Full-nohz is an isolation contract that also needs housekeeping CPUs, accounting, timer migration, deferred work placement, and revocation latency policy.
For capOS, this grounding shaped the implementation order: automatic nohz
activation for the narrow single-runnable-entity window and SQPOLL-driven
auto-nohz for ring-coupled leases are now implemented (Phase F), both tied
to the CpuIsolationLease with housekeeping, deferred-work placement,
clockevent deadline substrate, one-SQ-consumer ownership, and fail-closed
rollback prerequisites satisfied first. Generic full-nohz for explicitly
budgeted compute leases, timeout-based auto-revoke, and SQPOLL nohz for
explicitly leased caller-thread rings have since landed. Remaining future work
includes:
- full-nohz tied to policy-service issuance and durable monitoring telemetry;
- SQPOLL nohz beyond the current caller-thread ring-coupled lease shape;
- realtime island nohz after admission proves unrelated work, IRQs, deferred frees, and timers are excluded or bounded.
Pluggable and User-Space Policy
Linux sched_ext and ghOSt show that fast scheduler experimentation is useful, but they also preserve privileged dispatch and enforcement. sched_ext runs BPF inside the kernel scheduler framework with fallback; ghOSt delegates policy to user-space agents while retaining kernel mechanisms for safety and preemption.
For capOS, the safe architecture is:
- keep dispatch, budget enforcement, interrupt handling, idle, and fallback in the kernel;
- expose policy knobs through capabilities;
- let a privileged scheduler-policy service own admission, budget selection, CPU partitioning, isolation leases, and tuning;
- call the policy service on configuration changes, depletion/timeout faults, and coarse placement events, not on every context switch.
Dynamic policy loading is a later experiment. It should not become the first way to make basic SMP scheduling scale.
Datacenter Runtime Schedulers
Shenango, Caladan, Shinjuku, and Arachne target microsecond-scale service latency by managing cores, preempting long request handlers, and separating fast user-level scheduling from coarser kernel control. They are useful because capOS will host services, agent runtimes, and network stacks that want low tail latency.
The shared lessons are:
- core grants are different from CPU-time budgets;
- user-level worker schedulers need kernel-visible blocking and preemption boundaries;
- tail-latency policies need request-level telemetry, not only thread-level CPU shares;
- cross-core coordination must be cheap enough that it does not dominate the service latency it tries to reduce.
For capOS this argues for scheduler hints and policy capabilities above the kernel mechanism, not a datacenter-specific kernel scheduler as the default.
Stateful Work Graphs
The stateful task/job graph proposal is related at the workload layer. A graph node can carry assignment metadata such as priority, deadline, budget, queue, and lease, and a domain coordinator can decide which node attempt is runnable inside that graph. That is not the same authority as kernel CPU scheduling.
The scheduler consequence is a clean boundary:
- graph/node priority is domain policy until translated by an authorized scheduler policy service;
- graph budgets reference resource profiles or scheduling contexts, but do not mint CPU time by themselves;
- graph deadlines may create request deadlines or admission inputs, but do not bypass scheduler admission;
- build, init, agent, and operator graph coordinators should lease work and consume scheduler primitives rather than owning a global CPU run queue;
- scheduler telemetry should be attachable to graph runs as artifacts, so a failed or slow job can explain whether it waited on authority, CPU budget, dependency state, I/O, or policy.
Recommended capOS Direction
- (Done: Phase D/E/F) Finish the current thread-scale evidence before
larger policy changes. Phase D WFQ, Phase E
SchedulingContext, and Phase FCpuIsolationLease/ auto-nohz / SQPOLL-coupled nohz have landed. - Split scheduler state into per-CPU runnable ownership and bounded cross-CPU wake/migration. (Per-CPU queues remain future work; Phase F.5.)
- Add precise CPU accounting and scheduler attribution before changing the default policy. (Attribution guardrails landed in Phase A; full per-CPU accounting is Phase F.5 follow-on.)
- Move ordinary best-effort work toward an EEVDF-like virtual-deadline policy after accounting and per-CPU queues exist. (WFQ is current; EEVDF is a follow-on evaluation deferred until per-CPU queues exist.)
- Keep
SCHED_DEADLINE/EDF-CBS and seL4 MCS as the precedent for admitted realtime work, but express CPU authority as capOSSchedulingContextcapabilities. (SchedulingContextis implemented;RealtimeIslandadmission is Phase G future work.) - Keep user-space scheduler policy coarse-grained and capability-authorized; do not consult a user process on every timer interrupt or dispatch.
- Treat SQPOLL, busy polling, and full-nohz as CPU-isolation leases with housekeeping and revocation constraints. (Ring-coupled SQPOLL nohz and generic full-nohz for explicitly budgeted compute leases are implemented; policy-service issuance remains future work.)
- Keep runtime schedulers above per-thread rings, futex/park/notification primitives, timers, and explicit thread objects.
The resulting target is a layered scheduler:
- Kernel dispatch/enforcement: per-CPU queues, context switch, idle, accounting, budget enforcement, timeout faults, direct IPC donation, and cross-CPU wake/migration.
- Kernel policy primitives: weights, virtual deadlines, scheduling contexts, CPU masks, isolation leases, and realtime-island admission hooks.
- Userspace policy: profiles, admission, budget selection, service/runtime hints, placement, diagnostics, and policy reload.
- Userspace runtimes: work stealing, actor queues, async reactors, service request schedulers, and language-specific M:N scheduling.
Open Questions
The questions below have been answered by Phase D/E/F implementation; they are kept for record and context:
- Answered (Phase D): WFQ is the first virtual-time policy; EEVDF is deferred until per-CPU queues and runtime charging exist.
- Answered (Phase D): The thread-scale milestone did not require a per-CPU queue split; WFQ on a global queue with per-thread weight sufficed.
- Answered (Phase E): The initial
SchedulingContextABI usesSchedulingContextSpec(weight, latency class, budget, period, overrun policy) andSchedulingContextInfofor info-only read access;SchedulingContext.info()is method id 0 for stability. Donation/return through endpoints is implemented; realtime island admission is future work. - Answered (Phase F):
CpuIsolationLeaserevocation interacts with session logout and process exit through lease-generation staling, which is the load-bearing rollback trigger; service replacement and process exit cleanup go through the same generation-staling path.
Remaining open questions:
- What is the minimum per-CPU queue split that closes the full-SMP scalability milestone (Phase F.5) without prematurely designing the full fair scheduler?
- How should policy-service issuance select and renew generic full-nohz and SQPOLL nohz leases beyond the current explicit local proofs?
- Which scheduler telemetry belongs in the always-on kernel and which belongs
behind the benchmark-only
measurefeature? - What is the right
RealtimeIslandadmission shape for admitted scheduling contexts, memory/device reservations, and overrun policy (Phase G)?
Research: NO_HZ, SQPOLL, and Realtime Scheduling
This note records the external grounding for capOS tickless idle, SQPOLL-oriented full-nohz CPU isolation, and future realtime scheduling contexts. It was written from the 2026-04-29 shared design discussion and checked against primary Linux/seL4 documentation.
Local Grounding
Relevant local docs:
- Scheduling: current LAPIC tick, bounded timeout waiters, timer-side ring polling, AP scheduler-owner proof, CPL0 idle-thread paths, and Phase F nohz/SQPOLL activation state machine.
- SMP: LAPIC/IPI foundation and deferred per-CPU run queue/concurrent scheduler ownership work.
- Ring v2 For Full SMP: per-thread rings and the rule that SQPOLL must have exactly one SQ consumer.
- Out-of-kernel scheduling: scheduling contexts, user-space policy, and kernel budget enforcement split.
- Multimedia pipeline latency: admitted realtime island model for media graphs.
- Robotics realtime control: scheduling-context authority, control-loop admission, and passive-server donation lessons.
- x2APIC and APIC virtualization: x2APIC as a later backend, not a prerequisite for the current xAPIC LAPIC timer path.
External Sources Checked
- Linux kernel documentation, NO_HZ: Reducing Scheduling-Clock Ticks.
- Linux kernel documentation, Clock sources, Clock events, sched_clock() and delay timers.
- Linux kernel documentation, High resolution timers and dynamic ticks design notes.
- Linux kernel documentation, hrtimers - subsystem for high-resolution kernel timers.
- Linux kernel documentation, CPU Isolation.
- Linux kernel documentation, Housekeeping.
- Linux man-pages project, io_uring_setup(2).
- Linux kernel documentation, Deadline Task Scheduling.
- Linux kernel documentation, PREEMPT_RT theory of operation.
- seL4 documentation, MCS Extensions tutorial.
NO_HZ Findings
Linux separates three timer policies:
- periodic scheduler ticks;
- tick suppression only while a CPU is idle (
NO_HZ_IDLE); - adaptive tick suppression for CPUs with one runnable task (
NO_HZ_FULL).
The first capOS target should match the conservative shape of NO_HZ_IDLE,
not Linux NO_HZ_FULL. The Linux docs explicitly call idle tick suppression
common/default-useful, while NO_HZ_FULL is specialized for realtime and HPC
loads and requires at least one non-adaptive CPU for timekeeping. That maps to
capOS because the current scheduler tick still performs too much work:
timeout expiry, waiter wakeup, run-queue rotation, timer-side ring dispatch,
and transitional network polling.
Linux also records a cost: dyntick-idle adds instructions on idle entry/exit
and may require expensive clockevent reprogramming. capOS should therefore
add counters before changing behavior and should retain a runtime
ForcedPeriodic fallback.
Timekeeping Findings
Linux’s timer stack distinguishes:
- clock sources: monotonic timeline counters;
- clock events: hardware devices that interrupt at selected future times;
- scheduler ticks: one user of clock events, not the timebase itself.
This split is the important design point for capOS. Current TICK_COUNT style
timekeeping is adequate for periodic scheduling but becomes the wrong owner
once the scheduler can stop the tick. capOS should introduce a monotonic
now_ns clocksource layer before enabling tickless idle.
Linux hrtimers provide two lessons without requiring capOS to clone the whole subsystem:
- waiters should be stored by absolute expiry time, not by periodic tick count;
- time-ordered expiry structures simplify deadline-based wakeup and avoid scanning every timer on every tick.
capOS already bounds waiter counts, so the first implementation can use a
small ordered array, BTreeMap, or heap. The security property is bounded,
non-allocating interrupt-path expiry, not a specific data structure.
CPU Isolation and Housekeeping Findings
Linux CPU isolation treats housekeeping as first-class work: unbound timers, workqueues, maintenance, statistics, deferred cleanup, watchdog work, and remote scheduler ticks must move away from isolated CPUs or be explicitly disabled. Linux also requires at least one housekeeping CPU.
For capOS this means full-nohz must not be modeled as a timer flag. It is a CPU ownership contract:
isolated CPU = no unrelated runnable work + no unbound kernel work + explicit
wake/deadline events only
The same rule applies whether the isolated entity is a kernel SQPOLL worker,
a userspace poller, or a future admitted realtime loop. CpuIsolationLease
names the owner, allowed CPU set, allowed mode, accounting target, and
revocation policy. It performs real per-CPU periodic-tick suppression for the
narrow single-runnable-entity window (Phase F closed), and a ring-coupled
kernelSqpoll lease suppresses ticks while its bound ring is in SQPOLL
running/sleeping mode with a live owner (SQPOLL-driven auto-nohz closed).
Without a CpuIsolationLease, a latency-sensitive hint must not grant exclusive
CPU access. Generic full-nohz for explicitly budgeted compute threads, a
generic SQPOLL nohz state machine for explicitly leased caller-thread rings,
and timeout-based auto-revoke have since landed. Broader
userspace-poller/device-queue issuance remains future work.
io_uring SQPOLL Findings
Linux IORING_SETUP_SQPOLL creates a kernel thread that polls the submission
queue. While it remains active, applications can publish SQEs and observe CQEs
without entering the kernel on each submission. When the poller sleeps after
its idle period, it sets IORING_SQ_NEED_WAKEUP; userspace must call
io_uring_enter(..., IORING_ENTER_SQ_WAKEUP) or let liburing do that wake.
The capOS consequence is not “copy io_uring”. It is an ownership rule:
SQPOLL ring: kernel worker owns SQ head; userspace owns SQ tail and CQ head;
cap_enter does not become a second SQ consumer.
This requires Ring v2 or an equivalent per-thread ring endpoint. The current process-wide ring and timer-side ring polling are incompatible with safe SQPOLL because they cannot prevent two consumers from draining the same SQ.
SQPOLL full-nohz required: per-thread rings; a ring mode bit and quiescent mode
transitions; per-CPU scheduler ownership and reschedule IPIs; a housekeeping
CPU; removal or explicit placement of scheduler-tick-polled networking. Those
prerequisites are now closed (Phase F one-SQ-consumer, bounded SQPOLL ring
mode, housekeeping/deferred-work placement, per-CPU idle thread). SQPOLL-driven
nohz activation is implemented for explicitly leased caller-thread
kernelSqpoll rings, including producer wake, bounded service progress,
rollback, and stale-owner rejection. Broad userspace-poller/device-queue policy
issuance remains future work.
Realtime Findings
Linux SCHED_DEADLINE uses runtime, deadline, and period parameters and
depends on admission/bandwidth management. Its documentation is explicit that
without admission control, no scheduling guarantee follows. That directly
separates per-request deadline metadata from CPU budget authority.
PREEMPT_RT’s main lesson is that realtime latency is destroyed by long non-preemptible sections, unbounded interrupt handling, and priority inversion. Linux addresses this by making most kernel execution schedulable, using priority-inheritance-aware locks, and threading interrupts. capOS does not need to clone PREEMPT_RT, but any realtime path must keep IRQ top halves short, avoid blocking locks in admitted hot paths, and provide donation or inheritance for capability service calls.
seL4 MCS provides the strongest capability-OS precedent. Scheduling contexts are kernel objects representing CPU-time authority; they carry budget and period, are configured through per-CPU scheduling-control authority, and are enforced with a sporadic-server model. Passive servers can run on a caller’s donated scheduling context and return it on reply.
For capOS:
SQE.deadline_nsis request freshness metadata.SchedulingContextis CPU-time authority.RealtimeIslandis the admission object for a whole graph/loop.- Scheduling-context donation is how timing survives synchronous capability calls through passive services.
- SQPOLL and AutoNoHz are executor/isolation backends, not the realtime authority itself.
capOS Design Consequences
- Implement tickless idle before full-nohz.
- Split clocksource from clockevent before stopping periodic ticks.
- Convert timeout waiters to absolute monotonic deadlines before one-shot scheduling.
Replace user-mode idle with kernel/per-CPU idle before real tickless idle.Done: the scheduler idle path is a CPL0 per-CPU kernel idle thread; the user-mode idle process is removed.- Keep periodic preemption while there is runnable contention.
- Keep networking in
ForcedPeriodicor move it to explicit IRQ/deadline polling before enabling tickless on network-active CPUs. Network-polling placement is landed as a fail-closed admission gate; placement routing for arbitrary network-active CPUs remains future work. - Treat full-nohz as a CPU lease and housekeeping design, not a standalone
timer optimization.
CpuIsolationLeaseis now implemented, generic full-nohz is landed for explicitly budgeted compute leases, and policy-service issuance remains future work. Add SQPOLL only after per-thread rings and per-CPU scheduler ownership.Done: one-SQ-consumer ring ownership, bounded SQPOLL ring mode, and SQPOLL-driven auto-nohz activation are all closed.- Require one SQ consumer per ring mode. Done: enforced by the Phase F one-SQ-consumer ring ownership gate.
- Use
SQE.deadline_nsonly for freshness/drop/propagation policy; put budget, period, priority, CPU mask, and overrun policy inSchedulingContext. - Use realtime islands for media/robotics/control graphs; reject hard realtime claims until kernel path, IRQ, device, and WCET evidence exist.
Research: Time and Clock Authority in Operating Systems
This note records verified external grounding for capOS’s time and clock
authority design. It covers Linux clock IDs and privilege model, time
namespaces, NTP/chrony discipline, PTP/IEEE-1588, Fuchsia’s UTC clock object,
and leap-second handling. Findings feed directly into the WallClock /
ClockDiscipline / ClockProvenance design in
Time and Clock.
1. Linux: Clock IDs and the Read/Discipline Split
Clock IDs
Linux exposes multiple clock IDs through clock_gettime(2):
CLOCK_REALTIME— settable system-wide wall clock. Measures seconds since the Unix epoch. Can jump forward or backward when disciplined bysettimeofdayor NTP. RequiresCAP_SYS_TIMEto set.CLOCK_MONOTONIC— non-settable system-wide monotonic clock. Counts from an unspecified boot-adjacent point. Cannot jump; unaffected by NTP steps; responds to frequency adjustments only. Does not include suspend time.CLOCK_BOOTTIME— identical toCLOCK_MONOTONICbut includes suspended time. Non-settable. Useful for suspend-aware timers withoutCLOCK_REALTIMEjump exposure.CLOCK_TAI— non-settable clock based on wall time but counting leap seconds (TAI = International Atomic Time). UnlikeCLOCK_REALTIME, it has no discontinuity on leap second insertion.
The CAP_SYS_TIME Privilege
CAP_SYS_TIME gates all operations that modify the kernel clock:
settimeofday(2), stime(2), adjtimex(2)/clock_adjtime(2) when
modes != 0, and setting the hardware RTC. Reading the clock — including a
read-only adjtimex call with modes = 0 — requires no privilege. The
clock_adjtime(2) variant (added in Linux 2.6.39) accepts an additional
clk_id argument so callers can target a specific clock rather than only the
system-wide realtime clock.
Concretely: any process can call clock_gettime(CLOCK_REALTIME, &ts) without
privilege; only a privileged NTP daemon calls adjtimex() or
clock_settime(CLOCK_REALTIME, &ts).
Lesson for capOS
This is the direct prior art for splitting WallClock (read-only cap, granted
to ordinary processes) from ClockDiscipline (stronger cap, held only by the
designated sync service). The Linux CAP_SYS_TIME flag is a coarse ambient
privilege bit; capOS encodes the same split as two distinct capability types,
with no ambient privilege required and no escalation path between them.
2. Linux Time Namespaces
What Is Namespaced
Linux time namespaces (added in Linux 5.6) let processes inside a namespace
observe different values for CLOCK_MONOTONIC and CLOCK_BOOTTIME than the
host. The per-namespace offsets are written to /proc/pid/timens_offsets before
any process enters the namespace; once the first process has entered, writes
return EACCES. The format is:
<clock-id> <offset-secs> <offset-nanosecs>
CLOCK_REALTIME is deliberately not namespaced: the kernel documentation
cites “reasons of complexity and overhead” — in practice, CLOCK_REALTIME is
already settable and the step/slew machinery is not per-namespace.
The offsets are pure integers (seconds + nanoseconds); there is no per-namespace frequency correction or NTP discipline within the namespace. This feature is primarily used for container checkpoint/restore (CRIU) where the monotonic clock must appear consistent before and after migration.
Lesson for capOS
Time is not an ambient global fact — it can be a per-context offset applied to
a shared monotonic base. capOS’s WallClock cap fits this shape directly: the
cap object holds the offset from the kernel monotonic timeline to the wall epoch,
and different processes can hold caps with different offsets (timezone,
test-clock injection, container clock virtualization). Freezing offsets at
namespace creation maps to the capOS invariant that WallClock cannot be
retroactively shifted by the holder — only ClockDiscipline can adjust the
shared reference.
3. NTP Discipline: chrony and ntpd
Step vs. Slew
NTP daemons correct clock drift using two mechanisms:
- Slew (gradual): adjust the clock frequency to converge slowly. Linux
adjtime(3)/adjtimex(ADJ_OFFSET)implements slew. Default rate is bounded to 500 ppm; corrections over 0.5 seconds are clamped. This preserves monotonicity. - Step (abrupt): directly set the clock to the reference value. Breaks timestamp ordering for any process comparing consecutive readings across the step.
chrony makestep: makestep threshold limit allows stepping if the offset
exceeds threshold seconds, but only within the first limit clock updates.
For example, makestep 1.0 3 steps for offsets over 1 second during the first
three updates, then slews only thereafter. A negative limit removes the
update-count restriction entirely. After an initial step, chrony reverts to pure
slew to protect running applications from abrupt clock changes.
Leap Second Handling (leapsecmode)
chrony supports four modes for the UTC leap second insertion:
system(default): the kernel steps the clock at the UTC boundary.step: chronyd performs the step rather than delegating to the kernel.slew: the leap second is absorbed by slewing (~12 seconds of correction at the default 500 ppm rate on Linux).ignore: no automatic correction; the offset is absorbed during normal tracking.
For servers distributing time to clients unaware of leap seconds, chrony
combines leapsecmode slew with smoothtime to smear the correction outward
over up to 17 hours 34 minutes (when limiting slew to 1000 ppm).
Sync State Exposure
chronyc tracking reports the reference source, stratum, system time offset,
frequency error, and RMS offset. chronyc sourcestats shows per-source
statistics. These are the client-visible trust/sync signals that a capOS
ClockProvenance would encode — the binary ntpSynced or ptpSynced flag plus
an error bound.
Lesson for capOS
ClockDiscipline.step() and ClockDiscipline.slew() as distinct cap methods
are justified by this split: an NTP daemon that calls step() at startup but
only slew() at steady state exposes its policy at the capability boundary.
Callers that need monotonic-safe time can check ClockProvenance to distinguish
a recently-stepped clock from a stably-slewed one.
4. PTP / IEEE-1588: Hardware Timestamping
What PTP Provides
IEEE 1588 Precision Time Protocol synchronizes clocks using timestamps captured by NIC hardware at the Media Independent Interface (MII) boundary, typically within 100 ns of frame ingress/egress. This eliminates software scheduling jitter that limits NTP to millisecond accuracy. With hardware support, PTP achieves sub-microsecond accuracy.
Linux implements PTP through ptp4l (PTP daemon managing the protocol state
machine) and phc2sys (synchronizing the hardware PTP clock to the system
clock). ptp4l can configure a system as an Ordinary Clock (single port) or
Boundary Clock (multi-port).
Use Cases vs. NTP
NTP is adequate for general server synchronization (sub-10 ms, typically 1–10 ms LAN, sub-ms with GPS). PTP is used where sub-microsecond accuracy is required: industrial automation, 5G RAN timing, financial trading, and audio/video bridging (AVB/TSN). The distinction is hardware timestamping support in the NIC and a local Grandmaster or GNSS-disciplined boundary clock.
Lesson for capOS
Provenance is not binary (synced vs. unsynced). The ptpSynced vs ntpSynced
distinction in ClockProvenance is justified: a process requiring microsecond
timestamps for audio-visual synchronization or hardware scheduling needs to
distinguish PTP discipline from NTP discipline. A cap validator checking
ClockProvenance before accepting a timestamp for a hard real-time claim should
require ptpSynced and an error bound below the application’s tolerance.
5. Fuchsia / Zircon: UTC Clock Objects
Clock as a Kernel Object
Fuchsia models UTC time as a first-class kernel object (zx_clock_t), not as
a syscall or global variable. A clock is a one-dimensional affine
transformation of the monotonic reference timeline, maintained atomically and
observed through typed operations.
Rights Model
Zircon clock handles carry typed rights:
ZX_RIGHT_READ: permitszx_clock_read()(read current time) andzx_clock_get_details()(read transformation parameters and error bound).ZX_RIGHT_WRITE: permitszx_clock_update()— adjusting the clock’s absolute value, frequency (in ppm), and error bound (in nanoseconds).
Any process holding ZX_RIGHT_WRITE acts as a clock maintainer. There is no
separate “maintain” right; the write right IS the maintain authority.
Monotonic option: clocks created with ZX_CLOCK_OPT_MONOTONIC reject any
zx_clock_update() that would cause the clock to go backward.
Continuous option: clocks created with ZX_CLOCK_OPT_CONTINUOUS allow
setting the absolute value only on the first update; subsequent absolute-value
changes are rejected, allowing only frequency adjustments.
UTC Maintainer Service
All components started by Fuchsia’s Component Manager receive a UTC clock handle
with read-only rights. Only the Timekeeper service receives the write handle.
Timekeeper synchronizes against an RTC or a network time source and calls
zx_clock_update() to discipline the UTC clock.
The UTC clock has a “backstop” guarantee: it never reports a time earlier than the timestamp of the latest build commit (the backstop value). Before Timekeeper first synchronizes, the clock may be in a fixed state (stopped at backstop) or running-but-unsynchronized state. Fuchsia documents that the UTC clock “is neither monotonic nor continuous” — Timekeeper may step it backward when corrections are needed. Callers needing a reliable timestamp must query the clock details to determine whether the clock has been synchronized.
Lesson for capOS
This is the closest capability-native precedent for capOS’s design. The mapping:
| Fuchsia/Zircon | capOS |
|---|---|
Clock kernel object with ZX_RIGHT_READ handle | WallClock capability (read-only) |
Clock handle with ZX_RIGHT_WRITE held by Timekeeper | ClockDiscipline capability (init-granted) |
zx_clock_get_details() error bound and sync signal | ClockProvenance label on WallClock |
| Backstop guarantee (never before build timestamp) | Provenance downgrades on suspend/resume or loss of sync |
ZX_CLOCK_OPT_MONOTONIC flag | The invariant that Timer.now() monotonic base is never adjusted |
The Fuchsia UTC design confirms that the right model is: one strong-authority maintainer, many read-only observers, with a typed signal for trust state. capOS extends this by making provenance an explicit labeled field on the cap rather than a query-on-demand operation.
6. Leap Seconds and Clock Steps: Smearing vs. Stepping
The Problem
UTC inserts or deletes leap seconds at irregular intervals, decided by the
International Earth Rotation and Reference Systems Service (IERS). Inserting a
leap second means UTC has a second labeled 23:59:60 before rolling to midnight,
creating a discontinuity in POSIX time (which counts seconds without leap
seconds). Deleting a leap second would mean skipping a second.
For software:
- Stepping:
CLOCK_REALTIMEjumps by ±1 second at the UTC boundary. Any application comparing twoCLOCK_REALTIMEreadings across the boundary sees a negative elapsed time (on insert) or a missing second (on delete).CLOCK_MONOTONICmust not step; it continues forward through the leap second unaffected. - Slewing/Smearing: the correction is distributed over a window. No
discontinuity occurs, but
CLOCK_REALTIMEtemporarily deviates from true UTC during the smear window.
Industry Smear Practice
Google has applied a 24-hour linear smear (noon-to-noon UTC) since 2008: each second in the smear window is ~11.6 µs longer than an SI second. AWS’s Amazon Time Sync Service applies the same 24-hour noon-to-noon linear smear automatically. Both services suppress the leap second indicator on their NTP responses so clients do not attempt their own step.
The smear approach means that any client synchronized to Google Public NTP or Amazon Time Sync is not tracking true UTC during the smear window — it tracks “smeared UTC”, which is coordinated but not the same as civil UTC. This is a design choice accepting brief inaccuracy for availability of monotonic-safe time.
CLOCK_MONOTONIC Must Not Jump
CLOCK_MONOTONIC is specifically designed to be immune to steps. Linux
documents it as “nonsettable” — no process can set it; only frequency
adjustments are permitted. The rationale: timers, timeouts, and scheduling
deadlines depend on monotonic ordering. Any step in the monotonic timeline would
silently break all in-flight waiters.
Lesson for capOS
The monotonic timeline (Timer.now()) must be the invariant substrate.
WallClock is a separate, disciplinable offset layered on top. A
ClockDiscipline.step() call adjusts the wall-clock offset without touching the
monotonic base — ensuring in-flight ring timeouts and scheduler deadlines are
never invalidated. The ClockProvenance.lastStep timestamp lets an auditor see
when the wall clock was last stepped, so validators can reject timestamps taken
during or shortly after a step if their use case requires continuity.
Applicability to capOS
Read vs. Discipline Authority
Every system surveyed maintains a hard split between reading time (no privilege required, granted to all processes) and adjusting time (strong authority, held by one designated service):
- Linux:
clock_gettime(unprivileged) vsadjtimex/CAP_SYS_TIME(privileged) - Fuchsia:
ZX_RIGHT_READhandle (distributed to all components) vsZX_RIGHT_WRITEhandle (held only by Timekeeper) - chrony/ntpd: any client queries sync state; only the daemon calls
adjtimex
capOS should encode this as: WallClock (read-only cap, grantable and
attenuable) and ClockDiscipline (separate stronger cap, init-granted at boot,
not transferable through normal cap-grant paths).
Clock Provenance as a Typed Signal
Fuchsia’s per-clock error bound and sync signal, and chrony’s tracking
command, both expose metadata about trust state alongside the time value
itself. capOS’s ClockProvenance label on WallClock captures this: a
validator that needs trustworthy time checks provenance rather than relying on
the presence of the cap alone.
The ptpSynced / ntpSynced distinction maps directly to the PTP vs NTP
accuracy gap: hardware timestamping is a stronger claim than software NTP, and
an OS-level audit trail needs to encode which applies.
Wall Clock as a Granted, Attenuable Cap
Linux time namespaces demonstrate that clock offsets can be virtualized
per-context rather than being a single global ambient fact. capOS takes this
further: WallClock is a capability object, not a process-wide environment
variable. A test harness can inject a fake WallClock; a container process can
receive a WallClock with a different UTC offset (timezone) without any global
state change; a WASI host adapter can supply a per-instance WallClock to each
wasm module without sharing a mutable global.
Step vs. Slew as Distinct Cap Methods
chrony’s makestep and leapsecmode options distinguish step (abrupt
correction) from slew (rate adjustment). capOS should expose these as distinct
ClockDiscipline methods so the discipline policy is explicit at the capability
boundary — a sync service can be audited for whether it steps or only slews,
and the ClockProvenance.lastStep field makes a step visible to downstream
validators.
Monotonic Invariant Is Non-Negotiable
Every surveyed system — Linux CLOCK_MONOTONIC, Fuchsia ZX_CLOCK_OPT_MONOTONIC,
chrony slew-only mode — treats monotonic ordering as inviolable. Any step in
the monotonic timeline breaks in-flight timers, scheduling deadlines, and ring
timeouts. capOS’s Timer.now() monotonic base must never be adjusted; only the
wall-clock offset layered above it is disciplinable.
Audit Timestamps and Trusted Time
Audit log entries in capOS will carry timestamps. The ClockProvenance label on
the WallClock used to generate those timestamps becomes the evidence of
timestamp trustworthiness: an audit consumer can reject entries generated while
provenance was unsynchronized or stepped (within a recency window after a
step), rather than silently accepting timestamps of unknown reliability.
WASI Realtime Clock Mapping
WASI Preview 1 clock_time_get(CLOCKID_REALTIME) maps naturally to
WallClock.wallTime(). A per-instance WASI WallClock cap — granted at module
instantiation — means a wasm module receives the same read-only, provenance-labeled
time view that native capOS services receive, with no special privilege and no
ambient global.
Sources
- clock_gettime(2) — Linux manual page
- capabilities(7) — Linux manual page
- adjtimex / clock_adjtime(2) — Ubuntu manpage
- time_namespaces(7) — Linux manual page
- Clock — Fuchsia reference (kernel objects)
- UTC behavior — Fuchsia
- chrony.conf(5) — chrony 4.3 manual
- Leap Second Smearing — Google for Developers
- Look Before You Leap — AWS blog on leap second and smearing
- Configuring PTP Using ptp4l — Red Hat RHEL 7 System Administrator’s Guide
- IEEE 1588 Precision Time Protocol — NTP.org reference
Research: HPC Parallel Patterns
This note grounds the capOS proposal for generic parallel processing pattern coverage. It is not a request to port full HPC suites immediately. The point is to classify which algorithm shapes capOS benchmarks should eventually cover so future SMP, threading, runtime, storage, network, and multi-node claims do not rest only on embarrassingly parallel worker loops.
Source Set
- The Berkeley “View” report argues that parallel programming systems need multiple styles of parallelism, and uses the dwarf taxonomy to describe important computational kernels rather than one benchmark score: https://people.eecs.berkeley.edu/~krste/papers/BerkeleyView.pdf.
- NASA’s NAS Parallel Benchmarks cover EP, IS, CG, MG, FT, BT, SP, LU, UA, DC, and DT across MPI, OpenMP, serial, and hybrid variants: https://www.nas.nasa.gov/software/npb.html.
- TOP500 describes LINPACK/HPL as dense linear-system evidence and warns that one number cannot describe overall system performance: https://www.top500.org/project/linpack/.
- Netlib’s HPL page identifies HPL as a distributed-memory double-precision dense linear-system implementation: https://netlib.sandia.gov/benchmark/hpl/.
- HPCG complements HPL by exercising sparse matrix-vector multiplication, vector updates, global dot products, Gauss-Seidel smoothing, triangular solve, and multigrid-preconditioned conjugate gradient: https://www.hpcg-benchmark.org/.
- Graph500 covers graph construction, breadth-first search, and shortest-path kernels, with shared-memory, distributed-memory, and external-memory/cloud variants: https://graph500.org/?page_id=12.
- MPI 4.1 collective communication names the standard multi-rank movement and computation patterns: barrier, broadcast, gather, scatter, allgather, all-to-all, allreduce/reduce, reduce-scatter, and scan: https://www.mpi-forum.org/docs/mpi-4.1/mpi41-report/node114.htm.
- OpenMP 5.2 covers node-local loop/task/SIMD/reduction mechanisms, including
tasklooppartitioning loop iterations into explicit tasks and reduction clauses for parallel recurrence calculations: https://www.openmp.org/spec-html/5.2/openmp.html, https://www.openmp.org/spec-html/5.2/openmpse74.html, and https://www.openmp.org/spec-html/5.2/openmpse27.html.
Consequences For capOS
The current capOS CPU-scaling benchmarks are necessary but narrow. They exercise static worker partitioning, final result verification, and a small amount of spawn/join or process-wait coordination. That covers one important HPC pattern: independent tasks with a final reduction. It does not cover:
- structured grids and stencil/halo exchange;
- dense tiled matrix work;
- sparse matrix and irregular memory access;
- FFT/transposes and global all-to-all style communication;
- graph frontier expansion and high-fanout irregular queues;
- task graphs with dependency scheduling and cancellation;
- collectives as first-class operations;
- multi-node communication and authority boundaries.
The benchmark plan should therefore treat “parallel processing” as a matrix of patterns rather than a single scaling demo. A useful capOS coverage target is:
| Pattern family | Source precedent | capOS evidence it should force |
|---|---|---|
| Static map/reduce | OpenMP loop/reduction, NAS EP | low-overhead thread/process creation, result aggregation, no hot-path syscalls |
| Dynamic task graph | OpenMP tasks, Berkeley composition point | work queues, cancellation, dependency fan-in/fan-out, scheduler fairness under uneven tasks |
| Stencil and halo exchange | NAS MG/BT/SP/LU | shared buffers, neighbor exchange, barriers, cache locality, future network transport |
| Dense tiled linear algebra | HPL/LINPACK | compute locality, tile scheduling, reductions, optional SIMD/library runtime support |
| Sparse iterative solver | HPCG, NAS CG | irregular memory access, sparse matrix-vector work, global dot-product reductions |
| FFT/transpose | NAS FT | all-to-all movement, temporary buffers, memory pressure, future multi-node transpose |
| Sort/partition | NAS IS | all-to-all buckets, prefix/scan, allocator and queue pressure |
| Graph frontier | Graph500 | irregular frontier queues, atomic-like visited updates, high fanout, load imbalance |
| Collective communication | MPI collectives | barrier, broadcast, scatter/gather, reduce/allreduce, all-to-all semantics |
| Pipeline/stream | Berkeley composition point, future service graphs | bounded queues, backpressure, stage-local authority, telemetry |
The near-term capOS subset should stay CPU-only and single-node until the selected in-process threading milestone is closed. The first expansion should add pattern kernels that reuse existing userspace/runtime mechanisms, then let future networking and storage milestones add multi-node and data-intensive variants.
Cap’n Proto Error Handling: Research Notes
Research on how Cap’n Proto handles errors at the protocol, schema, and Rust crate levels. Used as input for the capOS error handling proposal.
1. Protocol-Level Exception Model (rpc.capnp)
The Cap’n Proto RPC protocol defines an Exception struct used in three
positions: Message.abort, Return.exception, and Resolve.exception.
struct Exception {
reason @0 :Text;
type @3 :Type;
enum Type {
failed @0; # deterministic bug/invalid input; retrying won't help
overloaded @1; # temporary lack of resources; retry with backoff
disconnected @2; # connection to necessary capability was lost
unimplemented @3; # server doesn't implement the method
}
obsoleteIsCallersFault @1 :Bool;
obsoleteDurability @2 :UInt16;
trace @4 :Text; # stack trace from the remote server
}
The four exception types describe client response strategy, not error semantics:
| Type | Client response |
|---|---|
failed | Log and propagate. Don’t retry. |
overloaded | Retry with exponential backoff. |
disconnected | Re-establish connection, retry. |
unimplemented | Fall back to alternative methods. |
2. Rust capnp Crate (v0.25.x)
Core error types
#![allow(unused)]
fn main() {
pub type Result<T> = ::core::result::Result<T, Error>;
#[derive(Debug, Clone)]
pub struct Error {
pub kind: ErrorKind,
pub extra: String, // human-readable description (requires `alloc`)
}
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[non_exhaustive]
pub enum ErrorKind {
// Four RPC-mapped kinds (match Exception.Type)
Failed,
Overloaded,
Disconnected,
Unimplemented,
// Wire format validation errors (~40 more variants)
BufferNotLargeEnough,
EmptyBuffer,
MessageContainsOutOfBoundsPointer,
MessageIsTooDeeplyNested,
ReadLimitExceeded,
TextContainsNonUtf8Data(core::str::Utf8Error),
// ... etc
}
}
Constructor functions: Error::failed(s), Error::overloaded(s),
Error::disconnected(s), Error::unimplemented(s).
The NotInSchema(u16) type handles unknown enum values or union
discriminants.
std::io::Error mapping
When std feature is enabled, From<std::io::Error> maps:
TimedOut->OverloadedBrokenPipe/ConnectionRefused/ConnectionReset/ConnectionAborted/NotConnected->DisconnectedUnexpectedEof->PrematureEndOfFile- Everything else ->
Failed
3. capnp-rpc Rust Crate Error Mapping
Bidirectional conversion between wire Exception and capnp::Error:
Sending (Error -> Exception):
#![allow(unused)]
fn main() {
fn from_error(error: &Error, mut builder: exception::Builder) {
let typ = match error.kind {
ErrorKind::Failed => exception::Type::Failed,
ErrorKind::Overloaded => exception::Type::Overloaded,
ErrorKind::Disconnected => exception::Type::Disconnected,
ErrorKind::Unimplemented => exception::Type::Unimplemented,
_ => exception::Type::Failed, // all validation errors -> Failed
};
builder.set_type(typ);
builder.set_reason(&error.extra);
}
}
Receiving (Exception -> Error):
Maps exception::Type back to ErrorKind, preserving the reason string.
Server traits return Promise<(), capnp::Error>. Client gets
Promise<Response<Results>, capnp::Error>.
4. Cap’n Proto Error Handling Philosophy
From KJ library documentation and Kenton Varda:
“KJ exceptions are meant to express unrecoverable problems or logistical problems orthogonal to the API semantics; they are NOT intended to be used as part of your API semantics.”
“In the Cap’n Proto world, ‘checked exceptions’ (where an interface explicitly defines the exceptions it throws) do NOT make sense.”
Exceptions: infrastructure failures (network down, bug, overload). Application errors: should be modeled in the schema return types.
5. Schema Design Patterns for Application Errors
Generic Result pattern
struct Error {
code @0 :UInt16;
message @1 :Text;
}
struct Result(Ok) {
union {
ok @0 :Ok;
err @1 :Error;
}
}
interface MyService {
doThing @0 (input :Text) -> (result :Result(Text));
}
Constraint: generic type parameters bind only to pointer types (Text,
Data, structs, lists, interfaces), not primitives (UInt32, Bool). So
Result(UInt64) doesn’t work – need a wrapper struct.
Per-method result unions
interface FileSystem {
open @0 (path :Text) -> (result :OpenResult);
}
struct OpenResult {
union {
file @0 :File;
notFound @1 :Void;
permissionDenied @2 :Void;
error @3 :Text;
}
}
Unions must be embedded in structs (no free-standing unions). This allows adding new fields later without breaking compatibility.
6. How Other Cap’n Proto Systems Handle Errors
Sandstorm
Uses the exception mechanism for infrastructure errors. Capabilities report
errors through disconnection. The grain.capnp schema does not define
explicit error types. util.capnp documents errors as “It will throw an
exception if any error occurs.”
Cloudflare Workers (workerd)
Uses Cap’n Proto for internal RPC. JavaScript Error.message and
Error.name are preserved across RPC; stack traces and custom properties
are stripped. Does not model errors in capnp schema – relies on exception
propagation.
OCapN (Open Capability Network)
Adopted the same four-kind exception model for cross-system compatibility. Diagnostic information is non-normative. Security concern: exception objects may leak sensitive information (stack traces, paths) at CapTP boundaries.
Kenton Varda expressed reservations about unimplemented (ambiguity about
whether the direct method or callees failed) and disconnected (requires
catching at specific stack frames for meaningful retry).
7. Relevance to capOS
capOS uses the capnp crate but not capnp-rpc. Manual dispatch goes through
CapObject::call() with caller-provided params/result buffers. Current error
handling:
capnp::Error::failed()for semantic errorscapnp::Error::unimplemented()for unknown methods?for deserialization errors (naturally producecapnp::Error)- Transport errors become negative CQE result codes (
CAP_ERR_INVALID_REQUEST,CAP_ERR_INVALID_PARAMS_BUFFER,CAP_ERR_INVALID_RESULT_BUFFER,CAP_ERR_INVOKE_FAILED,CAP_ERR_UNSUPPORTED_OPCODE,CAP_ERR_TRANSFER_NOT_SUPPORTED,CAP_ERR_TRANSFER_ABORTED, etc.). - Kernel-produced
CapExceptionvalues are serialized into result buffers for capability-level failures (CAP_ERR_APPLICATION_EXCEPTION) and decoded bycapos-rt. If the result buffer is too small to hold the serializedCapException, the CQE result isCAP_ERR_APPLICATION_EXCEPTION_TRUNCATEDinstead. The per-processringScratchLimitBytesmanifest field bounds the kernel-side scratch allocation and makes this truncated path reachable for tightly constrained process profiles.
capOS extends the standard four-kind ExceptionType with a fifth variant,
invalidArgument, for capability-level argument validation failures. This
fifth kind has no capnp-rpc equivalent; it maps to Failed when converting
back to capnp::ErrorKind for logging.
The normative schema-author rule now lives in
Error Handling: CQE status is for
ring/transport/kernel dispatch failure, CapException is for
capability-level infrastructure failure, and schema result unions are for normal
application/domain outcomes.
The capnp::Error type carries the information needed for CapException:
kind maps to ExceptionType, and extra maps to message.
Sources
- Cap’n Proto RPC Protocol: https://capnproto.org/rpc.html
- Cap’n Proto C++ RPC: https://capnproto.org/cxxrpc.html
- Cap’n Proto Schema Language: https://capnproto.org/language.html
- Cap’n Proto FAQ: https://capnproto.org/faq.html
- KJ exception.h: https://github.com/capnproto/capnproto/blob/master/c%2B%2B/src/kj/exception.h
- rpc.capnp schema: https://github.com/capnproto/capnproto/blob/master/c%2B%2B/src/capnp/rpc.capnp
- OCapN error handling discussion: https://github.com/ocapn/ocapn/issues/10
- Cap’n Proto usage patterns: https://github.com/capnproto/capnproto/discussions/1849
- capnp-rpc Rust crate: https://crates.io/crates/capnp-rpc
- Cloudflare Workers RPC errors: https://developers.cloudflare.com/workers/runtime-apis/rpc/error-handling/
- Sandstorm util.capnp: https://docs.rs/crate/sandstorm/0.0.5/source/schema/util.capnp
Research: Cloudflare, Cap’n Proto, Workers RPC, and Cap’n Web
This file summarizes Kenton Varda’s Cloudflare work and Cloudflare’s Cap’n Proto-derived RPC stack, with capOS design consequences.
capOS alignment note (2026-05-16): capOS currently uses capnp v0.25 for serialization only (wire format, no capnp-rpc). The capOS kernel is planned to become a capnp-rpc router (Design Principle 5), but capnp-rpc is not yet in use. Implications in this file that reference “remote capability proofs” or “typed Cap’n Proto RPC” describe planned/future work, not current state.
Executive Summary
Cloudflare is the most important modern production context for Cap’n Proto. Kenton Varda, the creator of Cap’n Proto, is the lead engineer for Cloudflare Workers, and Cloudflare’s Workers team is now the primary maintainer of the main C++ Cap’n Proto/KJ implementation. Cloudflare uses Cap’n Proto/KJ in the Workers runtime, Durable Objects, sandbox/supervisor and cross-machine communication, internal service bindings, and Workers RPC. Cap’n Web is a separate JavaScript-native sibling protocol inspired by Cap’n Proto rather than a Cap’n Proto/KJ-based runtime system.
The main capOS lessons are:
- Typed Cap’n Proto RPC is a practical production bridge for systems written in Go/Rust/C++ and JavaScript, not merely historical prior art.
- Object-capability RPC can be a normal developer-facing API, not just a kernel/protocol mechanism. Workers RPC and Cap’n Web both expose object references, functions, promise pipelining, and capability-style security.
- Production systems distinguish the core runtime from the full security
product.
workerdis open source and capability-shaped, but Cloudflare warns that it is not by itself a complete secure sandbox. - Cap’n Proto RPC remains resource-exhaustion-sensitive. capOS must add its own quota/resource-ledger discipline at every remote-capability boundary.
- Cap’n Web shows a separate web-facing branch of the same design family: schema-free, JSON-based, TypeScript-friendly, HTTP/WebSocket/postMessage transports, but still object-capability RPC with promise pipelining.
Source Map
Primary sources read:
- Kenton Varda’s Cloudflare author archive, used as the inventory of his posts.
- Cap’n Proto FAQ and Cap’n Proto 0.9 release notes.
- Cloudflare blog posts:
- “Durable Objects in Dynamic Workers: Give each AI-generated app its own database”
- “Sandboxing AI agents, 100x faster”
- “Code Mode: the better way to use MCP”
- “Introducing workerd: the Open Source Workers runtime”
- “We’ve added JavaScript-native RPC to Cloudflare Workers”
- “Why Workers environment variables contain live objects”
- “Building Cloudflare on Cloudflare”
- “Cap’n Web: a new RPC system for browsers and web servers”
- “Eliminating Cold Starts 2: shard and conquer”
- “Zero-latency SQLite storage in every Durable Object”
- “Durable Objects: Easy, Fast, Correct – Choose three”
- “Dynamic Process Isolation: Research by Cloudflare and TU Graz”
- “Mitigating Spectre and Other Security Threats: The Cloudflare Workers Security Model”
- “Introducing lua-capnproto: better serialization in Lua”
- Cloudflare developer docs:
- Workers RPC
- Workers RPC visibility/security model
- Durable Objects overview
- Dynamic Workers bindings
Cloudflare and Kenton Varda
The Cap’n Proto FAQ says Cloudflare Workers is led by Kenton Varda and that Workers heavily uses Cap’n Proto. It also says the Cloudflare Workers team is now the primary developer and maintainer of Cap’n Proto’s main C++ implementation.
The Cap’n Proto 0.9 release notes state that Cap’n Proto development had become primarily driven by Cloudflare Workers. At that point Workers had already moved from mostly using KJ to heavily using Cap’n Proto RPC for Durable Objects.
For capOS, Cloudflare is strong evidence that:
- Cap’n Proto RPC is still a living system, not only Sandstorm-era history.
- KJ’s async/runtime design matters because it is deployed in Workers.
- Cap’n Proto’s object-capability RPC model is compatible with large-scale production infrastructure, but only with additional platform hardening and resource controls.
The full author archive also includes posts that are not Cap’n Proto-specific but are relevant to capOS architecture:
- live object bindings as capability-shaped environment entries
- Workers security and Spectre mitigation
- Dynamic Process Isolation research with TU Graz
- Durable Objects as single-threaded colocated compute/storage actors
- Dynamic Workers for fast disposable isolate sandboxes
- Code Mode for having agents write code against typed APIs rather than emit direct tool calls
workerd
workerd is Cloudflare’s open-source JavaScript/Wasm runtime, sharing most of
the code that powers Cloudflare Workers. It is designed as a server runtime for
Workers-compatible applications, local testing, and programmable proxy use
cases.
Cloudflare explicitly warns that workerd alone is not a secure sandbox for
untrusted code. The Workers service adds environment-specific hardening,
including V8 patch automation, risk-profile separation, kernel features, and
resource-limit enforcement. The project is also not an independent governance
surface: Cloudflare Workers priorities drive the repository, and internal
interfaces may churn.
capOS implications:
- Borrow the shape, not the whole product. A capOS userspace JavaScript/Wasm
runtime can learn from
workerd, but must not assumeworkerdalone provides OS-grade isolation. - Treat runtime internals as unstable unless pinned. If capOS embeds or adapts
workerd, the trusted-build-input and upgrade policy must account for churn. - Keep the capOS kernel/resource model as the isolation and quota authority; runtime-level object capabilities are an additional layer.
Live Object Bindings
Cloudflare Workers environment variables are not only strings. Bindings are
live objects scoped to a specific Worker’s env parameter. A Worker cannot
reach a protected service by guessing a URL or global name; it must hold the
binding object. The post explicitly compares this to capability-based security:
the binding designates a resource and confers permission to access it, is not
in a global namespace, and must be invoked explicitly.
The post also notes that current Workers bindings are not a complete capability system because ordinary bindings are not generally passed dynamically between Workers yet, though future dynamic bindings are discussed.
capOS implications:
- This strongly supports capOS’s bootstrap CapSet and broker-issued bundle model: authority should arrive as live objects/caps, not ambient service names plus bearer tokens.
- It also supports treating the environment/capset as an explicit function parameter rather than global state. This preserves composition and testability.
- It reinforces the policy that remote protocol fields, URLs, and names should not become authority by themselves.
Durable Objects
Durable Objects are Cloudflare’s single-threaded stateful actor-like compute units colocated with durable storage. Public posts describe them as an approach where code runs where the data is stored, often in the same thread as embedded SQLite, avoiding network storage latency. Earlier Durable Objects posts focus on race avoidance and correctness: single-threaded object execution makes it natural to keep state in memory while serializing operations that touch the same object.
Dynamic Worker Facets extend this idea to Dynamic Workers: generated code can run in a disposable isolate while using a per-app Durable Object facet with its own isolated SQLite-backed storage.
capOS implications:
- Durable Objects are strong prior art for capOS service objects that combine single-threaded state, colocated storage, and actor-style request handling.
- For paper-scoped persistence, a minimal service-owned store/object proof is closer to this model than to a global filesystem.
- For hosted agents, per-task or per-app isolated storage facets are a useful pattern, but the storage capability must remain broker-issued and revocable.
Workers Security, Spectre, and Dynamic Process Isolation
Kenton’s Workers security posts separate API-level capability design from execution isolation. Workers uses V8 isolates for density, but wraps them with cordons, process isolation for higher-risk cases, Linux sandboxing, supervisor processes, local proxy mediation, V8 patch discipline, and timing/Spectre mitigations. Dynamic Process Isolation research with TU Graz addresses the harder Spectre isolation problem when many tenants share isolate infrastructure.
The “Sandboxing AI agents, 100x faster” post reuses this isolate foundation for Dynamic Workers: fast disposable isolates for AI-generated code, rather than heavyweight containers. The post emphasizes speed and density, but the security claim depends on the broader Workers platform, not a bare runtime library.
capOS implications:
- Capability-shaped APIs are not a substitute for execution isolation. capOS should continue treating page tables/processes, resource ledgers, and future sandboxing as separate security layers.
- If capOS runs AI-generated code, isolate-style fast startup is attractive, but the capOS trust boundary must include side-channel and resource controls.
- For browser/Wasm proposals, Workers is evidence that isolates can scale, but also that Spectre/timing mitigations are first-order design constraints.
Cloudflare’s Cap’n Proto Uses
Cloudflare’s public sources describe several Cap’n Proto uses:
- Workers runtime implementation: Cap’n Proto and KJ are core implementation pieces.
- Sandbox/supervisor and cross-machine/datacenter communication.
- Durable Objects: Cap’n Proto RPC is heavily used for communication in the system.
- Internal Workers: Cloudflare added Cap’n Proto RPC bindings so internal Workers can call services such as Quicksilver, DNS, and DoS-protection systems. Schemas are bundled with the Worker at publication time, and the runtime converts JavaScript data to/from Cap’n Proto.
- Worker sharding/cold-start reduction: cross-instance communication in the Workers runtime uses Cap’n Proto RPC, including capabilities to lazily loaded local Worker instances.
- Older Cloudflare infrastructure: Cloudflare wrote
lua-capnprotoand used Cap’n Proto in logging/analytics pipelines before Kenton joined.
capOS implications:
- A typed Cap’n Proto RPC bridge is a credible first remote-capability proof: Cloudflare uses schema-bundled service calls from JavaScript into internal Go/Rust services.
- Lazy capabilities are useful for cold-start and placement problems. A remote cap may represent a lazily created service, but invocation must still be explicit and resource-accounted.
- The capOS “capability proxy” should be framed as a service with explicit listen/connect authority, schema selection, and resource budgets, not a generic kernel network mode.
Workers JavaScript-Native RPC
Cloudflare Workers RPC lets Workers and Durable Objects communicate by calling methods on JavaScript classes exposed through bindings. It is built on Cap’n Proto but removes schema boilerplate from the JavaScript developer surface.
Key properties:
- Calls are asynchronous regardless of whether the server method was declared async.
- Parameters and return values can include structured-clonable data.
- Functions and objects can be passed by reference; the receiver gets a stub and later calls back to the original location.
- Calls to service bindings often stay in the same thread, reducing local RPC overhead dramatically.
- When calls cross the network, promise pipelining lets dependent calls on a returned object travel in one round trip.
- Security is object-capability based: a side can only invoke objects/functions for which it has received a stub.
capOS implications:
- It is reasonable for capOS to expose developer-friendly language bindings above typed capability transport. The kernel ABI should stay narrow, but userspace runtime APIs can feel like method calls on local objects.
- Promise pipelining is not optional polish for object-style APIs over latency. Cloudflare documents it as the mechanism that prevents API designs from collapsing into coarse ad hoc batch methods.
- A local fast path matters. RPC calls that stay within one scheduler/runtime context should avoid unnecessary network-shaped overhead.
Code Mode and Agents
Kenton’s Code Mode post argues that agents should often write code against a typed API rather than emit raw tool calls directly. The Cloudflare claim is that MCP is useful as an API discovery/connection layer, but complex workflows are better expressed as code that calls a TypeScript API. This reduces token flow through the model when chaining operations and lets normal language tooling carry structure.
capOS implications:
- This supports the capOS hosted-agent direction: present capability-scoped tools as typed APIs and let agents compose them in code under a sandbox, instead of exposing broad stringly tool surfaces directly to the model.
- Approval gates should wrap the capability/API boundary, not be hidden inside prompt text.
- Promise pipelining and object references may reduce tool-call latency, but only after authority and review gates are preserved.
Cap’n Web
Cap’n Web is a 2025 Cloudflare RPC protocol and TypeScript implementation by Kenton Varda and Steve Faulkner. It is explicitly described as a spiritual sibling to Cap’n Proto for the web stack.
Design differences from Cap’n Proto:
- no Cap’n Proto schemas
- JSON plus preprocessing for special values instead of Cap’n Proto binary encoding
- TypeScript-friendly APIs
- HTTP, WebSocket, and
postMessage()transports - small dependency-free browser/server package
Shared design lineage:
- object-capability RPC
- bidirectional calls
- functions and objects passed by reference
- promise pipelining
- capability-based security patterns
- import/export tables for pass-by-reference objects
Cap’n Web also introduces a web-specific .map()-style pipelining feature
that records a restricted non-Turing-complete instruction set derived from
pipelined calls, addressing a GraphQL-like “waterfall” case.
capOS implications:
- Cap’n Web is useful prior art for browser-hosted capOS experiments or web admin clients, not for the kernel ABI.
- Schema-free RPC trades away capOS’s current “schema is permission surface” discipline. It may fit JavaScript/web adapters, but core capOS services should remain typed and schema-governed unless a proposal explicitly accepts the runtime-validation burden.
- HTTP batch mode and broken references after batch completion are useful patterns for paper-scoped network-transparency proofs: short-lived remote caps can have explicit lifetime boundaries.
Security and Resource Warnings
Important warnings from primary sources:
- Cap’n Proto’s serialization layer is intended to be safe against malicious bytes, but the reference implementation has not had a formal security review.
- Cap’n Proto RPC is designed for mutually distrusting parties, but the FAQ warns that it is not robust against resource exhaustion attacks.
- Cap’n Proto does not provide encryption by itself; use an encrypted transport such as TLS.
workerdis not a complete sandbox for malicious code without Cloudflare’s surrounding platform hardening.- Cap’n Web/Workers TypeScript surfaces do not automatically enforce runtime type checks merely because TypeScript types exist.
capOS implications:
- Every remote-capability proposal must include resource ledgers for table entries, queued calls, queued bytes, streams, retries, and live objects.
- The first capOS remote-capability proof should validate failure behavior: disconnect, overload, broken refs, stale refs, and malformed payloads.
- Treat TypeScript or schema-free web adapters as convenience layers that require runtime validation at the trust boundary.
- Encryption/authentication is a transport requirement, not something Cap’n Proto RPC gives for free.
Design Consequences for capOS
- The first external capability proxy should be typed and schema-bundled, closer to Cloudflare’s internal Worker-to-service Cap’n Proto RPC bindings than to full OCapN/CapTP compatibility.
- Developer ergonomics can improve above the transport: object stubs, language-native async calls, and promise pipelining are legitimate runtime APIs.
- Keep the kernel/user ABI and core service contracts schema-first. Cap’n Web is compelling for web-facing clients, but its schema-free design does not replace capOS’s typed authority model.
- Promise pipelining should be designed as a core performance and authority feature, not as an optional batching trick.
- Remote cap lifetimes need explicit scopes. HTTP batch-style broken refs, session-scoped refs, and disconnect-driven broken promises are all useful precedents.
- Resource exhaustion must be solved by capOS, not delegated to Cap’n Proto.
- Runtime isolation remains an OS responsibility. A language runtime can be capability-oriented while still needing kernel/VM/sandbox containment.
Sources
- Cap’n Proto FAQ: https://capnproto.org/faq.html
- Cap’n Proto 0.9 release notes: https://capnproto.org/news/2021-08-14-capnproto-0.9.html
- Kenton Varda author archive: https://blog.cloudflare.com/author/kenton-varda/
- Durable Objects in Dynamic Workers: https://blog.cloudflare.com/durable-object-facets-dynamic-workers/
- Sandboxing AI agents, 100x faster: https://blog.cloudflare.com/dynamic-workers/
- Code Mode: the better way to use MCP: https://blog.cloudflare.com/code-mode/
- Introducing workerd: the Open Source Workers runtime: https://blog.cloudflare.com/workerd-open-source-workers-runtime/
- We’ve added JavaScript-native RPC to Cloudflare Workers: https://blog.cloudflare.com/javascript-native-rpc/
- Why Workers environment variables contain live objects: https://blog.cloudflare.com/workers-environment-live-object-bindings/
- Building Cloudflare on Cloudflare: https://blog.cloudflare.com/building-cloudflare-on-cloudflare/
- Cap’n Web: a new RPC system for browsers and web servers: https://blog.cloudflare.com/capnweb-javascript-rpc-library/
- Eliminating Cold Starts 2: shard and conquer: https://blog.cloudflare.com/eliminating-cold-starts-2-shard-and-conquer/
- Zero-latency SQLite storage in every Durable Object: https://blog.cloudflare.com/sqlite-in-durable-objects/
- Durable Objects: Easy, Fast, Correct – Choose three: https://blog.cloudflare.com/durable-objects-easy-fast-correct-choose-three/
- Dynamic Process Isolation: Research by Cloudflare and TU Graz: https://blog.cloudflare.com/spectre-research-with-tu-graz/
- Mitigating Spectre and Other Security Threats: The Cloudflare Workers Security Model: https://blog.cloudflare.com/mitigating-spectre-and-other-security-threats-the-cloudflare-workers-security-model/
- Introducing lua-capnproto: better serialization in Lua: https://blog.cloudflare.com/introducing-lua-capnproto-better-serialization-in-lua/
- Workers RPC docs: https://developers.cloudflare.com/workers/runtime-apis/rpc/
- Workers RPC visibility and security model: https://developers.cloudflare.com/workers/runtime-apis/rpc/visibility/
- Durable Objects overview: https://developers.cloudflare.com/durable-objects/concepts/what-are-durable-objects/
- Dynamic Workers bindings: https://developers.cloudflare.com/dynamic-workers/usage/bindings/
Research: Spritely, OCapN, and CapTP
Research note last checked 2026-05-16. This file records the related specifications, protocols, and design principles behind Spritely’s OCapN/CapTP work and translates them into capOS design consequences. It intentionally summarizes the specifications rather than copying them; the upstream documents are draft standards and should remain the source of truth.
Executive Summary
Spritely’s most relevant contribution for capOS is not a single library. It is a coherent model for secure distributed object programming:
- Authority is an unforgeable object reference. If a peer was not handed a reference, it cannot use the object.
- Object references can cross a network without turning into global names or ACL checks. References remain local session table entries, generally cheap integer positions between the two peers that share a CapTP session.
- Networking is explicit at the implementation boundary but mostly absent from application object design. A program can pass references and send messages to asynchronous objects without inventing a bespoke protocol for every service.
- Latency is handled by promise pipelining. Dependent messages can be sent to the eventual result of an earlier message before the earlier message settles.
- Resource lifetime is part of the protocol. CapTP includes cooperative distributed garbage collection for exported references and answer promises.
- Third-party handoffs solve the hard case where A gives B a reference to an object hosted by C without making A a permanent proxy.
The design is close enough to capOS’s schema-as-ABI direction to matter: capOS already treats typed Cap’n Proto interfaces as authority boundaries and has reserved ring fields for future promise pipelining. OCapN/CapTP gives a prior-art shape for the next network-transparent capability layer, but the current OCapN documents are still drafts and should not be adopted as frozen wire compatibility commitments.
Source Map
Primary sources read:
- Spritely Institute:
- “What is CapTP, and what does it enable?”
- “Introducing OCapN, interoperable capabilities over the network”
- “The Heart of Spritely: Distributed Objects and Capability Security”
- Spritely Goblins 0.18.0 release notes
- Guile Goblins manual sections for OCapN and CapTP
- OCapN draft specifications:
CapTP Specification.mdModel.mdNetlayers.mdLocators.md- Syrup repository and draft specification material
- Related lineage and implementations:
- Cap’n Proto RPC protocol documentation
- Endo
@endo/ocapndocumentation - E / CapTP lineage as summarized by Spritely, OCapN, and Cap’n Proto docs
The OCapN draft repository HEAD observed during this pass was
18400d8508fb67467da6d659412ae19c27b0cd08. The Syrup repository HEAD observed
was 931fa528b8ddda976febba577fb09ee0726845d4.
Current Status
The old spritelyproject.org site is historical. Active project material is
now under the Spritely Institute site and files.spritely.institute.
OCapN is not yet a final standard. The draft specs explicitly warn that they
are likely to change significantly. Spritely’s 2026-04-21 Goblins 0.18.0
release is a useful data point: it changed OCapN protocol details, removed the
old op:deliver-only operation, renamed GC operations to plural batched forms,
and bumped the protocol version incompatibly with earlier Goblins releases.
For capOS this means:
- Use OCapN/CapTP as design grounding, not as a frozen ABI.
- Avoid promising wire-level OCapN compatibility until a concrete version is selected and a test-suite target exists.
- Keep capOS’s own ring and schema ABI evolution policy independent from OCapN draft churn.
Spritely System Model
Spritely Goblins is a distributed object programming environment. Its core objects are actors. Actors live in vats/actormaps, receive messages, and may evolve by returning replacement behavior rather than mutating global ambient state. The programming model is explicitly object-capability based: references are authority, and authority flows by ordinary reference passing.
Spritely adds several properties that are relevant to OS design:
- Transactional turns. Local synchronous object updates happen in turns that can roll back on failure. This keeps partial state updates from becoming visible after an exception.
- Asynchronous references and promises. The same programming model handles local and remote asynchronous objects.
- Persistence and sleeping actors. Goblins can persist actor state and, in 0.18.0, optionally evict actors from the hot cache while retaining live references that wake them on demand.
- Distributed debugging and time travel. Spritely treats deterministic turns and persistent state as debugging tools, not only durability features.
capOS should not copy Goblins’ language runtime shape into the kernel. The usable lesson is the boundary: keep kernel capability objects small and typed, while allowing userspace runtimes to build richer object, promise, rollback, and persistence semantics above them.
Object-Capability Principles
The Spritely/OCapN material uses classic object-capability principles:
- No ambient authority. Code begins without dangerous authority and gains power only through values it is passed.
- Designation is authorization. The reference both names the object and grants the right to invoke it.
- Attenuation by wrapping. A narrower object can hold a broader object and expose only a smaller method surface or policy-filtered behavior.
- Revocation by indirection. A revoker can sit between holder and target and later stop forwarding.
- Accountability by explicit relationship. Authority flow is visible as graph edges between objects, not hidden inside a global namespace.
- Mutual suspicion. A remote peer is not trusted just because the transport is authenticated; it is treated as a potentially adversarial object holding only the capabilities it has received.
This matches capOS’s existing direction: typed interfaces define permission surfaces, and narrower capabilities are preferable to broad rights bitmasks attached to generic handles.
OCapN Protocol Suite
OCapN is a suite, not just CapTP. The important layers are:
| Layer | Role | capOS relevance |
|---|---|---|
| OCapN Model | Abstract passable value model shared across languages. | Defines which values can cross a capability-network boundary. |
| Syrup | Canonical binary serialization used by current OCapN drafts. | Useful for signed certificates and dynamic interop, but not a replacement for capOS’s Cap’n Proto schema ABI. |
| Locators | Peer and sturdyref identity syntax. | Prior art for durable object references and bootstrap URIs. |
| Netlayers | Transport abstraction for secure ordered channels. | Strong precedent for separating object protocol from TCP/TLS/Tor/libp2p/etc. |
| CapTP | Session protocol for messages, promises, GC, and handoffs. | Directly informs future network-transparent capability invocation. |
| Test suite | Interoperability tests for implementations. | capOS should not claim OCapN compatibility without passing a selected suite/version. |
OCapN Data Model
The model draft defines passable values as atoms, containers, references, and errors.
Atoms:
UndefinedNullBoolean- arbitrary precision signed
Integer - IEEE 754
Float64 - Unicode
String ByteArraySymbol
Containers:
List- unordered string-keyed
Struct Tagged, a tag string plus one value
References:
Target, an object reference that can receive messagesPromise, a pending eventual value that can queue messages
Errors are still unsettled in the model draft. The draft preserves only the
coarse requirement that an error round trip as an error. This is weaker than
capOS’s desired error-layer split, so capOS should keep its local rule:
transport status in CQEs, capability infrastructure failure in CapException,
and domain outcomes in schema result unions.
The model also defines pass invariants. The important one for capOS is that remote passage should preserve type, and for most values preserve a specified equality relation when values leave and later return. Promises and errors are special: promises preserve type but not identity equality, and error semantics are deliberately not settled yet.
Syrup
Syrup is a canonical binary serialization format used by OCapN drafts. It is inspired by canonical s-expressions and bencode. It supports booleans, integers, floats, byte strings, strings, symbols, lists, dictionaries/structs, records, and sets. Its important property for CapTP is canonicalization: unordered collections are emitted in a deterministic order, so serialized bytes can be signed and verified consistently.
This matters most for OCapN handoff certificates. A signed envelope signs the canonical serialized form of a CapTP object, so implementations need byte-stable encoding.
capOS implications:
- Keep using Cap’n Proto for typed capOS ABIs and kernel/userspace messages.
- Treat Syrup as an interop codec for a future OCapN bridge, not as the native kernel ring format.
- If capOS implements OCapN handoffs, canonical serialization becomes part of the trusted boundary. Fuzzing and cross-implementation test vectors would be mandatory.
Locators and Sturdyrefs
OCapN locators represent peers and durable object entry points.
A peer locator contains:
transport: the netlayer namedesignator: usually a key or other netlayer-defined identityhints: optional routing data
Only transport and designator identify the peer for comparison. Hints can
help connection setup but do not define identity.
A sturdyref locator contains:
- a peer locator
- a
swiss-num, a secret-ish object token used to fetch a specific object from that peer’s bootstrap object
URI forms include:
ocapn://<designator>.<transport>
ocapn://<designator>.<transport>/s/<swiss-num>
The draft states that a sturdyref should be treated as a capability: the locator plus swiss number is enough to try to obtain the object reference.
capOS implications:
- A future durable capOS network reference must not be confused with a local
CapId. Local cap slots, generations, receiver selectors, session ids, and kernel object pointers are not portable authority. - If capOS adds sturdyrefs, they belong in a userspace naming/storage authority or broker, not in the kernel cap table.
- Hints must never become security identity. They are routing metadata only.
- Swiss-number strength and storage policy are security-critical; weak or enumerable swiss numbers would become bearer-token vulnerabilities.
Netlayers
OCapN netlayers are the transport interface underneath CapTP. A compliant netlayer provides:
- bidirectional message transmission
- delivery while the session remains active
- in-order receipt
- security against third-party message insertion
Encryption and reachability are desirable and often necessary, but the netlayer draft distinguishes required session integrity from optional transport properties. The Tor Onion netlayer is documented in the draft; Spritely Goblins has historically emphasized Tor, while OCapN discussions also mention TCP/TLS, WebSocket, libp2p, IBC, I2P, Unix sockets, and other transports.
capOS implications:
- Follow the OCapN split: object protocol above transport authority.
- Represent listen/connect authority as explicit capabilities, as capOS already does for narrowed TCP listener authority.
- Bind peer identity to the netlayer’s authenticated designator, not to DNS names, host strings, or untrusted hints.
- Treat reconnect and disconnect as first-class protocol states. All remote capabilities served by a severed session must fail closed or become broken promises.
CapTP Session Establishment
A CapTP session is pairwise. It runs over a reliable ordered netlayer channel. The draft session setup:
- establish a secure channel out of band or as part of a handoff
- create a per-session cryptographic key pair
- exchange
op:start-session - verify the remote session start message
- export a bootstrap object at position
0
The bootstrap object conventionally supports:
fetch, to fetch an object by swiss numberdeposit-gift, for third-party handoffswithdraw-gift, for third-party handoffs
capOS implications:
- A remote session needs an explicit session object with state, cryptographic identity, import/export tables, answer table, handoff table, and disconnect state.
- Bootstrap authority should be narrow. A peer’s bootstrap object is the initial remote authority root and should expose only intended fetch/handoff behavior.
- A future capOS OCapN bridge should make protocol version negotiation and feature gating explicit because upstream OCapN has already changed incompatibly.
CapTP References and Descriptors
CapTP references are represented by descriptors whose integer positions have meaning only within a single session. The key descriptor families are:
desc:import-object: the receiver is importing an object at a positiondesc:import-promise: the receiver is importing a promise at a positiondesc:export: refer to an object/promise already exported by the receiving sidedesc:answer: refer to a promise created by a previous answer positiondesc:sig-envelope: signed wrapper over a canonical serialized CapTP objectdesc:handoff-give: gift certificate from gifter to receiverdesc:handoff-receive: receiver certificate used to redeem a gift
The subtle convention is perspective: descriptors describe references from the receiver’s side of the session. This keeps pairwise table entries small but requires careful implementation.
capOS implications:
- Do not serialize process-local cap ids across a network.
- Network references need a separate table keyed by session-local import/export position and generation or epoch.
- Descriptor direction needs tests. Perspective errors here become authority leaks or denial of service bugs.
CapTP Operations
The current CapTP draft includes operations in these groups:
- session lifecycle:
op:start-session,op:abort - delivery:
op:deliver - promise observation/resolution:
op:listen, promise resolverfulfillandbreakbehavior - promise pipelining and extraction:
op:get,op:index,op:untag - cooperative GC:
op:gc-exports,op:gc-answers - handoff bootstrapping through bootstrap methods:
deposit-gift,withdraw-gift
op:deliver-only should be treated as stale for current research because
Goblins 0.18.0 and the current draft dropped it in favor of op:deliver.
capOS implications:
- capOS’s reserved
pipeline_dep/ answer-id style fields should be evaluated against CapTP’s answer table model. op:get,op:index, andop:untagshow that pipelining is not only “call a method on a promised object”; it can also project a reference out of an eventual container without transmitting irrelevant intermediate values.- Batched GC operations are an important shape for avoiding per-reference chatter.
Promise Pipelining
Promise pipelining is the latency-critical idea shared by E, Cap’n Proto RPC, Agoric/Endo, and OCapN. If a call returns a promise for an object, the caller can immediately send follow-on messages to the promised result. The receiver queues or forwards those messages when the promise resolves.
This preserves object-shaped interfaces in high-latency networks. Without pipelining, developers tend to collapse clean object graphs into singleton services with path strings or ad hoc batching APIs, weakening both design and authority boundaries.
capOS implications:
- Promise pipelining is a Tier-1 paper evidence candidate in
docs/roadmap.md; this research reinforces that priority. - Pipelining should target result-cap/answer namespaces, not caller-selected global ids.
- Broken promises must propagate failure to dependent calls. Silent drops would violate caller expectations and leak resources.
- Pipelined calls must remain bounded by resource ledgers: answer table slots, queued message bytes, queued call count, and per-session memory all need caps.
Distributed Garbage Collection
CapTP uses cooperative distributed GC for references exported across a session. At a high level:
- When a reference is exported, the exporting side keeps it alive on behalf of the importing side.
- The importer tracks how many times it received the reference.
- When the importer no longer needs the reference, it sends batched GC deltas.
- The exporter decrements its per-session reference count and may reclaim once the count reaches zero.
- Answer promises also have explicit
op:gc-answerscleanup so answer positions can be reused.
The Goblins docs call this acyclic distributed GC. Cycles spanning machines are not automatically collected in the deployed Guile Goblins path.
capOS implications:
- Network reference release must be explicit and idempotent under disconnect and retry conditions.
- Reference accounting must have one ledger of record per session. Parallel counters in transport, object proxy, and app layers would be unreviewable.
- A capOS bridge should not rely on distributed cycle collection. Design protocols so remote cycles are either impossible, bounded by lease/session lifetime, or broken by explicit revocation.
- Disconnect should conservatively release exports owned solely by the session and break unresolved imports/promises.
Third-Party Handoffs
Third-party handoffs solve the case where A has a reference to an object hosted by C and sends that reference to B. A should not need to proxy every future call from B to C, and B should not gain arbitrary authority at C. The OCapN draft uses certificate-style gifts.
Roles:
- Gifter: the peer sharing a reference it holds
- Receiver: the peer receiving that reference
- Exporter: the peer hosting/exporting the referenced object
Protocol shape:
- The gifter deposits a gift with the exporter’s bootstrap object.
- The gifter sends the receiver a signed
desc:handoff-give. - The receiver validates what it can, connects to the exporter if needed, and
sends a signed
desc:handoff-receiveto withdraw the gift. - The exporter verifies signatures, session ids, receiver binding, and replay protection, then fulfills the receiver’s promise with the gifted reference.
Security properties to preserve:
- The gift is designated to a specific receiver session identity.
- The exporter must reject invalid signatures or replayed handoff counts.
- The handoff can complete whether deposit or withdrawal arrives first.
- Unauthorized peers that observe messages should not be able to redeem the object reference.
capOS implications:
- Handoffs are the correct precedent for cross-session capability transfer. Avoid proxy-only designs as the permanent architecture.
- A capOS implementation needs persistent in-flight handoff state with bounded memory and expiry.
- The replay counter/nonce table is security-sensitive. It should be scoped by exporter-receiver session and garbage collected with the session.
- Handoff certificates should be opaque to ordinary applications unless a debugging authority is explicitly granted.
Error Propagation
OCapN and CapTP allow promises to break with an error value, but the data model has not converged on a rich normative error structure. The CapTP draft warns that transmitting exception details or backtraces can leak sensitive data.
capOS implications:
- Keep the capOS error-layer split. OCapN errors should map into
CapExceptionor schema-level results only through a deliberate adapter. - Strip or seal debug details at network boundaries by default.
- Treat remote error text as untrusted input. It is diagnostic material, not an authority decision input.
Security Risks and Failure Modes
Important risks found in the source material:
- Spec churn. OCapN is draft/pre-standardization and has changed incompatibly.
- Resource exhaustion. Goblins docs state that CapTP does not solve memory usage or resource management by itself.
- Acyclic-only GC. Cycles between servers are not automatically reclaimed in current Goblins’ practical model.
- Peer-wide trust boundary. Even if CapTP routes to specific objects, a malicious remote peer can collude internally. Treat the peer as a single adversarial object with the authority surface of all references it holds.
- Signing-oracle bugs. Goblins 0.18.0 fixed a signing oracle vulnerability in a WebSocket netlayer designator-authentication path. This is a concrete reminder that handoff/netlayer signing APIs need strict domain separation.
- Debug info leakage. Broken promises or exceptions can accidentally expose paths, stack traces, or internal object topology.
- Replay and stale-reference bugs. Handoff counts, session ids, export positions, and answer positions require generation/reuse discipline.
capOS mitigations:
- Version every network protocol boundary.
- Bound every per-session table and queue with resource ledgers.
- Domain-separate all signatures by protocol label, session id, role, and operation kind.
- Fuzz canonical codec parsing and descriptor validation.
- Add negative tests for stale answer positions, stale export positions, replayed handoffs, mismatched receiver keys, malformed locators, and disconnect during handoff.
Relationship to Cap’n Proto
Cap’n Proto RPC is a close relative rather than the same protocol:
- It is schema-first and statically typed.
- Interface references are first-class capabilities.
- Promise pipelining is central.
- Persistent capabilities and three-way interactions are defined as higher protocol levels.
- Cap’n Proto RPC deliberately does not make remote calls look like local blocking calls; the API exposes promises and network failure.
capOS already uses Cap’n Proto for schemas and serialization, but not full
capnp-rpc. OCapN’s dynamic model is useful for language-agnostic distributed
objects; Cap’n Proto remains the better fit for capOS’s typed ABI and generated
interface surface.
The practical direction for capOS:
- Keep local kernel/userspace ABI fixed-layout where needed and Cap’n Proto schema-shaped at service boundaries.
- Learn from OCapN’s session, handoff, locator, and GC machinery.
- Do not replace typed schemas with untyped dynamic symbols unless building an explicit OCapN bridge.
Relationship to Agoric and Endo
Agoric and Endo continue the E-language object-capability lineage in hardened
JavaScript. Endo’s @endo/ocapn docs describe a tentative OCapN implementation
with layers for client/session management, CapTP dispatch and slot management,
codecs, and netlayers. The package is explicitly a work in progress and treats
OCapN as a moving target.
This independently validates the same architectural split:
- object/capability semantics
- session/slot management
- canonical codec
- netlayer abstraction
- higher-level client API for sturdyrefs and handoffs
For capOS, that split is more important than JavaScript-specific APIs.
CapOS Design Consequences
- Keep
CapIdlocal. Never serialize local cap table ids, endpoint generations, receiver selectors, or kernel session ids as portable network authority. - Treat remote references as session-local imports/exports with explicit generation/reuse rules.
- Put sturdyrefs and durable fetch authority in userspace naming/storage services, not in the kernel cap table.
- Keep network transport authority separate from object authority. A process may hold permission to listen/connect without holding permission to fetch a particular remote object, and vice versa.
- Implement promise pipelining through answer/result-cap namespaces. Avoid path-string singleton APIs created only to hide latency.
- Bound all per-session state: exports, imports, answers, queued pipelined deliveries, handoff gifts, handoff replay counters, incoming message bytes, and pending reconnects.
- Make disconnect semantics explicit. Remote refs become disconnected/broken, not silently retrying with ambient authority.
- Strip or seal diagnostic errors crossing a remote boundary.
- Use canonical serialization only where signatures require it. Do not move the kernel ring to Syrup.
- Defer OCapN compatibility claims until capOS targets a specific draft, version negotiation, and test suite.
Open Questions for capOS
- Should capOS expose an OCapN bridge as a userspace service that maps OCapN targets to local typed Cap’n Proto capabilities, or should it first implement a Cap’n Proto RPC bridge for typed external clients?
- What is the narrowest promise-pipelining proof that advances the paper track: local ring answer pipelining, capnp-rpc-compatible pipelining, or OCapN-like answer descriptors?
- How should capOS represent durable remote authority: opaque broker-held sturdyrefs, sealed persistent capabilities, or storage-service entries that mint live session refs on demand?
- Which cryptographic identity should a capOS netlayer use first: TLS certificates, Noise static keys, Tor Onion service ids, or a local test-only key?
- How much of OCapN’s dynamic value model should be admitted at capOS service boundaries, given the existing schema-first security posture?
Recommended Near-Term Use
For current capOS work, this research should be used as grounding for:
- promise pipelining design
- network-transparent capability proxy experiments
- Cap’n Proto RPC interop work
- durable naming/sturdyref design
- remote capability release and disconnect semantics
- third-party capability handoff designs
It should not yet be used to require OCapN wire compatibility for existing capOS demos or to replace the typed Cap’n Proto service model.
Sources
- Spritely Institute: https://spritely.institute/
- What is CapTP, and what does it enable?: https://spritely.institute/news/what-is-captp.html
- Introducing OCapN, interoperable capabilities over the network: https://spritely.institute/news/introducing-ocapn-interoperable-capabilities-over-the-network.html
- Spritely Goblins v0.18.0 release notes: https://spritely.institute/news/spritely-goblins-v0-18-0-sleepy-actors.html
- The Heart of Spritely: Distributed Objects and Capability Security: https://files.spritely.institute/papers/spritely-core.html
- Guile Goblins CapTP manual: https://files.spritely.institute/docs/guile-goblins/0.17.0/CapTP-The-Capability-Transport-Protocol.html
- Guile Goblins OCapN manual: https://files.spritely.institute/docs/guile-goblins/0.16.1/OCapN.html
- OCapN draft specifications: https://github.com/ocapn/ocapn/tree/main/draft-specifications
- CapTP draft specification: https://github.com/ocapn/ocapn/blob/main/draft-specifications/CapTP%20Specification.md
- OCapN model draft: https://github.com/ocapn/ocapn/blob/main/draft-specifications/Model.md
- OCapN netlayers draft: https://github.com/ocapn/ocapn/blob/main/draft-specifications/Netlayers.md
- OCapN locators draft: https://github.com/ocapn/ocapn/blob/main/draft-specifications/Locators.md
- Syrup repository: https://github.com/ocapn/syrup
- Cap’n Proto RPC protocol: https://capnproto.org/rpc.html
- Endo
@endo/ocapn: https://docs.endojs.org/modules/_endo_ocapn.html
Research: Browser Engines, Document Engines, and Agent Browsers
Survey of mainstream browser engines, embedding paths, automation protocols, and Donut Browser-style profile orchestration for Browser Capability and Agent Web Sessions.
Source Snapshot
Checked on 2026-04-30:
- Chromium Ozone overview: https://chromium.googlesource.com/chromium/src/+/main/docs/ozone_overview.md
- Chromium Embedded Framework: https://github.com/chromiumembedded/cef
- Microsoft Edge WebView2: https://developer.microsoft.com/en-us/microsoft-edge/webview2
- WebKit ports and WPE WebKit: https://docs.webkit.org/Ports/Introduction.html, https://webkit.org/wpe/
- Mozilla Gecko and GeckoView: https://firefox-source-docs.mozilla.org/overview/gecko.html, https://firefox-source-docs.mozilla.org/mobile/android/geckoview/contributor/geckoview-architecture.html
- Servo: https://servo.org/
- Ladybird: https://ladybird.org/, https://github.com/LadybirdBrowser/ladybird
- SpiderMonkey: https://spidermonkey.dev/, https://firefox-source-docs.mozilla.org/js/
- Boa: https://github.com/boa-dev/boa
- JavaScriptCore: https://docs.webkit.org/Deep%20Dive/JSC/JavaScriptCore.html
- QuickJS: https://www.bellard.org/quickjs/
- Chrome DevTools Protocol: https://chromedevtools.github.io/devtools-protocol/
- WebDriver BiDi: https://www.w3.org/TR/webdriver-bidi/
- Playwright browser support: https://playwright.dev/docs/browsers
- Model Context Protocol, version 2025-11-25 (latest as of this snapshot): https://modelcontextprotocol.io/specification/2025-11-25/
- Donut Browser: https://github.com/zhom/donutbrowser, https://donutbrowser.com/mission/, https://donutbrowser.com/use-cases/automation/
Design Consequences For capOS
- Do not make a browser engine a near-term kernel or GUI prerequisite. Modern browser engines assume a large userspace substrate: processes, threads, shared memory, timers, files, DNS, sockets/TLS, fonts, image codecs, GPU or software compositing, profile storage, crash handling, and a sandbox.
- Split browser work into three tracks: agent/shell browser sessions first, a cap-native document engine as the middle target, then visual browser after GUI. The first track can start as a capability wrapper around an external or hosted engine. The middle track validates cap-backed web host APIs over provided document data. The visual-browser track needs compositor, input, fonts, storage, networking, and userspace-driver safety.
- Treat browser profiles as capability objects. Cookies, local storage,
cache, permissions, proxy selection, downloads, and automation endpoints
should be held by
BrowserProfile/BrowserContextcaps, not ambient files under a hidden profile directory. - Standardize the agent-facing surface above CDP/WebDriver BiDi, not below it.
CDP is powerful and Chromium-specific; WebDriver BiDi is standardizing
bidirectional browser automation. capOS should expose a typed, narrowed
BrowserSessioncapability and use CDP/BiDi/Playwright only as backends. - Borrow Donut Browser’s useful product ideas – profile isolation, local API, persistent sessions, per-profile proxy/VPN selection, MCP integration, and AI-control hooks – without adopting anti-detection as a capOS goal. Fingerprint, geolocation, locale, proxy, and user-agent choices must be explicit, auditable policy, not stealth defaults.
- Reuse the project rule “the interface is the permission.” A process with
BrowserNavigatecan navigate; a process withBrowserReadPagecan inspect page state; a process withBrowserInputcan click/type; a process withBrowserDownloadand a grantedDownloadSinkcan receive downloaded bytes. Bundling all of those into one raw DevTools port would recreate ambient authority. - Treat a browser as a shell capability, not as the shell. The native shell or agent runner may hold a browser session and use it as a tool, but browser JavaScript must not directly hold the shell’s file, launch, network, or approval capabilities.
- Add a middle track for a cap-native document engine: JS, DOM/CSS, layout, rendering, and perhaps WebAssembly over caller-provided document/resource data, with web host APIs backed by explicit capOS capabilities. This is not full internet browsing, but it could power local HTML/CSS/JS apps and test the browser authority model earlier.
Engine Portability Surface
Chromium / Blink
Chromium has the broadest web compatibility and the strongest automation ecosystem. Ozone is the relevant porting layer: it centralizes low-level input and graphics behind platform interfaces, supports runtime platform binding, and expects new platforms to implement an Ozone backend. CEF is the production embedding path for many native applications: it wraps Chromium/Blink behind stable APIs, binary distributions, and release branches tracking Chromium. WebView2 is Microsoft’s Windows embedding product around Edge/Chromium, with evergreen and fixed-version runtime choices.
capOS implications:
- Best near-term backend for agent/shell usage is an external Chromium family process controlled through CDP, WebDriver BiDi, or Playwright, with capOS wrapping the endpoint as typed caps.
- A native capOS Chromium port is a very large post-GUI project. The likely port boundary is Ozone plus a capOS sandbox/profile/network/storage backend, not direct Blink surgery.
- CDP must not be directly handed to ordinary capOS workloads. It exposes navigation, DOM, network, runtime, storage, input, tracing, and debugging authority in one endpoint and has no stable backward-compatibility guarantee for tip-of-tree protocol use.
WebKit / WPE
WebKit’s upstream port model makes ports first-class maintainable units. WebKitGTK and WPE are maintained by Igalia; WPE is specifically designed as a small-footprint embedded WebKit port with a backend architecture, hardware acceleration, GStreamer media, and periodic releases.
capOS implications:
- WPE is the most plausible visual-browser candidate once capOS has a GUI substrate because it is meant for embedded systems without a full desktop toolkit.
- WPE still needs a platform backend, graphics/EGL or software fallback, input, fonts, networking/TLS, storage, media dependencies, and an update story. It is not an early shell feature.
- WebKit’s port/release discipline is useful precedent for a capOS browser backend: keep platform-specific code narrow and upstreamable where possible.
Gecko / GeckoView
Gecko is Firefox’s full web platform: JavaScript, layout, graphics, media,
networking, profiles, preferences, principals, and more. GeckoView is Mozilla’s
Android embedding library and powers active Mozilla Android browsers. Its API
separates GeckoRuntime, GeckoSession, and GeckoView, delegates storage and
UI behavior to embedders, and hides internal principals from the public API.
capOS implications:
- Gecko is credible as an external backend, especially for browser diversity and WebDriver BiDi, but GeckoView itself is Android-specific and not a desktop/no-OS embedding path for capOS.
- Gecko’s principal model is important precedent: origin/security context is a first-class internal object. capOS should make origin/session policy explicit in its browser capability layer rather than flattening it to URLs.
- The runtime/session/view split maps cleanly to capOS capabilities: engine/service supervision, per-profile context, and visual surface should be separate authorities.
Servo
Servo is a Rust browser engine with WebView embedding ambitions, WebGL/WebGPU support, modular architecture, parallel layout, and active cross-platform work. It is not yet a mainstream compatibility replacement for Chromium/WebKit/Gecko, but it is closer to capOS’s implementation culture than the large C++ engines.
capOS implications:
- Servo is the best research-aligned engine to track for a future native capOS engine experiment because Rust and modular embedding fit capOS better than direct Chromium/Gecko ports.
- It is not the first user-facing browser choice if the goal is broad web compatibility for operators or agents.
- Servo’s WebView API and crate decomposition are worth watching for a
possible
BrowserView/BrowserSessionbackend once capOS has GUI and ordinary userspace dependencies.
Ladybird / LibWeb
Ladybird is building an independent browser engine from scratch, with an alpha target for Linux and macOS in 2026. It uses a multi-process architecture and is focused on standards rather than embedding today. It is valuable prior art for independent engine architecture and process separation, not a near-term capOS dependency.
capOS implications:
- Track Ladybird for architecture ideas: isolated renderer processes, separate network and image-decoder processes, and specification-driven development.
- Do not depend on Ladybird for capOS’s browser plan until its API, platform support, and compatibility stabilize.
- Its “no inherited engine” posture is inspirational but not pragmatic for capOS near-term. capOS should expose capability-native browser APIs while reusing maintained engines underneath.
Cap-Native Document Engine Substrate
A cap-native document engine is a smaller target than a full browser. It executes a document graph supplied by capOS – for example a boot package, Store object, generated UI bundle, or test fixture – and returns a rendered surface, screenshot, event stream, and bounded DOM/accessibility snapshot. Networking, storage, permissions, clipboard, downloads, and device access are not internal browser privileges; they are host bindings backed by separate capabilities.
This track changes the portability question. Instead of asking “which browser can capOS port?”, it asks “which engine pieces can run with capOS as the host environment?”
Servo As A Document Engine
Servo is the closest architectural fit for this middle track. It is Rust,
embeddable, modular, parallel, and already presents itself as a WebView-capable
engine. The value for capOS is not only memory safety. It is the possibility
of treating the embedding API as the boundary where fetch, storage,
permission prompts, surfaces, and resource loading are backed by typed caps.
Risks:
- Servo still brings a large standards surface.
- API stability and completeness must be checked at implementation time.
- A WebView embedding API is not the same as a small deterministic document-rendering library; capOS may still need substantial host glue.
Ladybird / LibWeb As A Document Engine
Ladybird’s LibWeb/LibJS stack is attractive as readable independent-engine prior art. Its multi-process browser architecture also maps well to capOS service decomposition. However, Ladybird is focused on building a full browser, not on providing a stable embeddable document engine for external hosts.
capOS should track it for design ideas and perhaps future experiments, but should not treat it as the near-term substrate for local HTML/CSS/JS apps.
SpiderMonkey
SpiderMonkey is Mozilla’s JavaScript and WebAssembly engine, used by Firefox and Servo, and can be embedded in C++ and Rust projects. It is useful if capOS wants a serious JS/Wasm runtime while building DOM/layout/rendering and host bindings separately or while experimenting with Servo components.
The tradeoff is that SpiderMonkey is only the JS/Wasm engine. DOM, CSS, layout, rendering, networking, storage, event loops, Web APIs, and browser security objects remain host responsibilities unless capOS embeds a larger engine.
JavaScriptCore
JavaScriptCore is WebKit’s ECMAScript engine and an optimizing VM with interpreter and JIT tiers. It is a mature engine, but its natural home is inside WebKit. For capOS, JavaScriptCore is most relevant if the visual-browser track chooses WPE/WebKit; it is less obviously attractive as a standalone cap-native document-engine substrate than Servo or a Rust-native JS engine.
Boa
Boa is an embeddable JavaScript engine written in Rust, with actively maintained crates and a focus on ECMAScript conformance. It is attractive for capOS experiments because it is Rust, smaller than the mainstream browser JS engines, and easier to embed in native services.
The tradeoff is compatibility and performance. Boa is a plausible substrate for trusted/local UI scripting or early host-binding proofs, not a replacement for the JS engine in a general web browser.
QuickJS
QuickJS is a small embeddable JavaScript engine. It is useful as a reference for tiny host-controlled JS runtimes and deterministic local scripting. It is not a DOM/layout/rendering engine and should not be mistaken for browser compatibility.
Consequences
- A cap-native document engine should start with local/trusted bundles, not arbitrary internet pages.
- The host API contract matters more than the JS engine choice.
fetch, storage, clipboard, downloads, timers, workers, and Wasm imports must all be explicit cap-backed facets. - The first proof can be intentionally small: render a packaged HTML/CSS/JS dashboard or demo UI, capture a screenshot and accessibility/DOM snapshot, and prove that missing network/storage/download caps fail closed.
- Full browser compatibility remains a later engine-port problem. This track buys capOS-native web UI and authority-model validation, not Chrome parity.
Automation And Agent Protocols
CDP
Chrome DevTools Protocol can instrument, inspect, debug, profile, capture screenshots, manipulate DOM/runtime/network state, and control browser targets. It is excellent as a backend and dangerous as a user-facing authority surface. The tip-of-tree protocol changes frequently and is not compatibility-stable.
capOS implication: a CDP endpoint is equivalent to a broad browser-admin cap. Only a trusted browser service should hold it. Ordinary agents receive narrowed typed operations.
WebDriver BiDi
WebDriver BiDi is a W3C Working Draft for bidirectional remote control of user agents. It introduces event streaming over WebSocket and includes modules for browser contexts, browsing contexts, emulation, network, script, and input.
capOS implication: BiDi is a better standards-shaped backend contract than raw CDP for cross-engine automation, but it still exposes more authority than most capOS workloads should receive directly.
Playwright
Playwright operates across Chromium, WebKit, and Firefox and manages specific browser versions for each Playwright release. It is practical as an early host-side harness or browser-service backend while capOS lacks native browser engine support.
capOS implication: use Playwright for development and host-side proof harnesses,
but keep it out of the capOS ABI. The capOS ABI should be the typed
BrowserSession/BrowserProfile capability surface.
MCP Browser Tools
MCP standardizes how LLM applications connect to external tools, resources, and prompts, with explicit consent and tool-safety guidance. Browser tools are already becoming a common MCP shape: navigate, snapshot, click, type, screenshot, download, and inspect network state.
capOS implication: the browser capability can export an MCP adapter for external agents, but MCP is only an adapter. It must not smuggle raw browser, network, file, or shell authority around the capOS broker.
Donut Browser Lessons
Donut Browser is an open-source anti-detect browser application with a Tauri Rust/TypeScript codebase, AGPL app licensing, per-profile isolation, local REST API, MCP server, proxy/VPN controls, persistent sessions, sync, and engine choice through Wayfern (Chromium-based) and Camoufox (Firefox-based). Its own mission page states that the app is open source while the browser-engine anti-detection components have a mixed proprietary/open-source model.
Useful to adapt:
- Profile manager as the primary product object.
- Per-profile cookies, storage, extensions, fingerprint settings, proxy/VPN, and persistent session state.
- Local API and MCP server as automation surfaces.
- Ability to launch a profile and attach Playwright/Puppeteer/Selenium through a backend automation endpoint.
- Default-browser routing where each link chooses a profile/context.
Not adopted:
- Anti-detection as a default product promise.
- Closed fingerprint-spoofing logic as a security dependency.
- Treating “looks like a real device” as a capOS correctness goal.
- Exposing a broad local browser-control API without capability-scoped grants.
capOS replacement framing:
BrowserPersonais explicit policy: user agent, viewport, locale, timezone, geolocation, WebRTC exposure, proxy, and storage partition.BrowserProfileholds state and can be cloned, snapshotted, exported, or destroyed through typed caps.BrowserAutomationis split by operation class, not by one admin token.- Audits record profile, persona, network route, downloads, uploads, and whether a human or agent initiated each action.
Open Research Gaps
- Which backend should be the first in-capOS visual engine candidate: WPE or Servo?
- Which substrate should be tried first for a cap-native document engine: Servo WebView components, Ladybird/LibWeb experimentation, SpiderMonkey with a custom DOM, Boa for trusted local UI scripting, or QuickJS for tiny proofs?
- How much of a browser profile should be persistent Store state versus revocable in-memory session state?
- What is the smallest useful DOM/screenshot/accessibility snapshot for an LLM tool that avoids dumping excessive page data into model context?
- How should downloads and uploads preserve provenance and consent across browser, shell, and storage caps?
- Can WebDriver BiDi become the only external automation backend, or is CDP unavoidable for practical Chromium compatibility?
OS Error Handling in Capability Systems: Research Notes
Research on error handling patterns in capability-based and microkernel operating systems. Used as input for the capOS error handling proposal.
1. seL4
Error Codes
seL4 defines 11 kernel error codes in errors.h:
typedef enum {
seL4_NoError = 0,
seL4_InvalidArgument = 1,
seL4_InvalidCapability = 2,
seL4_IllegalOperation = 3,
seL4_RangeError = 4,
seL4_AlignmentError = 5,
seL4_FailedLookup = 6,
seL4_TruncatedMessage = 7,
seL4_DeleteFirst = 8,
seL4_RevokeFirst = 9,
seL4_NotEnoughMemory = 10,
} seL4_Error;
Error Return Mechanism
- Capability invocations (kernel object operations) return
seL4_Errordirectly. - IPC messages use
seL4_MessageInfo_twithlabel,length,extraCaps,capsUnwrapped. Thelabelis copied unmodified – kernel doesn’t interpret it. - MR0 (Message Register 0) carries return codes for kernel object invocations
via
seL4_Call.
Error Propagation
Fault handler mechanism: each TCB has a fault endpoint capability. On fault (capability fault, VM fault, etc.):
- Kernel blocks the faulting thread.
- Kernel sends an IPC to the fault endpoint with fault-type-specific fields.
- Fault handler (separate process) receives, fixes, and replies.
- Kernel resumes the faulting thread.
Design Choices
seL4_NBSendon invalid capability: silently fails (prevents covert channels).seL4_Send/seL4_Callon invalid capability: returnsseL4_FailedLookup.- No application-level error convention – user servers choose their own protocol.
- Partial capability transfer: if some caps in a multi-cap transfer fail,
already-transferred caps succeed;
extraCapsreflects the successful count.
Sources
- seL4 errors.h: https://github.com/seL4/seL4/blob/master/libsel4/include/sel4/errors.h
- seL4 IPC tutorial: https://docs.sel4.systems/Tutorials/ipc.html
- seL4 fault handlers: https://docs.sel4.systems/Tutorials/fault-handlers.html
- seL4 API reference: https://docs.sel4.systems/projects/sel4/api-doc.html
2. Fuchsia / Zircon
zx_status_t
Signed 32-bit integer. Negative = error, ZX_OK (0) = success.
Categories:
| Category | Examples |
|---|---|
| General | ZX_ERR_INTERNAL, ZX_ERR_NOT_SUPPORTED, ZX_ERR_NO_RESOURCES, ZX_ERR_NO_MEMORY |
| Parameter | ZX_ERR_INVALID_ARGS, ZX_ERR_WRONG_TYPE, ZX_ERR_BAD_HANDLE, ZX_ERR_BUFFER_TOO_SMALL |
| State | ZX_ERR_BAD_STATE, ZX_ERR_NOT_FOUND, ZX_ERR_TIMED_OUT, ZX_ERR_ALREADY_EXISTS, ZX_ERR_PEER_CLOSED |
| Permission | ZX_ERR_ACCESS_DENIED |
| I/O | ZX_ERR_IO, ZX_ERR_IO_REFUSED, ZX_ERR_IO_DATA_INTEGRITY, ZX_ERR_IO_DATA_LOSS |
FIDL Error Handling (Three Layers)
Layer 1: Transport errors. Channel broke. Currently all transport-level
FIDL errors close the channel. Client observes ZX_ERR_PEER_CLOSED.
Layer 2: Epitaphs (RFC-0053). Server sends a special final message
before closing a channel, explaining why. Wire format: ordinal 0xFFFFFFFF,
error status in the reserved uint32 of the FIDL message header. After
sending, server closes the channel.
Layer 3: Application errors (RFC-0060). Methods declare error types:
Method() -> (string result) error int32;
Serialized as:
union MethodReturn {
MethodResult result;
int32 err;
};
Error types constrained to int32, uint32, or an enum thereof. Deliberately
no standard error enum – each service defines its own error domain.
Rationale: standard error enums “try to capture more detail than we think is
appropriate.”
C++ binding: zx::result<T> (specialization of fit::result<zx_status_t, T>).
Sources
- Zircon errors: https://fuchsia.dev/fuchsia-src/concepts/kernel/errors
- RFC-0060 error handling: https://fuchsia.dev/fuchsia-src/contribute/governance/rfcs/0060_error_handling
- RFC-0053 epitaphs: https://fuchsia.dev/fuchsia-src/contribute/governance/rfcs/0053_epitaphs
3. EROS / KeyKOS / Coyotos
KeyKOS Invocation Message Format
KC (Key, Order_code)
STRUCTFROM(arg_structure)
KEYSFROM(arg_key_slots)
STRUCTTO(reply_structure)
KEYSTO(reply_key_slots)
RCTO(return_code_variable)
- Order code: small integer selecting the operation (method selector).
- Return code: integer returned by the invoked object via
RCTO. - Data string: bulk data parameter (up to ~4KB).
- Keys: up to 4 capability parameters in each direction.
Invocation Primitives
- CALL: send + block for reply. Kernel synthesizes a resume key (capability to resume caller) as 4th key parameter to callee.
- RETURN: reply using a resume key + go back to waiting.
- FORK: send and continue (fire-and-forget).
Keeper Error Handling
Every domain has a domain keeper slot. On hardware trap (illegal instruction, divide-by-zero, protection fault):
- Kernel invokes the keeper as if the domain had issued a CALL.
- Keeper receives fault information in the message.
- Keeper can fix and resume (via resume key) or terminate.
- A non-zero return code from a key invocation triggers the keeper mechanism.
Coyotos (EROS Successor) – Formalized Error Model
Cleanly separates invocation-level vs application-level exceptions:
Invocation-level (before the target processes the message):
MalformedSyscall, InvalidAddress, AccessViolation,
DataAccessTypeError, CapAccessTypeError, MalformedSpace,
MisalignedReference
Application-level: signaled via OPR0.ex flag bit in the reply control
word. If set, remaining parameter words contain a 64-bit exception code
plus optional info.
Sources
- KeyKOS architecture: https://dl.acm.org/doi/pdf/10.1145/858336.858337
- Coyotos spec: https://hydra-www.ietfng.org/capbib/cache/shapiro:coyotosspec.html
- EROS (SOSP 1999): https://sites.cs.ucsb.edu/~chris/teaching/cs290/doc/eros-sosp99.pdf
4. Plan 9 / 9P
9P2000 Rerror Format
size[4] Rerror tag[2] ename[s]
ename[s]: variable-length UTF-8 string describing the error.- No
Terrormessage – only servers send errors. - String-based, not numeric. Conventional strings (“permission denied”, “file not found”) but no fixed taxonomy.
9P2000.u Extension (Unix compatibility)
size[4] Rerror tag[2] ename[s] errno[4]
Adds a 4-byte Unix errno as a hint. Clients should prefer the string.
ERRUNDEF sentinel when Unix errno doesn’t apply.
Design Rationale
Avoids “errno fragmentation” where different Unix variants assign different numbers to the same condition. The string is authoritative; the number is an optimization for Unix-compatibility clients.
Sources
- 9P2000 RFC: https://ericvh.github.io/9p-rfc/rfc9p2000.html
- 9P2000.u RFC: https://ericvh.github.io/9p-rfc/rfc9p2000.u.html
5. Genode
RPC Exception Propagation
GENODE_RPC_THROW(func_type, ret_type, func_name,
GENODE_TYPE_LIST(Exception1, Exception2, ...),
arg_type...)
Only the exception type crosses the boundary – exception objects (fields,
messages) are not transferred. Server encodes a numeric Rpc_exception_code,
client reconstructs a default-constructed exception of the matching type.
Undeclared exceptions: undefined behavior (server crash or hung RPC).
Infrastructure-Level Errors
RPC_INVALID_OPCODE: dispatched operation code doesn’t match.Rpc_exception_code: integral type, computed asRPC_EXCEPTION_BASE - index_in_exception_list.Ipc_error: kernel IPC failure (server unreachable).- Server death: capabilities become invalid, subsequent invocations
produce
Ipc_error.
Sources
- Genode RPC: https://genode.org/documentation/genode-foundations/20.05/functional_specification/Remote_procedure_calls.html
- Genode IPC: https://genode.org/documentation/genode-foundations/23.05/architecture/Inter-component_communication.html
6. Cross-System Comparison: Transport vs Application Errors
Every capability/microkernel IPC system separates two failure modes:
-
Transport errors – the invocation mechanism failed before the target processed the request (bad handle, insufficient rights, target dead, malformed message, timeout).
-
Application errors – the service processed the request and returned a meaningful error (not found, resource exhausted, invalid operation).
| System | Transport errors | Application errors |
|---|---|---|
| seL4 | seL4_Error (11 values) from syscall | IPC message payload (user-defined) |
| Zircon | zx_status_t (~30 values) from syscall | FIDL per-method error type |
| EROS/Coyotos | Invocation exceptions (kernel) | OPR0.ex flag + code in reply |
| Plan 9 | Connection loss | Rerror with string |
| Genode | Ipc_error + RPC_INVALID_OPCODE | C++ exceptions via GENODE_RPC_THROW |
| Cap’n Proto RPC | disconnected/unimplemented | failed/overloaded or schema types |
Common pattern: small kernel error code set for transport + typed service-specific errors for application.
7. POSIX errno: Strengths and Weaknesses for Capability Systems
Strengths
- Simple (single integer, zero overhead on success).
- Universal (every Unix developer knows it).
- Low overhead (no allocation on error path).
Weaknesses for Capability Systems
- Ambient authority assumption:
EACCES/EPERMassume ACL-style access control. In capability systems, having the capability IS the permission. - Global flat namespace: all errors share one integer space. Capability systems have typed interfaces; errors should be scoped per-interface.
- No structured information: just an integer, no “which argument” or “how much memory needed.”
- Thread-local state: clobbered by intermediate calls, breaks down with async IPC or promise pipelining.
- No transport/application distinction:
EBADF(transport) andENOENT(application) in the same space. - Not composable across trust boundaries: callee’s errno meaningless in caller’s address space without explicit serialization.
No capability system uses a POSIX-style global errno namespace.
Crash Recovery and Supervision: Prior-Art Survey
Survey of crash recovery, supervision, and failure propagation patterns across production systems. Used as input for the capOS Crash Recovery proposal.
1. Erlang/OTP Supervision Trees
Erlang/OTP is the canonical prior art for declarative crash recovery in a capability-shaped process model.
Supervision strategies
A supervisor declares one of four restart strategies:
one_for_one: only the crashed child is restarted; siblings are unaffected.one_for_all: when any child crashes, every child is terminated and then every child is restarted. Used when children have shared state.rest_for_one: the crashed child and all children started after it (in declaration order) are terminated and restarted. Used when later children depend on earlier ones.simple_one_for_one: a simplifiedone_for_onefor dynamically added homogeneous workers.
Restart intensity
Supervisors carry an intensity (max restart count) and period (seconds
window). If more than intensity restarts occur in any rolling period-second
window, the supervisor terminates all children and then itself, escalating the
failure to its own parent supervisor. The defaults are intensity = 1 and
period = 5; that is, one restart per five seconds before the supervisor
gives up.
Each child spec declares a restart type:
permanent— always restarted.transient— restarted only on abnormal exit (exit reason other thannormal,shutdown, or{shutdown, Term}).temporary— never restarted.
“Let it crash”
The design philosophy is to avoid defensive error-handling at the crash site. A process that encounters an unexpected condition should exit cleanly, relying on its supervisor to restart it in a known-good state. Error recovery code introduces its own bugs; a clean restart from a known-good init is safer.
Linked processes propagate EXIT signals bidirectionally. A supervisor traps
exits (process_flag(trap_exit, true)) and converts them to ordinary messages
{'EXIT', Pid, Reason}, allowing it to react rather than crash itself. Monitors
(erlang:monitor/2) give a unidirectional {'DOWN', Ref, process, Pid, Reason}
without the bidirectional link risk.
Lesson for capOS
- Restart budgets (intensity + period) translate directly: the kernel service supervisor should maintain a crash-loop budget — max N restarts per T seconds — and escalate to a parent authority or enter degraded boot if exceeded.
- The three child restart types (
permanent/transient/temporary) match the restart policy field a capOS service manifest would declare. - “Let it crash” applies: a capability server that encounters an unexpected
decode error or illegal state should exit rather than continue with corrupted
internal state. The supervisor restarts it; stale client caps observe a
DisconnectedCQE before the server is live again.
2. systemd Service Recovery
systemd is the dominant Linux service supervisor. Its restart model is policy-driven, external to the service.
Restart= modes
The Restart= directive accepts: no (default), on-success, on-failure,
on-abnormal, on-watchdog, on-abort, or always.
on-failurecovers non-zero exit codes, signals (including core dump), and watchdog timeout — the common production choice.on-abnormalcovers signals, operation timeouts, and watchdog, but not non-zero exit codes.alwaysrestarts unconditionally.
Timing
RestartSec (default 100 ms) is the delay before a restart attempt. It is
not a backoff — it is a flat delay between each attempt.
Crash-loop budget
StartLimitIntervalSec (default 10 s) and StartLimitBurst (default 5) form
the crash-loop budget: more than StartLimitBurst starts within
StartLimitIntervalSec puts the unit in a permanently failed state until
manually reset or the system reboots. This is the systemd analogue of OTP
intensity/period.
Dependency cascades
OnFailure= lists units to activate when a service enters the failed state;
it is typically used to run a notification or diagnostic unit.
Watchdog
WatchdogSec enables a software watchdog: the service must call
sd_notify(0, "WATCHDOG=1") at intervals shorter than WatchdogSec. If the
heartbeat is absent for the full interval, systemd kills and (if Restart=
includes watchdog triggers) restarts the service. This catches live-lock and
hang states that do not produce a crash signal.
Lesson for capOS
- A capability service watchdog translates to a periodic
sd_notify-style ping to a watchdog capability. If the server does not renew within a budget, the supervisor sendsSIGKILL(or the kernel analogue) and restarts. - The crash-loop budget (
StartLimitIntervalSec/StartLimitBurst) is the second time this pattern appears, reinforcing that a fixed restart budget per time window is the correct primitive. RestartSec(flat delay, not exponential) is simpler than Kubernetes backoff and appropriate for always-available system services.
3. Kubernetes: Probes and CrashLoopBackOff
Kubernetes separates health probes (liveness, readiness, startup) from the container restart policy, giving operators fine-grained control.
Probes
- Liveness probe: if it fails, kubelet kills the container and subjects it to the restart policy. Used to detect live-lock (process alive, making no progress).
- Readiness probe: if it fails, the pod’s IP is removed from all matching Service EndpointSlices. No restart is triggered; the pod stays up but receives no traffic.
- Startup probe: disables liveness and readiness probes until it succeeds, giving slow-starting containers time to initialize without being killed prematurely.
RestartPolicy
Always, OnFailure, or Never. With Always or OnFailure, a failed
container is restarted with exponential backoff: 10 s, 20 s, 40 s, … capped
at 5 minutes. If the container runs successfully for 10 minutes, the backoff
counter resets.
CrashLoopBackOff
When the restart backoff delay is active and the pod is waiting before the
next attempt, the pod status shows CrashLoopBackOff. It is not a terminal
state — the pod will still be restarted — but it indicates the container is
stuck in a restart loop and kubelet is applying backoff.
Lesson for capOS
- The readiness/liveness split maps cleanly: a capOS service can expose two status indicators — “alive” (process is running and heartbeating) and “ready” (service is accepting new capability requests). Supervisors and routing layers can use them independently.
- Exponential backoff with a cap (10 s → 5 min) and a reset window (10 min healthy) is appropriate for user-facing services that should self-heal but not spin continuously.
- The startup probe concept is relevant for services whose init phase takes longer than the steady-state heartbeat budget.
4. Fuchsia Component Framework
Fuchsia’s Component Framework manages component lifecycles and capability routing between components.
Lifecycle states
A component instance progresses through: Created → Resolved → Started → Stopped → (Shutdown) → Destroyed. Stopping preserves persistent state; Destroyed removes it entirely.
Client observation of a crashed component
When a Fuchsia component crashes, the kernel pauses the faulting thread and
delivers a message to registered exception channels. The component’s process
is killed (as if via zx_task_kill()), which closes all Zircon channels held
by that process. Clients observing those channels receive
ZX_CHANNEL_PEER_CLOSED. Component manager receives ZX_CHANNEL_PEER_CLOSED
on the runner channel for the component, allowing it to detect and log the
crash.
Clients that were bound to a crashed component’s exposed protocol channels
also observe ZX_CHANNEL_PEER_CLOSED. Component manager then handles
restarting the component (if configured). A new binding request after restart
provides a fresh channel — there is no automatic reconnection of the
pre-crash channel.
Lesson for capOS
- The Fuchsia model confirms that the clean contract for server death in a
capability system is channel close / peer-closed on all outstanding
client channels. capOS should emit a
DisconnectedCQE to every caller that has a pending request or open session to a server that dies. - There is no implicit re-connect: the client must explicitly re-acquire a new capability to the restarted service. Stale caps acquired before the crash must not be silently re-animated after restart.
5. Microkernel Precedent: seL4 and Genode
seL4
seL4 provides no built-in mechanism to notify a client when the process that holds an endpoint dies. A thread fault (capability fault, VM fault, etc.) triggers the thread’s configured fault endpoint, which notifies a designated fault-handler process. The fault handler can fix and resume, or terminate the faulting thread. However, this is per-thread fault delivery — not a general “server died, notify clients” mechanism.
If a server process is killed (all its capabilities revoked, its CNode
destroyed), outstanding seL4_Call callers remain blocked on the endpoint
permanently unless the endpoint object itself is also destroyed or a reply
capability is used. seL4 has no automatic dead-server notification for
waiting callers. Building supervision requires explicit userspace monitors
(e.g., a watchdog thread with a notification capability polled by the
supervisor).
Genode
Genode’s component model gives the parent ultimate control over its children.
When a component is destroyed (whether intentionally by the parent or due to a
crash), the kernel invalidates all capabilities whose associated RPC object is
destroyed, as a direct side effect of object destruction. Subsequent invocations
of those capabilities by other components produce an Ipc_error exception at the
call site.
The parent observes a graceful exit via the exit() RPC on the parent
interface; it receives no explicit crash notification from the kernel. Detecting
unexpected death requires the parent to poll state reports or use the heartbeat
mechanism in Genode’s init component, which tracks skipped_heartbeats per
monitored child.
Lesson for capOS
- seL4’s silence-on-server-death confirms the gap: callers must not be
silently blocked forever when a server dies. capOS must deliver a
DisconnectedCQE (or equivalent transport-level error) to every pending caller when the server capability is revoked or the process exits. - Genode’s implicit capability invalidation on object destruction is the
right kernel primitive: the kernel, not userspace, ensures no stale cap
can reach a destroyed object. capOS already has this via
CapTablerevocation. - Active death notification to a supervisor capability (rather than polling) is the correct extension — analogous to OTP process monitors.
6. Coredump and Minidump: Capture and Redaction
Core dumps contain a complete snapshot of a process’s address space at the
time of the crash. The Linux kernel writes them via core_pattern; systemd
routes them through systemd-coredump running as a socket-activated service
to enforce access controls and journaling.
The primary security concern is that capability keys, cryptographic material,
and user credentials present in process memory at crash time are written
verbatim to the dump file. systemd-coredump stores dumps in a mode readable
only by root and the process owner, but it provides no built-in redaction of
sensitive memory regions. Disabling core dumps (ulimit -c 0) for
security-sensitive services is the common mitigation.
Two recent vulnerabilities (CVE-2025-4598 in systemd-coredump and CVE-2025-5054 in Apport) demonstrate that race conditions in coredump handlers can allow local privilege escalation via sensitive memory access.
Lesson for capOS
- A capability OS dump is structurally more dangerous than a POSIX dump: the crashed process’s CapTable may contain live capabilities to kernel resources that the dump reader does not possess. Dumping capability indices without revocation could allow replay.
- The correct policy on process crash is to revoke all capabilities of the crashed process before writing any dump — the kernel holds the only authoritative revocation path. A dump tool operating post-revocation sees only dead cap indices, not live authority.
- Memory regions tagged as containing key material (capability ring buffers,
decrypted secrets) should be excluded from dumps; a
MADV_DONTDUMPanalogue applied to sensitive pages at allocation time is the mechanism.
Applicability to capOS
Across all surveyed systems, four design invariants recur:
-
Crash-loop budget. Every production supervisor limits restarts per time window (OTP
intensity/period; systemdStartLimitBurst/StartLimitIntervalSec; Kubernetes CrashLoopBackOff backoff). capOS service manifests should carry amaxRestarts+restartWindowSecsbudget; on exhaustion the supervisor enters a degraded-boot state rather than spinning. -
Dead-server notification is the kernel’s job. seL4 and Genode both demonstrate what happens when the kernel is silent: callers block forever or receive opaque errors. capOS must emit a
DisconnectedCQE to pending callers when a server’s capability is revoked, and must revoke server capabilities atomically on process exit. -
No stale authority after restart. A restarted service gets new capabilities — it does not inherit the pre-crash CapTable. Clients must re-acquire capabilities to the new instance. The Fuchsia model (fresh channel on new binding) and OTP model (new process Pid, old monitors fire
DOWN) both enforce this. -
Watchdog caps complement passive monitoring. systemd’s
WatchdogSecand Genode’s heartbeat mechanism both address live-lock states that produce no crash signal. A watchdog capability that the service must renew periodically is the capOS translation: if the service fails to renew, the supervisor kills and restarts it.
Sources
- Erlang OTP Supervisor Behaviour: https://www.erlang.org/doc/system/sup_princ.html
- Erlang stdlib supervisor module: https://www.erlang.org/doc/apps/stdlib/supervisor.html
- systemd.service(5) man page (Debian): https://manpages.debian.org/jessie/systemd/systemd.service.5.en.html
- Kubernetes Pod Lifecycle: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/
- Kubernetes Probes: https://kubernetes.io/docs/concepts/configuration/liveness-readiness-startup-probes/
- Fuchsia Component Lifecycle: https://fuchsia.dev/fuchsia-src/concepts/components/v2/lifecycle
- Fuchsia Exception Handling: https://fuchsia.dev/fuchsia-src/concepts/kernel/exceptions
- Fuchsia Component Runner FIDL: https://fuchsia.dev/reference/fidl/fuchsia.component.runner
- seL4 Fault Handlers: https://docs.sel4.systems/Tutorials/fault-handlers.html
- seL4 IPC Tutorial: https://docs.sel4.systems/Tutorials/ipc.html
- Genode Recursive System Structure: https://genode.org/documentation/genode-foundations/20.05/architecture/Recursive_system_structure.html
- Genode Init Component: https://genode.org/documentation/genode-foundations/21.05/system_configuration/The_init_component.html
- systemd-coredump documentation: https://systemd.io/COREDUMP/
- CVE-2025-4598 systemd-coredump analysis: https://blogs.oracle.com/linux/analysis-of-cve-2025-4598
- Core dump security (Kicksecure): https://www.kicksecure.com/wiki/Core_Dumps
Debug, Trace, and Profiling Authority: Prior-Art Survey
Survey of how existing systems scope and gate debug, trace, and profiling access. Each section states the verified fact and the lesson it carries for a capability OS.
1. GDB Remote Serial Protocol (gdbstub)
The GDB Remote Serial Protocol (RSP) is the wire protocol between a GDB client and a gdbstub running on or alongside the target. A stub exposes the target’s entire register file and address space to the connected client via a small set of packet types:
g/G— read and write all general-purpose registers.p/P— read and write individual registers.m/M/X— read and write arbitrary memory ranges.Z/z— set and clear software breakpoints, hardware breakpoints, and hardware watchpoints (read, write, or access).s/c— single-step and continue execution.
Feature negotiation (qSupported) lets client and stub advertise extensions,
but the baseline packet set already provides full read/write authority over the
target’s memory and execution state.
Lesson for capOS. A DebugSession capability is not a read-only observer
— it is a read/write authority over the target’s registers, memory, and control
flow. Attaching the stub to a process is itself the high-privilege act; the
session object must be issued by an explicit grant (e.g., a ProcessSpawner
debug grant or a ThreadControl-derived debug capability) rather than derived
from any lesser handle. The gdbstub pattern shows that the session boundary is
the right chokepoint: once a client holds the session capability the protocol
can proceed without further kernel checks.
2. Linux ptrace and the Yama LSM
Ambient-authority problem
Linux ptrace originally allowed any process to PTRACE_ATTACH to any other
process running under the same UID that was marked as dumpable. The kernel docs
summarize the risk: “a single user is able to examine the memory and running
state of any of their processes. For example, if one application (e.g. Pidgin)
was compromised, it would be possible for an attacker to attach to other running
processes (e.g. Firefox, SSH sessions, GPG agent, etc) to extract additional
credentials and continue to expand the scope of their attack.”
Yama ptrace_scope levels
The Yama Linux Security Module adds a sysctl kernel.yama.ptrace_scope with
four levels to progressively restrict this ambient authority:
| Level | Behaviour |
|---|---|
| 0 | Classic: PTRACE_ATTACH to any same-UID dumpable process. |
| 1 | Restricted: only descendants (or processes that have called prctl(PR_SET_TRACER, ...)) may be attached. |
| 2 | Admin-only: only processes holding CAP_SYS_PTRACE may attach. |
| 3 | No attach: PTRACE_ATTACH and PTRACE_TRACEME are blocked system-wide; the setting is irreversible once applied. |
Most Linux distributions now ship with level 1 as the default, but level 0 remains the kernel default if Yama is not loaded.
Lesson for capOS. Yama exists solely because ambient-authority ptrace is a
privilege-escalation footgun. The correct model is the inverse: no process
should be able to attach to another without an explicit, pre-granted capability.
In capOS terms, DebugSession attach must require a pre-issued debug capability
(analogous to level 3 everywhere, not level 0 with an opt-out). The parent
process or init can hold a ThreadControl-derived debug grant; a
RestrictedLauncher can be configured to never issue one. There is no ambient
fallback.
3. Linux perf_events and eBPF gating
perf_event_paranoid
/proc/sys/kernel/perf_event_paranoid is the sysctl controlling
what unprivileged processes may sample:
| Value | Effect |
|---|---|
| -1 | No scope or access restrictions; most permissive. |
| ≥ 0 | Raw tracepoints blocked for unprivileged users. |
| ≥ 1 | CPU-level (system-wide) profiling blocked; per-process only. |
| ≥ 2 | Kernel profiling blocked; user-space events only. |
Debian-based distributions additionally define 4 (block all perf for unprivileged users) and use it as the distro default.
CAP_PERFMON and CAP_BPF
Linux 5.8 introduced CAP_PERFMON to separate performance-monitoring authority
from the broad CAP_SYS_ADMIN. Holding CAP_PERFMON lets a process bypass
perf_event_paranoid scope checks. Similarly, CAP_BPF gates loading BPF
programs that have performance or tracing implications (e.g., kprobes, uprobes,
perf maps); attaching BPF to a kprobe tracepoint requires CAP_PERFMON or
CAP_SYS_ADMIN.
The split reflects the principle of least privilege: a profiling daemon should
not require CAP_SYS_ADMIN merely to sample hardware counters.
Lesson for capOS. Read-only sampling (hardware counters, ring buffer) is a
distinct authority from read/write debugging. capOS should issue a
Sampler capability (read-only, non-interrupting, no memory write) separately
from a DebugSession (register/memory read-write, breakpoints). The sampler
does not stop the target and transfers no writable authority; the
perf/CAP_PERFMON split is the prior-art justification for keeping these two
surfaces apart.
4. Fuchsia / Zircon: handle-scoped debug authority
debug_agent and zxdb
On Fuchsia the debugger is split into two components: debug_agent, a
component running on the target that holds process handles and communicates
with the kernel, and zxdb, the developer-facing client that connects to
debug_agent over a socket. The fuchsia.debugger FIDL library defines the
boundary:
DebugAgent— core protocol;AttachToaccepts a name pattern andFilterTypeto select which processes to attach to.ProcessInfoIterator,AttachedProcessIterator— read access to thread and process state.Launcher— creates newDebugAgentinstances.
The debug_agent acquires process handles from the kernel by being granted
them through the Zircon job/process handle tree. Zircon’s handle model means
that process operations (reading memory, setting breakpoints, receiving
exceptions) all require the caller to hold a process handle with the
appropriate ZX_RIGHT_* bits. A process that does not hold a handle to another
process cannot inspect or modify it, regardless of UID. The zxdb UI can
inspect handle tables of attached processes and displays their
ZX_RIGHT_READ/ZX_RIGHT_WRITE/ZX_RIGHT_INSPECT/ZX_RIGHT_SIGNAL rights
to the developer.
Exception delivery in Zircon is also capability-scoped: zx_task_create_exception_channel
creates a channel on a task (thread, process, or job) object; the caller must
hold that task handle. The resulting channel is read-only and can only receive
exception messages, not issue commands, which means observing crashes requires
the task handle but does not by itself grant write authority.
Lesson for capOS. Fuchsia demonstrates that a production debugger can be
built entirely on object-handle authority without ambient attach. The debug_agent
component acts like a bounded debug authority domain: it holds process handles
for the processes it is authorized to debug, and zxdb interacts only through
the FIDL protocol that debug_agent exposes. The capOS equivalent is a
DebugSession capability issued per target process, scoped to a running
session, with a separate read-only ExceptionChannel cap for crash observation.
5. seL4: TCB capability and debug-build gate
Hardware debug API
seL4 exposes hardware breakpoints, watchpoints, and single-stepping to userspace
via TCB object methods, but only when the kernel is built with
KernelDebugBuild (equivalently, HardwareDebugAPI=1 in the CMake config).
The available TCB invocations are:
seL4_TCB_SetBreakpoint— configure a breakpoint or watchpoint (virtual address, access type: read/write/exec, size).seL4_TCB_GetBreakpoint— read the current configuration.seL4_TCB_UnsetBreakpoint— disable a slot.seL4_TCB_ConfigureSingleStepping— break on every N-th instruction.
Each invocation takes a capability to the target TCB. Only a holder of that TCB capability can manipulate the thread’s debug registers.
Debug-only kernel syscalls
KernelDebugBuild also enables:
seL4_DebugSnapshot— outputs a CapDL dump of the current kernel capability state to the serial console.seL4_DebugDumpScheduler— dumps TCB addresses, thread names, instruction pointers, priorities, and scheduler states.
These syscalls expose global kernel state and are intentionally excluded from verified (proof) builds where the information flow would violate the formal security model.
Lesson for capOS. seL4 gates per-thread debug authority on possession of
the TCB capability, which is the right model. capOS’s DebugSession should
similarly be derived from ThreadControl so that only the process or entity
that holds ThreadControl for a thread can open a debug session on it.
The seL4_DebugSnapshot pattern also shows that a system-wide cap-table
snapshot is a separate, higher-privilege operation from per-thread debug access;
in capOS a read-only CapTableSnapshot authority can be issued for audit
purposes without granting register/memory write access.
6. Genode: capability-session GDB monitor
Architecture
Genode implements user-level debugging via a GDB monitor component that interposes between a target application and its parent. The GDB monitor:
- Intercepts session requests from the target before they reach the parent.
- Provides local virtual implementations of the CPU service, RM (region-map/address-space) service, and ROM service, wrapping the real core implementations.
- Exposes a gdbserver protocol endpoint over a terminal session (TCP or UART).
This gives the GDB monitor “full control over all threads and memory objects (dataspace) and the address space of the target.” The monitor holds real capabilities to the target’s CPU and address-space sessions; the target’s own session handles are virtualized stubs that forward to the monitor.
Capability session scoping
Genode’s Cpu_session interface allows retrieving and modifying thread register
and execution state. The API comment in the framework explicitly notes that
these operations are “primarily designated for realizing user-level debuggers.”
Because the monitor interposes the CPU session, it holds the same authority the
parent would hold, but the target holds only stubs — the target cannot see or
touch its own debug registers directly.
Lesson for capOS. The Genode monitor pattern reinforces that debugging
authority flows from capability delegation, not from process identity. The
interposition model also clarifies the ring-trace design decision: debug_tap
in capOS captures SQE/CQE ring records passively and does not require
interposing a CPU session, which keeps ring-trace authority weaker and
non-interrupting by construction. A full DebugSession (register read/write,
breakpoints) requires explicit session acquisition from the parent or init,
matching the Genode monitor’s explicit CPU-session grant.
Applicability to capOS
The cross-system survey points to a consistent set of design invariants:
-
DebugSessionattach is an explicit, audited capability grant, not ambient. The anti-pattern is Linux ptrace at level 0; Yama level 3 is the correct default posture. In capOS no process inherits the ability to debug another: aDebugSessionis derived fromThreadControl, issued by the process’s parent or init, and recorded in the audit log. -
Read-only cap-table snapshots transfer no authority. seL4’s
seL4_DebugSnapshotis a separate, opt-in, debug-build-only facility. In capOS aCapTableSnapshotcap can be issued for audit visibility without granting any write access to the observed process. -
Ring-trace builds on
debug_tapand does not stop the target.perf/CAP_PERFMONshows that sampling is a distinct authority class from full debugging. capOSdebug_tapring records are append-only, non-interrupting, and do not feed back into the target’s execution — matching the sampler authority class, not theDebugSessionclass. -
Sampler does not stop the target. Hardware performance counter sampling (
CAP_PERFMONsemantics) and ring-record sampling (debug_tap) are passive read surfaces. ADebugSessionthat can set breakpoints, modify registers, or write memory is a distinct, higher-privilege capability and must not be conflated with passive tracing. -
Exception observation is weaker than debug write authority. Zircon’s
zx_task_create_exception_channelreturns a read-only channel. capOS should provide a similarExceptionObservercapability (receive crash notifications, no write access) independent ofDebugSession.
Sources
- GDB Remote Serial Protocol (Embecosm application note): https://www.embecosm.com/appnotes/ean4/embecosm-howto-rsp-server-ean4-issue-2.html
- GDB remote protocol packet reference (Apple/GNU): https://developer.apple.com/library/archive/documentation/DeveloperTools/gdb/gdb/gdb_33.html
- Linux kernel Yama documentation: https://docs.kernel.org/admin-guide/LSM/Yama.html
- Linux perf events security (kernel docs): https://docs.kernel.org/admin-guide/perf-security.html
perf_event_open(2)man page: https://man7.org/linux/man-pages/man2/perf_event_open.2.html- LWN: Introducing
CAP_PERFMON: https://lwn.net/Articles/816647/ - Fuchsia
fuchsia.debuggerFIDL reference: https://fuchsia.dev/reference/fidl/fuchsia.debugger - Fuchsia debugger (zxdb) overview: https://fuchsia.dev/fuchsia-src/development/debugger
- Fuchsia Zircon handles and rights: https://fuchsia.dev/fuchsia-src/concepts/kernel/handles
- Fuchsia exception handling: https://fuchsia.dev/fuchsia-src/concepts/kernel/exceptions
- seL4 API reference (debug syscalls and TCB hardware debug): https://docs.sel4.systems/projects/sel4/api-doc.html
- seL4 hardware debug tutorial: https://docs.sel4.systems/projects/sel4-tutorials/debugging-userspace.html
- seL4 kernel configurations (
KernelDebugBuild): https://docs.sel4.systems/projects/sel4/configurations.html - Genode GDB developer resources: https://genode.org/documentation/developer-resources/gdb
- Genode session interfaces (CPU session): https://genode.org/documentation/genode-foundations/20.05/functional_specification/Session_interfaces_of_the_base_API.html
IX-on-capOS Hosting Research
Research note on using IX as a package corpus and content-addressed build model for a more mature capOS system. It explains what IX provides, why it is useful for capOS, and how to extract the most value from it without importing CPython/POSIX assumptions as an architectural dependency.
capOS alignment note (2026-05-16): None of the stages described here (Stages A–F) are implemented. The capability-native services sketched in this note (BuildCoordinator, Store, Namespace, Fetcher, Archive, BuildSandbox) do not yet exist. Cloud usable-instance work, which IX hosting depends on, remains blocked on
DMAPool/DeviceMmio/Interruptauthority and a production NIC/storage driver path. The POSIX adapter track (Phase P1.4) is proceeding independently. IX hosting is future work contingent on a credible userspace-compatibility and storage foundation.
What IX Is
IX is a source-based package/build system. It describes packages as templates, expands those templates into build descriptors and shell scripts, fetches and verifies source inputs, executes dependency-ordered builds, stores outputs in a content-addressed store, and publishes usable package environments through realm mappings.
For capOS, IX should be treated as three separable assets:
- a package corpus with thousands of package definitions and accumulated build knowledge;
- a content-addressed build/store model that already fits reproducible artifact management;
- a compact Python control plane that can be adapted once authority-bearing operations move behind capOS services.
IX should not be treated as a requirement to reproduce Unix inside capOS. Its current implementation uses CPython, Jinja2, subprocesses, shell tools, filesystem paths, symlinks, hardlinks, signals, and process groups because it runs on Unix-like hosts today. Those are implementation assumptions, not the part worth preserving unchanged.
Why IX Is Useful for capOS
capOS needs a credible path from isolated demos to a useful userspace closure. IX is useful because it supplies a package/build corpus and model that can exercise the exact system boundaries capOS needs to grow:
- process spawning with explicit argv, env, cwd, stdio, and exit status;
- fetch, archive extraction, and content verification as auditable services;
- Store and Namespace capabilities instead of ambient global filesystem authority;
- build sandboxing with explicit input, scratch, output, network, and resource policies;
- static-tool bootstrapping before a full dynamic POSIX environment exists;
- differential testing against the existing host IX implementation.
The main value is leverage. IX can give capOS real package metadata, real build scripts, and real toolchain pressure without making CPython or a broad POSIX personality the first required userspace milestone.
Best Way to Get the Most from IX
The optimal strategy is to preserve IX’s package corpus and build semantics while replacing the Unix-shaped execution boundary with capability-native services.
The high-value path is:
- Run upstream IX on the host first to build and validate early capOS artifacts.
- Use CPython/Jinja2 on the host as a reference oracle, not as the in-system foundation.
- Render IX templates through a Rust
ix-templatecomponent that implements the subset IX actually uses. - Run the adapted IX planner/control plane on native MicroPython once capOS has enough runtime support.
- Move fetch, extract, build, Store commit, Namespace publish, and process lifecycle into typed capOS services.
This gets most of IX’s value: package knowledge, reproducible build structure, and a practical self-hosting path. It avoids the lowest-value part: spending early capOS effort on a large CPython/POSIX compatibility layer just to preserve upstream implementation details.
Position
CPython is not an architectural prerequisite for IX-on-capOS.
It is a compatibility shortcut for running upstream IX with minimal changes. For a clean capOS-native integration, the better design is:
- keep IX’s package corpus and content-addressed build model;
- adapt IX’s Python control-plane code instead of preserving every CPython and POSIX assumption;
- run the adapted control plane on a native MicroPython port;
- move build execution, fetching, archive extraction, store mutation, and sandboxing into typed capOS services;
- render IX templates through a Rust template service or tightly scoped IX template engine, not full Jinja2 on MicroPython;
- keep CPython on the host as a differential test oracle and bootstrap tool, not as a required foundation layer for capOS.
MicroPython is a credible sweet spot only with that boundary. It is not a credible sweet spot if the requirement is “make upstream Jinja2, subprocess, fcntl, process groups, and Unix filesystem behavior all work inside MicroPython.”
Sources Inspected
- Upstream IX repository:
https://github.com/pg83/ix - IX package guide:
PKGS.md - IX core:
core/ - IX templates:
pkgs/die/ - Bundled IX template deps:
deps/jinja-3.1.6/,deps/markupsafe-3.0.3/ - MicroPython library docs:
https://docs.micropython.org/en/latest/library/index.html - MicroPython CPython-difference docs:
https://docs.micropython.org/en/latest/genrst/ - MicroPython porting docs:
https://docs.micropython.org/en/latest/develop/index.html - Jinja docs:
https://jinja.palletsprojects.com/en/latest/intro/ - MiniJinja docs:
https://docs.rs/minijinja/latest/minijinja/
Upstream IX Shape
IX is a source-based, content-addressed package/build system. Package
definitions are Jinja templates under pkgs/, mostly named ix.sh, and the
template hierarchy under pkgs/die/ expands those package descriptions into
JSON descriptors and shell build scripts.
The inspected clone has:
- 3788 package
ix.shfiles; - 66 files under
pkgs/die; - a template chain centered on
base.json,ix.json,script.json,sh0.sh,sh1.sh,sh2.sh,sh.sh,base.sh,std/ix.sh, and language/build-system templates for C, Rust, Go, Python, CMake, Meson, Ninja, WAF, GN, Kconfig, and shell-only generated packages.
The IX template surface is broad but not arbitrary Jinja. In the package tree surveyed, the Jinja tags used were:
| Tag | Count |
|---|---|
block | 14358 |
endblock | 14360 |
extends | 3808 |
if / endif | 451 / 451 |
include | 344 |
else | 123 |
set / endset | 52 / 52 |
for / endfor | 49 / 49 |
elif | 23 |
No macro, import, from, with, filter, raw, or call tags were
found in the inspected tree. That matters: IX’s template needs are probably a
finite subset around inheritance, blocks, self.block(), super(), includes,
conditionals, loops, assignments, expressions, and custom filters.
IX’s own Jinja wrapper is small. core/j2.py defines:
- custom loader with
//root handling; - include inlining;
- filters such as
b64e,b64d,jd,jl,group_by,basename,dirname,ser,des,lines,eval,defined,field,pad,add,preproc,parse_urls,parse_list,list_to_json, andfjoin.
That makes the template layer replaceable. The risk is not “Jinja is impossible.” The risk is “full upstream Jinja2 drags in a CPython-shaped runtime just to implement a template subset IX mostly uses in a disciplined way.”
Current IX Runtime Surface
The IX Python core uses ordinary host-scripting features:
os,os.path,json,hashlib,base64,random,string,functools,itertools,platform,getpass;shutil.which,shutil.rmtree,shutil.move;subprocess.run,check_call,check_output;os.execvpe,os.kill,os.setpgrp,signal.signal;fcntl.fcntlto reset stdout flags;asynciofor graph scheduling;multiprocessing.cpu_count;contextvarsfallback support forasyncio.to_thread;tarfile,zipfile;ssl,urllib3, usually only to suppress certificate warnings while fetchers are shell-driven;os.symlink,os.link,os.rename,os.makedirs,open, and file tests.
core/execute.py is the important boundary. It schedules a DAG, prepares
output directories, calls shell commands with environment variables and stdin,
checks output touch files, and kills the process group on failure.
core/cmd_misc.py and core/shell_cmd.py cover fetch, extraction, hash
checking, archive unpacking, and hardlinking fetched inputs.
core/realm.py maps build outputs into realm names using symlinks and metadata
under /ix/realm.
core/ops.py selects an execution mode. Today the modes are local, system,
fake, and molot. A capOS executor mode is the correct integration point.
CPython Path
CPython is the obvious route for upstream compatibility:
- upstream Jinja2 is designed for modern Python and uses normal CPython-style standard library facilities;
- IX’s current Python code assumes
subprocess,asyncio,fcntl,shutil, archive modules, and process semantics; - CPython plus
libcapos-posixwould let a large fraction of that code run with limited changes.
That does not make CPython the right product dependency for IX-on-capOS. CPython pulls in a large libc/POSIX surface and encourages preserving Unix process and filesystem assumptions that capOS should make explicit through capabilities.
CPython should be used in two places:
- Host-side bootstrap and reference evaluation.
- Optional compatibility mode once
libcapos-posixis mature.
It should not be the required path for a clean IX-capOS integration.
If CPython is needed later, capOS has two routes:
- Native CPython through musl plus
libcapos-posix. - CPython compiled to WASI and run through a native WASI runtime.
The native POSIX route is the only route that makes sense for IX-style build
workloads. It needs fd tables, path lookup, read/write/close/lseek, directory
iteration, rename/unlink/mkdir, time, memory mapping, posix_spawn, pipes,
exit status, and eventually sockets. That is the same compatibility work
needed for shell tools and build systems, so it should arrive as part of the
general userspace-compatibility track, not as an IX-specific dependency.
The WASI route is useful for sandboxed or compute-heavy Python, but it is a poor fit for IX package builds because IX fundamentally drives external tools, filesystem trees, fetchers, and process lifecycles. WASI CPython can be useful as a script sandbox, not as the main IX appliance runtime.
MicroPython Path
MicroPython is attractive because capOS needs an embeddable system scripting runtime before it needs a full desktop Python environment.
The upstream docs frame MicroPython as a Python implementation with a smaller,
configurable library set. The latest library docs list micro versions of
modules relevant to IX, including asyncio, gzip, hashlib, json, os,
platform, random, re, select, socket, ssl, struct, sys, time,
zlib, and _thread, while warning that most standard modules are subsets
and that port builds may include only part of the documented surface.
That is a good fit for capOS. It means a capOS port can expose a deliberately chosen OS surface instead of pretending to be Linux.
MicroPython should host:
- package graph traversal;
- package metadata parsing;
- target/config normalization;
- dependency expansion;
- high-level policy;
- command graph generation;
- calls into capOS-native services.
MicroPython should not own:
- generic subprocess emulation;
- shell execution internals;
- process groups or Unix signals;
- TLS/network fetching;
- archive formats beyond small helper cases;
- hardlink/symlink implementation;
- content store mutation;
- build sandboxing;
- parallel job scheduling if that wants kernel-visible resource control.
Those belong in capOS services.
Native MicroPython Port Shape
A capOS MicroPython port should be a new MicroPython platform port, not the Unix port with a large compatibility shim underneath.
The port should provide:
- VM startup through
capos-rt; - heap allocation from a fixed initial heap first, then
VirtualMemorywhen growth is available; - stdin/stdout/stderr backed by granted stream or Console capabilities;
- module import from a read-only Namespace plus frozen modules;
- a small VFS adapter over Store/Namespace for scripts and package metadata;
- native C/Rust extension modules for capOS capabilities;
- deterministic error mapping from capability exceptions to Python exceptions.
The initial built-in surface should be deliberately small:
syswith argv/path/modules;ospath and file operations backed by a granted namespace;timebacked by a clock capability;hashlib,json,binascii/base64,random,struct;- optional
asyncioif the planner keeps Python-level concurrency; - no general-purpose
subprocessuntil the service boundary proves it is necessary.
For IX, the MicroPython port should ship frozen planner modules and native
bindings to ix-template, BuildCoordinator, Store, Namespace, Fetcher,
and Archive. That keeps the trusted scripting surface small and avoids
import-time dependency drift.
Jinja2 and MicroPython
Full Jinja2 compatibility on MicroPython remains unproven and is probably not the optimal target.
Current Jinja docs say Jinja supports Python 3.10 and newer, depends on
MarkupSafe, and compiles templates to optimized Python code. The bundled IX
Jinja tree imports modules such as typing, weakref, importlib,
contextlib, inspect, ast, types, collections, itertools, io, and
MarkupSafe. Some of these can be ported or stubbed, but that is a CPython
compatibility project, not a small MicroPython extension.
The better path is to treat IX’s template language as an input format and render it with a capOS-native component.
Recommended template strategy:
- Build an
ix-templateRust component using MiniJinja or a smaller IX-specific template subset. - Register IX’s custom filters from
core/j2.py. - Implement IX’s loader semantics:
//package-root paths, relative includes, and cached sources. - Reject unsupported Jinja constructs with deterministic errors.
- Keep CPython/Jinja2 as a host-side oracle for differential testing until the capOS renderer matches the package corpus.
MiniJinja is a practical candidate because it is Rust-native, based on Jinja2
syntax/behavior, supports custom filters and dynamic objects, and has feature
flags for trimming unused template features. IX needs multi-template support
because it uses extends, include, and block.
If MiniJinja compatibility is insufficient, the fallback is not CPython by
default. The fallback is an IX-template subset evaluator that implements the
constructs actually used by pkgs/.
Optimal Architecture
The clean design is an IX-capOS build appliance, not a Unix personality layer that happens to run IX.
flowchart TD
CLI[ix CLI or build request] --> Planner[ix planner on MicroPython]
Planner --> Template[ix-template renderer]
Planner --> Graph[normalized build graph]
Template --> Graph
Graph --> Coordinator[capOS BuildCoordinator service]
Coordinator --> Fetcher[Fetcher service]
Coordinator --> Extractor[Archive service]
Coordinator --> Store[Store service]
Coordinator --> Sandbox[BuildSandbox service]
Fetcher --> Store
Extractor --> Store
Sandbox --> Proc[ProcessSpawner]
Sandbox --> Scratch[writable scratch namespace]
Sandbox --> Inputs[read-only input namespaces]
Proc --> Tools[sh, make, cc, cargo, go, coreutils]
Sandbox --> Output[write-once output namespace]
Output --> Store
Store --> Realm[Namespace snapshot / realm publish]
The planner remains small and scriptable. The authority-bearing work happens in services:
BuildCoordinator: owns graph execution and job state.Store: content-addressed objects and output commits.Namespace: names, realms, snapshots, and package environments.Fetcher: network-capable source acquisition with explicit TLS and cache policy.Archive: deterministic extraction and path-safety checks.BuildSandbox: constructs per-build capability sets.ProcessSpawner: starts shell/tools with controlled argv, env, cwd, stdio, and granted capabilities.Toolchainpackages: statically linked tools built externally first, then eventually by IX itself.
The adapted IX planner should call service APIs instead of shelling out for operations that are native capOS concepts.
Control-Plane Boundary
MicroPython should see a narrow, high-level API. It should not synthesize Unix from first principles.
Example shape:
import ixcapos
import ixtemplate
pkg = ixcapos.load_package("bin/minised")
desc = ixtemplate.render_package(pkg.name, pkg.context)
graph = ixcapos.plan(desc, target="x86_64-unknown-capos")
result = ixcapos.build(graph)
ixcapos.publish_realm("dev", result.outputs)
The Python layer can still look like IX. The implementation behind it should be capability-native.
Service API Sketch
The exact schema should follow the project schema style, but this is the shape of the boundary:
interface BuildCoordinator {
plan @0 (package :Text, target :Text, options :BuildOptions)
-> (graph :BuildGraph);
build @1 (graph :BuildGraph) -> (result :BuildResult);
publish @2 (realm :Text, outputs :List(OutputRef))
-> (namespace :Namespace);
}
interface BuildSandbox {
run @0 (command :Command, inputs :List(Namespace),
scratch :Namespace, output :Namespace, policy :SandboxPolicy)
-> (status :ExitStatus, log :BlobRef);
}
interface Fetcher {
fetch @0 (url :Text, sha256 :Data, policy :FetchPolicy)
-> (blob :BlobRef);
}
interface Archive {
extract @0 (archive :BlobRef, policy :ExtractPolicy)
-> (tree :Namespace);
}
Important policy fields:
- network allowed or denied;
- wall-clock and CPU budgets;
- maximum output bytes;
- allowed executable namespaces;
- allowed output path policy;
- whether timestamps are normalized;
- whether symlinks are preserved, rejected, or translated;
- whether hardlinks become store references or copied files.
Store and Realm Mapping
IX’s /ix/store maps well to capOS Store.
IX’s realms should not be literal symlink trees in capOS. They should be named Namespace snapshots:
| IX concept | capOS mapping |
|---|---|
/ix/store/<uid>-name | Store object/tree with stable content hash and metadata |
| build output dir | write-once output namespace |
| build temp dir | scratch namespace with cleanup policy |
| realm | named Namespace snapshot |
| symlink from realm to output | Namespace binding or bind manifest |
| hardlinked source cache | Store reference or copy-on-write blob binding |
touch output sentinel | build-result metadata, optionally synthetic file for compatibility |
This preserves IX’s reproducibility model without importing global Unix authority.
Process and Filesystem Requirements
A mature capOS needs these primitives before IX builds can run natively:
ProcessSpawnerandProcessHandle;- argv/env/cwd/stdin/stdout/stderr passing;
- exit status;
- pipes or stream capabilities;
- fd-table support in the POSIX layer for ported tools;
- read-only input namespaces;
- writable scratch namespaces;
- write-once output namespaces;
- directory listing, create, rename, unlink, and metadata;
- symlink translation or explicit rejection policy;
- hardlink translation or store-reference fallback;
- monotonic time;
- resource limits;
- cancellation.
For package builds, the tool surface is larger than IX’s Python surface:
sh;find,sed,grep,awk,sort,xargs,install,cp,mv,rm,ln,chmod,touch,cat;tar,gzip,xz,zstd,zip,unzip;make,cmake,ninja,meson,pkg-config;- C compiler/linker/archive tools;
cargoand Rust toolchains;- Go toolchain;
- Python only for packages that build with Python.
IX’s static-linking bias helps because the early tool closure can be imported as statically linked binaries.
What to Patch Out of IX
For a clean capOS fit, patch or replace these upstream assumptions:
| Upstream assumption | capOS replacement |
|---|---|
subprocess.run everywhere | BuildSandbox.run() or ProcessSpawner |
process groups and SIGKILL | ProcessHandle.killTree() or sandbox cancellation |
fcntl stdout flag reset | remove or make no-op |
chrt, nice | scheduler/resource policy on sandbox |
sudo, su, chown | no permission-bit authority; use capability grants |
unshare, tmpfs, jail | BuildSandbox with explicit caps |
/ix/store global path | Store capability plus namespace mount view |
/ix/realm symlink tree | Namespace snapshot/publish |
| hardlinks for fetched files | Store refs or copy fallback |
curl/wget subprocess fetch | Fetcher service |
Python tarfile/zipfile | Archive service |
asyncio executor | BuildCoordinator scheduler |
This is more invasive than a “light patch”, but it is cleaner. The IX package corpus and target/build knowledge are preserved; Unix process plumbing is not.
MicroPython Port Scope
The MicroPython port should be sized around IX planner needs plus general system scripting:
Native modules:
capos: bootstrap capabilities, typed capability calls, errors.ixcapos: package graph and build-service client bindings.ixtemplate: template render calls if the renderer is an embedded Rust/C component.ixstore: Store and Namespace helpers.
Python/micro-library requirements:
json;hashlib;base64orbinascii;os.pathsubset;random;time;- small
shutilsubset for path operations if old IX code remains; - small
asyncioonly if planner concurrency remains in Python.
Avoid implementing:
- general
subprocess; - general
fcntl; - full
signal; - full
multiprocessing; - full
tarfile; - full
zipfile; - full
ssl/urllib3; - full Jinja2.
Those are symptoms of preserving the wrong boundary.
CPython Still Has a Role
CPython remains useful even if it is not a capOS prerequisite:
- run upstream IX on the development host;
- compare rendered descriptors from CPython/Jinja2 against
ix-template; - generate fixtures for the capOS renderer;
- bootstrap the first static tool closure;
- serve as a later optional POSIX compatibility demo.
Differential testing should be explicit:
flowchart LR
Pkg[IX package] --> Cpy[Host CPython + Jinja2]
Pkg --> Cap[capOS ix-template]
Cpy --> A[descriptor A]
Cap --> B[descriptor B]
A --> Diff[normalized diff]
B --> Diff
Diff --> Corpus[compatibility corpus]
This makes CPython a test oracle, not a trusted runtime dependency inside capOS.
Staged Plan
Stage A: Host IX builds capOS artifacts
Run IX on Linux host first. Add a capos target and recipes for static capOS
ELFs. This validates package metadata, target triples, linker flags, and static
closure assumptions before capOS hosts any of it.
Outputs:
x86_64-unknown-capostarget model in IX;- recipes for
libcapos,capos-rt, shell/coreutils candidates, MicroPython, and archive/fetch helpers; - static artifacts imported into the boot image or Store.
Stage B: Template compatibility harness
Build ix-template on the host. Render a package corpus through CPython/Jinja2
and through ix-template. Normalize JSON/script output and record divergences.
Outputs:
- supported IX template subset;
- custom filter implementation;
- fixture corpus;
- list of unsupported packages or constructs.
Stage C: Native MicroPython port
Port MicroPython to capOS as a normal native userspace program using
capos-rt and a small libc/POSIX subset only where needed.
Outputs:
- REPL or script runner;
- frozen IX planner modules;
- native
capos,ixcapos, andixtemplatemodules; - no promise of full CPython compatibility.
Stage D: BuildCoordinator and sandboxed execution
Implement capOS-native build services and run simple package builds using externally supplied static tools.
Outputs:
- build graph execution;
- per-build scratch/output namespaces;
- deterministic logs and output commits;
- cancellation and resource policies.
Stage E: IX package corpus migration
Patch IX templates for capOS target semantics. Start with simple C/static packages, then Rust, then Go.
Outputs:
- C/static package subset;
- regular Rust package support once regular Rust runtime/toolchain work is ready;
- Go package support when
GOOS=caposor imported Go toolchain support is credible; - WASI packages as a separate target family where useful.
Stage F: Self-hosting
Run the IX-capOS appliance inside capOS to rebuild a meaningful part of its own userspace closure.
Outputs:
- build the MicroPython IX planner inside capOS;
- build core shell/coreutils/archive tools inside capOS;
- build
libcaposand selected static service binaries; - eventually build Rust and Go runtime/toolchain pieces.
Why This Is Better Than “CPython First”
The CPython-first route optimizes for running upstream IX quickly. The MicroPython-plus-services route optimizes for capOS’s actual design:
- capability authority stays typed and explicit;
- build isolation is native instead of Linux namespace emulation;
- Store/Namespace are first-class rather than hidden behind
/ix; - fetch/archive/build operations are auditable services;
- the scripting runtime remains small;
- the system does not need full CPython before it can have a package manager;
- CPython can still be added later through the POSIX layer without blocking IX-capOS.
The tradeoff is that IX-capOS becomes a real port/fork at the control-plane boundary. That is acceptable for a clean capability-native fit.
Risks
Template compatibility is the main technical risk. IX uses a restricted-looking
Jinja subset, but exact self.block(), super(), whitespace, expression, and
undefined-value behavior must match closely enough for package hashes to remain
stable. This needs corpus testing, not confidence.
Build-script compatibility is the largest scope risk. Even if IX planning is native, the package corpus still executes conventional build systems. capOS must provide enough shell, coreutils, archive, compiler, and filesystem behavior for those tools.
Toolchain bootstrapping is a long dependency chain. The first useful IX-capOS system will import statically linked tools from a host. Native self-hosting is late-stage work.
Store semantics need care around directories, symlinks, hardlinks, mtimes, and executable bits. These details affect build reproducibility and package compatibility.
MicroPython must not grow into a bad CPython clone. If many missing modules are implemented only to satisfy upstream IX assumptions, the design boundary has failed.
Recommendation
Adopt IX as a package corpus and build model, not as a CPython/POSIX program to preserve unchanged.
The optimal capOS-native solution is:
- Host-side upstream IX remains available for bootstrap and oracle tests.
ix-templatein Rust renders the actual IX template subset.- Native MicroPython runs the adapted IX planner/control plane.
- capOS services execute all authority-bearing operations: fetch, extract, build sandbox, Store commit, Namespace publish, and process lifecycle.
- CPython is deferred to general POSIX compatibility and optional tooling.
This makes MicroPython the sweet spot for the in-system IX control plane while avoiding the trap of turning MicroPython into CPython.
Pingora Architecture and Philosophy: Research Report for capOS
Research on Cloudflare’s Pingora framework and whether capOS high-level interfaces should borrow its shape.
Status 2026-06-10 13:25 UTC: the kernel Phase B socket path described below is
since retired — the kernel socket owner, TcpSocket.intoTerminalSession, and
the telnet-gateway demo are removed, and the production socket path is the
Phase C userspace network stack. The directional guidance still applies, read
against the userspace stack.
Status 2026-05-23 00:06 UTC: the directional guidance in this report remains
current. Since the report was first written, Phase B of the networking proposal
has landed: NetworkManager, TcpListener, TcpSocket, and TcpSocket.intoTerminalSession
are implemented in-kernel; the telnet-gateway userspace service runs on a
manifest-forwarded TcpListenAuthority and RestrictedShellLauncher, exercising
the accept/negotiate/session-mint/shell-launch/cleanup lifecycle described in
the “Concrete capOS Direction” section. Bounded SSH gateway prerequisites
(SshHostKey, AuthorizedKeyStore, public-key session minting, restricted shell
launch) are implemented as kernel stubs and fixture proofs; encrypted SSH
transport and an OpenSSH-compatible handshake are not yet implemented.
capos-service slice 1 has landed as a standalone no_std crate: the plaintext
Telnet gateway now uses ServiceMain/ServiceRuntime for initialize,
dependency-wait, readiness, and run-loop structure. The
TerminalSessionFromByteStream / byte-stream terminal host, endpoint-loop
helpers, metrics, budgeting, and graceful handoff pieces remain open work in
docs/proposals/capos-service-proposal.md.
Bottom Line
capOS should build some high-level userspace interfaces inspired by Pingora’s architecture, but should not make Pingora’s HTTP proxy model, callback set, or runtime structure part of the kernel ABI.
The useful idea is not “copy Pingora.” The useful idea is an opinionated library layer that owns repetitive service mechanics and exposes a typed, phase-oriented customization surface to application code. For capOS, that belongs above the capability ring, in userspace libraries and domain services:
capos-rtremains the raw transport owner: bootstrap, CapSet, ring client, typed handles, completion matching, release flushing, exception decoding.capos-serviceshould own service lifecycle mechanics: endpoint receive/return loops, readiness, dependency waiting, shutdown, background tasks, metrics hooks, and graceful handoff.- Domain libraries such as
libcapos-http, terminal hosts, network services, storage services, and agent services can expose Pingora-style phase hooks for their specific request lifecycle. - Kernel capability interfaces should stay narrow, typed, and stable. Do not
add a generic
Servicecapability, callback registry, plugin API, or Pingora-like phase machine to the kernel.
This is a “yes, but only at the userspace framework layer” recommendation.
Sources
Primary external sources:
- Cloudflare, How we built Pingora, the proxy that connects Cloudflare to the Internet, 2022-09-14.
- Cloudflare, Open sourcing Pingora: our Rust framework for building programmable network services, 2024-02-28.
- Cloudflare Pingora repository, README.
- Pingora docs, Internals.
- Pingora docs, Life of a request: phases and filters.
- Pingora docs, Sharing state across phases with CTX.
- Pingora docs, Connection pooling and reuse.
- Pingora docs, Handling failures and failover.
- Pingora docs, Graceful restart and shutdown.
- Pingora docs, Configuration.
- Pingora docs, How to return errors.
- Pingora source snapshot inspected locally:
c0adfd32c216a3bec14371ec4467236f34a6f9db, dated 2026-04-17. - Pingora source files inspected at that snapshot:
server/mod.rs,services/mod.rs,services/listening.rs,services/background.rs,apps/mod.rs,proxy_trait.rs,pingora-runtime/src/lib.rs, andpingora-error/src/lib.rs. - Pingora 0.8.0 changelog, CHANGELOG.md.
- Cloudflare, Resolving a request smuggling vulnerability in Pingora, 2025-05-22.
- Cloudflare, Fixing request smuggling vulnerabilities in Pingora OSS deployments, 2026-03-09.
capOS grounding read for this comparison:
- Capability Model
- Capability Ring
- IPC and Endpoints
- Userspace Runtime
- Service Architecture
- Networking
- Capability-Based and Microkernel Operating Systems Survey
- Genode
- Plan 9 and Inferno
- Zircon
- seL4
What Pingora Is
Pingora is a Rust framework for building programmable network services, especially HTTP proxies. Cloudflare built it after concluding that NGINX’s process/worker architecture and extension model were limiting performance, connection reuse, safety, and feature velocity at Cloudflare scale.
The original design pressure matters:
- NGINX’s per-worker connection pools harmed reuse as worker count increased. Pingora’s shared multithreaded architecture improved origin connection reuse and reduced new TCP/TLS handshakes.
- Cloudflare wanted a statically typed, memory-safe implementation language rather than a C core plus Lua extension layer.
- Cloudflare chose to implement its own HTTP handling rather than rely on an off-the-shelf library because it needed control over non-standard Internet traffic and product-specific behavior.
- Pingora is a library and toolset, not a finished proxy binary. Users build their own executable around Pingora’s server, service, and proxy APIs.
That last point is the main architectural lesson for capOS: the framework is valuable because it packages the hard reusable mechanics while leaving product logic in typed extension points.
Architecture
Server, Services, and Applications
Pingora’s top-level Server represents one process. It owns configuration,
CLI handling, daemonization, signal handling, service startup, graceful
shutdown, and zero-downtime upgrade mechanics. A Server hosts multiple
services.
A Service is the long-running unit of work. Listening services own one or
more endpoints and an application object. Background services run supporting
tasks such as discovery, health checks, metrics, or bootstrap logic. Recent
Pingora versions also include service dependency metadata, readiness watches,
and topological startup ordering.
The layering is deliberately split:
Serverowns process-level operation.Serviceowns listener setup, endpoint accept loops, runtime choice, and shutdown propagation.ServerApphandles an established transport stream.HttpServerAppadds HTTP session negotiation and H1/H2 handling.HttpProxyimplements the HTTP proxy workflow.- User code implements
ProxyHttpto customize the proxy phases.
This means the server has no special concept of “proxy” at the root. Proxying is one application shape hosted by the generic service container.
Per-Service Runtime
Each service gets its own runtime/threadpool. Pingora can use Tokio’s normal multi-threaded work-stealing runtime or a “no steal” runtime built from multiple single-threaded Tokio runtimes. The no-steal option exists because work stealing has overhead, while isolated current-thread runtimes can still use multiple cores.
The important design lesson is not the exact runtime. capOS cannot inherit a Tokio process model directly. The lesson is that runtime policy is a service container concern, not application business logic.
Phase-Oriented Proxy Logic
Pingora’s ProxyHttp trait exposes an ordered lifecycle for a proxied request:
- initialize per-request context,
- run early and normal request filters,
- decide whether to serve from cache or go upstream,
- select an upstream peer,
- handle connect success or failure,
- modify the upstream request,
- process request body chunks,
- process upstream response headers, body chunks, and trailers,
- process downstream response headers, body chunks, and trailers,
- decide retry/failover behavior,
- report final logging and summaries.
Most filters are optional. A per-request CTX object is created for each
request and is passed mutably through the phases. Shared state across requests
is ordinary Rust shared state such as Arc, atomics, or locks.
The ergonomics are strong because the framework gives engineers a lifecycle map. Application code overrides the phase where it has policy, while the framework owns parsing, connection setup, pooling, retries, duplex body forwarding, common error response handling, and resource cleanup.
Connection Pooling and Peer Identity
Pingora pools upstream connections automatically after successful requests, but
only reuses a connection for the exact same Peer. Its peer identity includes
address, scheme, SNI, client certificate, certificate verification behavior,
hostname verification, alternate common name, and proxy settings.
The security lesson is broad: resource reuse must be keyed by all attributes that affect authority, identity, confidentiality, and protocol semantics. A connection pool keyed only by address is wrong for a multi-tenant service.
Failure and Retry
Pingora separates connect failure from post-connect proxy failure. It lets application code mark errors retryable, and it documents the idempotency boundary: retrying after the request was sent is not generally safe for non-idempotent methods.
Its common error type carries an error type, source, retry status, optional cause, and context. That mirrors capOS’s existing split between transport errors and typed application exceptions, but Pingora puts more emphasis on whether a high-level operation can be retried.
Operations Are Part of the Framework
Pingora treats startup, daemonization, graceful termination, graceful upgrade, configuration, error logging, Prometheus metrics, readiness, and service dependencies as framework-level concerns.
The zero-downtime upgrade path transfers listening sockets from an old process to a new one and lets existing requests drain during a grace period. That is a specific Linux mechanism, but the higher-level idea maps to capOS live upgrade: stable acceptor or endpoint authority should be retargetable without dropping new work, and old in-flight calls should be allowed to drain when policy says they can.
Philosophy
Pingora’s philosophy is pragmatic, not minimalist:
- Build a framework, not a monolithic product.
- Own the hot-path mechanics so users do not reimplement them incorrectly.
- Expose typed hooks at lifecycle points where policy naturally belongs.
- Keep common operational behavior in the container rather than every service.
- Prefer static typing and memory safety for extensibility.
- Share reusable resources across workers when the safety boundary allows it.
- Give application code enough control to handle product-specific edge cases.
There is tension in that philosophy. Pingora’s permissive, Internet-facing HTTP goals require supporting odd traffic and complex reuse rules. That flexibility can create security hazards if defaults are too generous or protocol state is not exhausted before reuse.
The 2025 and 2026 Pingora request-smuggling advisories are directly relevant to capOS design. The lesson is not that Pingora is unsafe. The lesson is that high-level frameworks become security-critical because they decide defaults, message framing, cache keys, retry rules, and reuse conditions for their users. capOS libraries should treat those defaults as part of the trusted interface.
Mapping to capOS
What capOS Should Adopt
1. A userspace service framework layer.
capOS already has a low-level transport owner in capos-rt. The next layer
should be an opinionated service framework that runs on top of typed capability
clients and endpoint server helpers. It should not replace capos-rt; it
should use it.
Candidate shape:
- service lifecycle: init, ready, run, shutdown, drain;
- dependency waiting: typed readiness handles, not global service names;
- endpoint serving: generated or handwritten RECV/RETURN loops;
- background tasks: timers, discovery, health checks, metrics export;
- graceful handoff: transfer or retarget listener/endpoint authority;
- structured observability: request summaries, metrics, error suppression policy, and panic boundaries;
- resource accounting: explicit budgets or donated resources for sessions.
2. Phase-oriented domain libraries.
Pingora-style phases fit domains with real lifecycles:
- HTTP proxy and fetch service: request filter, route, connect, upstream request, response, body chunks, logging, failover.
- Terminal host: accept transport, negotiate transport options, authenticate session, spawn shell, proxy terminal I/O, log, cleanup.
- Storage service: authorize operation, resolve object, choose cache path, perform read/write, commit, audit.
- Agent service: authenticate caller, bind tool authority, plan invocation, stream outputs, log decision context.
The phase names should be domain-specific. A generic OS-wide phase machine would become vague and hard to secure.
3. Per-request context objects.
Pingora’s CTX model is a good fit for capOS service libraries. Each request
or session should have an owned context object dropped at the end of the
lifecycle. That context should carry derived policy decisions, peer identity,
timing, resource reservations, and transfer state.
This is cleaner than hidden globals and safer than asking later phases to reparse the original request.
4. Resource reuse keyed by authority identity.
Future capOS HTTP/TLS/TCP services should reuse expensive resources, but the pool key must include all security-relevant identity:
- target address and protocol;
- TLS SNI, ALPN, certificate policy, and client certificate;
- authority cap identity or object epoch;
- caller/session identity if it affects policy;
- cache namespace or tenant;
- request transformation policy when it changes what upstream sees.
This is the capOS analogue of Pingora’s strict Peer equality.
5. Operational lifecycle as an API.
The service framework should make readiness, graceful shutdown, and upgrade handoff explicit. That connects to capOS’s future live-upgrade proposal and avoids baking operational behavior into ad hoc service code.
6. Retry semantics as typed policy.
High-level clients should surface retry decisions only where the domain can
state idempotency and replay safety. For example, HttpEndpoint.get() can
have different retry policy than HttpEndpoint.post(), and a storage write
should not be retried unless the interface defines idempotent operation IDs.
What capOS Should Reject
1. Do not make Pingora’s phases kernel concepts.
The kernel should continue to dispatch narrow CapObject methods over the
ring. It should not know about request filters, upstream peers, retries,
logging phases, or protocol-specific context. Those belong in userspace.
2. Do not add a generic service/plugin capability.
A generic Service.call(phase_id, bytes) or callback registry would weaken
capOS’s central design bet: the typed interface is the permission. Use a
domain-specific Cap’n Proto interface for authority and a domain-specific
library for ergonomics.
3. Do not inherit Pingora’s process model.
Pingora is one unprivileged Linux process hosting multiple services with per-service runtimes. capOS’s isolation model is many processes with explicit capability grants. Service libraries may internally multiplex tasks, but authority boundaries should remain process and capability boundaries.
4. Do not use globals as authority.
Pingora’s ordinary Rust shared-state model is reasonable inside one trusted process. In capOS, cross-service authority must flow through capabilities, not statics, process-wide registries, or global service discovery.
5. Do not ship permissive defaults where explicit policy is needed.
Pingora 0.8.0 removed an insecure cache-key default and hardened HTTP framing. capOS should take this as a rule: cache keys, tenant identity, message framing, body drain behavior, reuse policy, and transfer semantics must be explicit or fail closed.
Concrete capOS Direction
The right decomposition is:
schema/capos.capnp
Stable authority-bearing interfaces.
Keep small and domain-specific.
capos-rt
Raw runtime and transport:
CapSet, ring, typed handles, release, result caps, exceptions.
capos-service
Generic userspace service container:
lifecycle, endpoint loops, readiness, shutdown, background tasks,
metrics, request context, resource budgeting.
domain libraries
Pingora-like phase APIs where they make sense:
HTTP/fetch, terminal host, storage, supervisor, agent tools.
init/supervisors
Compose services by passing capabilities, not by global names.
The first useful application is not the current runtime/Go milestone. The nearest capOS milestone where this should shape implementation is networking Phase B and the Telnet Shell Demo:
- Keep
NetworkManager,TcpListener,TcpSocket, andTerminalSessionas narrow capability interfaces. - Build the Telnet gateway as a userspace service that uses a lifecycle
helper: accept connection, negotiate Telnet, create a socket-backed
TerminalSession, spawn shell with exact grants, proxy until exit, log, cleanup. - Later, build
FetchandHttpEndpointservices with a Pingora-inspired HTTP lifecycle library rather than exposing raw socket authority to apps.
The first concrete proposal should therefore target terminal/networking lifecycle, not HTTP. This is now tracked in capos-service. A useful slice is:
TerminalSessionFromByteStream/ byte-stream terminal host;- lifecycle wrapper around accept, session minting, proxying, and cleanup;
- metrics plus request/session context hooks;
- network service container;
- HTTP/fetch services only after the terminal/networking lifecycle proves the authority and cleanup model.
For generated clients, the Pingora lesson argues for generated or handwritten thin wrappers, not raw Cap’n Proto calls everywhere. The wrapper owns:
- parameter encoding and result decoding;
- typed application exceptions;
- retry classification if the interface defines it;
- result-cap adoption;
- request summary and metrics hooks.
Risks and Review Rules
Any Pingora-inspired capOS framework should be reviewed against these rules:
- Extension hooks must receive the narrowest capabilities needed for that phase. Do not hand a broad service object to every hook by convenience.
- Request context must be lifecycle-owned and dropped deterministically.
- Pool keys must include all authority and identity fields that affect reuse.
- Retry policy must be explicit about whether upstream side effects may have happened.
- Cache-key construction must have no insecure default for multi-tenant data.
- Protocol parsers must drain or close before reusing a stream.
- Background tasks must be budgeted and cancellable during service shutdown.
- Readiness must mean the exported capability is actually ready to serve, not merely that the process started.
- Generated high-level wrappers must preserve the transport/application error split already documented in the capability ring and userspace runtime docs.
Recommendation
Use Pingora as precedent for a capability-native service framework: library-first, typed, phase-oriented, operationally aware, and opinionated about common mechanics.
Do not use Pingora as precedent for broad kernel interfaces, ambient service discovery, global registries, generic plugin phases, or permissive defaults. The capOS version should make authority narrower than Pingora does, because capOS has a stronger capability model available at every boundary.
Research: Game Mechanics Prior Art
This note records the external game-mechanics grounding used for Aurelian
Frontier planning. It exists because the original planning commit
79a9afc translated external mechanics references into
capability-shaped Aurelian tasks, but did not leave a standalone research note.
The recorded planning rationale for that commit used Stardew Valley,
EVE Online, Evil Islands, PixiJS, and Tiled references, with an explicit
instruction not to clone those games. This note covers the game systems used by
Aurelian: Stardew Valley, EVE Online, and Evil Islands.
Source Snapshot
Checked on 2026-04-29:
- Stardew Valley Wiki, Seasons: https://stardewvalleywiki.com/Seasons
- Stardew Valley Wiki, Crops: https://stardewvalleywiki.com/Crops
- Stardew Valley Wiki, Festivals: https://stardewvalleywiki.com/Festivals
- Stardew Valley Wiki, Friendship: https://stardewvalleywiki.com/Friendship
- EVE Online support, Buy and Sell Orders: https://support.eveonline.com/hc/en-us/articles/203218932-Buy-and-Sell-Orders
- EVE Academy, Basic Industrial Production: https://www.eveonline.com/news/view/eve-academy-basic-industrial-production
- Evil Islands official FAQ mirror: https://evil-islands.bgforge.net/usa/faq.html
- Nival, Evil Islands: https://nival.com/games/pc-games/evil-islands
- Ars Technica, Evil Islands review: https://archive.arstechnica.com/reviews/01q2/evilislands/evilislands-2.html
- PlayItHardcore, Evil Islands combat and survival: https://pihwiki.bgforge.net/Evil_Islands%3A_Combat_and_Survival
The Stardew Valley Wiki and EVE Online support/academy pages are treated as the primary grounding for their systems. For Evil Islands, the official FAQ mirror and Nival page ground construction and broad tactical identity; Ars Technica and PlayItHardcore are secondary sources for combat details.
Planning Audit
Commit 79a9afc records the durable planning outcome: external mechanics are
inputs to capOS-shaped tasks, not clone targets. The planning context named
these stable mechanics:
- Stardew Valley: seasons, festivals, schedules, gifts, and affection.
- EVE Online: brokered markets and blueprint/material/facility industry.
- Evil Islands: material/level/gold/equipment construction and limited enchantment.
- PixiJS/Tiled: later browser-client rendering, outside this note’s mechanics scope.
The commit body for 79a9afc says the patch translates external mechanics into
capOS-shaped tasks for seasonal calendars, regional settlements/outposts,
service-mediated order books, blueprint/artifact construction, token-budgeted
agent NPCs, and a 2D tilemap browser client. This research note keeps that
translation explicit and auditable.
Stardew Valley
Stable mechanics to borrow:
- Seasons create calendar pressure. Stardew Valley uses four seasons of 28 days; routines, festivals, visuals, and available resources can vary by season.
- Resource availability is table-driven. Crops, forage, fish, and shop selections are season-sensitive.
- Season boundaries matter. Ordinary crops wither at season change unless they are explicitly multi-season.
- Festivals are scheduled events that can alter access, activities, prizes, shops, dialogue, and social opportunities.
- Relationships are explicit profile facts. Talking, gifts, missed interaction, and events affect a visible relationship meter rather than being pure flavor.
Aurelian translation:
- Model
AdventureCalendaras explicit service-owned state, with fixed-smoke calendar values for deterministic QEMU proof and separate production seeds later. - Keep seasonal resources bounded and generated from content: crops, forage, fish, shop stock, route hazards, and outpost production.
- Make multi-season resources explicit in content validation.
- Treat festivals and military events as scheduled overlays that affect actor presence, witness availability, shop stock, route risk, quests, and debrief choices.
- Store gifts, favors, affection, faction standing, and event participation as profile or ledger facts owned by game services, not client-local counters.
Do not borrow:
- Unbounded daily chores as the core loop. Aurelian is an expedition and authority game; calendar pressure should sharpen mission choices, not become farm maintenance.
- Client-owned social counters. Social state must remain authoritative and auditable.
EVE Online
Stable mechanics to borrow:
- Markets are brokered. Buy and sell requests go through an order-matching system rather than choosing a specific counterparty directly.
- Market eligibility matters. Some assets can be traded through market orders; others require contracts, custody, corporation roles, or special flows.
- Matching is deterministic and rule-driven. Orders match immediately when compatible prices and ranges cross; otherwise they remain listed.
- Industry is blueprint and job based. Manufacturing uses blueprints, materials, job types, time, and facility or slot constraints.
- Production decisions have location and logistics consequences.
Aurelian translation:
- Keep the current actor-local market verbs as the proof slice, then evolve
toward a
MarketServiceor equivalent service-owned order book. - Define market-eligible item classes. Ordinary stackable supplies can use buy/sell orders; relics, writs, witness-certified custody, and dangerous artifacts move through explicit custody or contract-style protocols.
- Implement order books with side, item, quantity, price, location/range, expiry, fees, idempotency keys, and ordered ledger receipts.
- Route multi-owner exchange through reserve/escrow, commit/release, stale-version rejection, cancellation, retry, and crash recovery.
- Use blueprint jobs for construction: inputs, facility, duration, authority gates, output bounds, and service-owned job state.
Do not borrow:
- A fully player-driven MMO economy as the first target. Aurelian needs a small authoritative regional economy that proves capability boundaries before it needs market depth.
- Market transfer for every object. Authority-bearing objects should stay outside generic order books unless a later design proves the custody model.
Evil Islands
Stable mechanics to borrow for construction:
- Equipment construction combines a design/blueprint with material choices and money; unavailable material can be bought as part of assembly cost.
- Material class and quality affect item properties. Materials carry distinct weight, durability, energy/complexity, damage, armor, and vulnerability characteristics.
- Constructed or repaired items can be inspected before committing the job.
- Enchantment is constrained by object capacity and spell complexity; equipment can carry limited spell effects instead of arbitrary modifiers.
Stable mechanics to borrow for combat:
- Damage type matters. Slashing, piercing, and crushing style differences make weapon choice meaningful against different enemy defenses.
- Body-part targeting adds tactical texture. Head, hands/arms, and legs have distinct consequences such as critical risk, attack/casting slowdown, and reduced pursuit.
- Sight and scouting shape fight selection. Long vision and stealth let the player choose engagements instead of charging every hostile.
- Pulling and alert behavior are tactical risks. Enemies can notify related enemies, so careless engagement can turn one fight into many.
- Cast time is a combat risk. Offensive magic can be punished if the player casts while exposed.
Aurelian translation:
- Put construction inputs in generated content: blueprint, required materials, facility class, cost, duration, rank/star/circle gates, output bounds, artifact authority, and enchantment slots.
- Derive bounded item properties from blueprint, material, facility quality, paid cost, and player competence. Avoid unbounded loot rolling.
- Use a small deterministic target-zone set for combat:
head,hands,legs, andcore. - Add damage/mitigation metadata for weapon type, spell type, zone armor, ward state, and inspected knowledge.
- Make scouting and inspection upgrade enemy information from rough threat to zone armor, ward state, intent, and likely counters.
- Support stealth openings and pull/alert behavior as explicit service-owned state transitions with readable causality in transcript output.
- Add fatigue and cast interruption as explicit costs. Retreat should be hard when a mob blocks it, but failures must be legible and deterministic.
Do not borrow:
- Hidden real-time randomness or reaction-speed demands. Aurelian’s proof path remains command-level and deterministic.
- Punitive infinite-monster-fatigue behavior. Monsters can pressure retreat, but the service should name the reason and keep rules fair enough to test.
- Gore or locational damage presentation as spectacle. Body zones exist for tactical outcomes and readable state.
Cross-Game Design Rules For Aurelian
- External mechanics are planning inputs, not clone targets.
- Every durable fact that matters to public world state belongs in an authoritative service: calendar, market, construction jobs, profile progression, social standing, custody, and receipts.
- Every user-visible refusal should name the missing gate: authority, location, rank, resource, stale version, custody, fatigue, target state, or policy.
- Use pure Rust tests for deterministic rules: calendar rollover, seasonal availability, market matching, construction validation, property derivation, target-zone damage, fatigue, and alert propagation.
- Use a real capOS userspace test process for cross-service scenarios: expedition flow, custody, market reserve/commit/release, construction jobs, and party transfer.
- Keep the shell transcript as a low-dependency smoke proof and command-parser proof. It should not be the only test for complex gameplay state machines.
Small Open-Weights LLM Survey for the capOS Agent-Shell
Research notes on current (early 2026) open-weights language models in the
2-4 B active-parameter range, their suitability for the capability-served
planner described in docs/proposals/llm-and-agent-proposal.md, and a rough
compute-cost estimate for training a comparable model from scratch.
Primary sources: OpenRouter model catalog (https://openrouter.ai/api/v1/models,
353 models listed at survey time); empirical probe against OpenRouter’s
hosted endpoints using an agent-planner prompt; published training reports
(Llama 3 tech report, Gemma 2 tech report, Qwen3 model cards, MosaicML MPT
blog posts); Chinchilla scaling law (Hoffmann et al., 2022).
1. Candidate Landscape
Two families of candidates match “2-4 B active parameters”:
- Dense 2-4 B: inference FLOPs and memory footprint both scale with total parameters. Friendly to low-RAM hosts.
- MoE with 2-4 B active: inference FLOPs scale with active params, but total weights must be resident. Only viable on hosts with enough RAM to page-cache the full expert stack.
Dense contenders observed as of 2026-04-24:
| Model | Params | License | Context | Notes |
|---|---|---|---|---|
| Qwen3-4B-Instruct | 4 B | Apache-2.0 | 32 K | Strong tool-use post-training |
| Qwen3-1.7B-Instruct | 1.7 B | Apache-2.0 | 32 K | Same family, smaller floor |
| Gemma 3 4B IT | 4 B | Gemma license | 128 K | Multilingual; verbose outputs |
| Llama 3.2 3B Instruct | 3 B | Llama 3.2 Community | 128 K | Permissive but not OSI |
| Ministral 3B (2512) | 3 B | Mistral Research License | 128 K | Non-commercial; blocks ISO redistribution |
| Phi-4-mini | 3.8 B | MIT | 16 K | Reasoning-leaning training |
| IBM Granite 4.0 H Micro | ~3 B | Apache-2.0 | 128 K | New architecture, less battle-tested |
| SmolLM3-3B (HuggingFace) | 3 B | Apache-2.0 | 64 K | Fully open data + training code |
MoE contenders with ~3 B active:
| Model | Active | Total | License | Context | q4 weight size |
|---|---|---|---|---|---|
| Qwen3-30B-A3B-Instruct-2507 | ~3 B | 30 B | Apache-2.0 | 262 K | ~18 GiB |
| Qwen3-Coder-30B-A3B-Instruct | ~3 B | 30 B | Apache-2.0 | 160 K | ~18 GiB |
| Qwen3-Next-80B-A3B-Instruct | ~3 B | 80 B | Apache-2.0 | 262 K | ~48 GiB |
| Qwen3.5-35B-A3B | ~3 B | 35 B | Apache-2.0 | 262 K | ~21 GiB |
| IBM Granite 4.0 Tiny (7B-A1B) | ~1 B | 7 B | Apache-2.0 | 128 K | ~4 GiB |
2. Empirical Probe
Prompt
Agent-planner system prompt: “You are a capOS shell planner. Given a goal
and typed tool descriptors (name + param schema), emit a single JSON
ActionPlan: {"steps":[{"tool":..,"args":..,"rationale":..}]}. Never
invoke tools. Only reference tools from the descriptor list. Output JSON
only, no prose.”
User prompt: three typed tool descriptors (ServiceSupervisor.restart,
NetworkStack.info, LogReader.tail) and the goal “Restart the network
stack, but first confirm it’s in a failed state by checking status and
last 20 log lines.”
The test exercises three properties a capOS planner needs:
- Correct step ordering (
info+tailbeforerestart). - Correct arg packing for methods with and without arguments.
- Pure JSON output without Markdown fences, which the dispatcher must otherwise strip.
Results
| Model | JSON valid | Order correct | Fences | Arg shape |
|---|---|---|---|---|
| Qwen3-30B-A3B-Instruct-2507 | yes | yes | none | compact, correct |
| Qwen3-Next-80B-A3B-Instruct | yes | yes | none | correct, verbose |
| Qwen3.5-35B-A3B | yes | yes | none | correct |
| Qwen3-8B (proxy for Qwen3-4B) | yes | yes | none | correct |
| Gemma 3 4B IT | yes | yes | ```json fence | fabricated empty status:"" arg on zero-arg call |
| Ministral 3B (2512) | yes | yes | ```json fence | correct |
| Llama 3.2 3B Instruct | yes | no (restart before log check) | ``` fence | correct |
| IBM Granite 4.0 H Micro | no (three duplicate steps keys in one object) | — | none | — |
Qwen3-8B was used as a stand-in for Qwen3-4B because Qwen3-4B is not served on OpenRouter; Qwen3 family models below 8 B share the same post-training recipe, so output quality for structured agent tasks should be comparable with minor degradation at 4 B and more noticeable degradation at 1.7 B.
Interpretation
- Qwen3-A3B family produces the tightest, correctly-ordered plans with no markdown fencing. Best quality-per-active-parameter in the sample.
- Dense 3-4 B Qwen / Gemma / Ministral produce correct plans but add Markdown fences or small schema drift that the dispatcher must tolerate.
- Llama 3.2 3B violated the ordering constraint – planner-unsafe without additional prompt discipline or rejection sampling.
- Granite 4.0 H Micro emitted invalid JSON (duplicate object keys). Retest before adopting; may be endpoint-specific rather than the model.
3. Size Thresholds for capOS Use Cases
Mapping observed behaviour to the proposal’s workloads:
| Workload | Minimum credible size | Notes |
|---|---|---|
| NPC dialogue, canned-reply replacement | 1.7 B dense | Templated plans only; refusal fragile |
| Short-list planner (≤5 typed tools) | 3 B dense | Floor for credible multi-step ordering |
| Long-list planner, plan refine, step-up reasoning | 4 B dense or 30B-A3B | Refusal, self-critique, schema-strict JSON |
| Log / audit summarisation, NPC with context | 4 B dense or 30B-A3B | Needs retrieval grounding regardless |
Embedding / vector retrieval (TextEmbedder) | separate small encoder | Not a generator workload |
Proposal §“Built-in Local Model” sketches a 0.7-2.0 GiB weight budget (q4
class). Qwen3-4B at q4_k_m is ~2.4 GiB, narrowly over that budget.
Resolutions:
- Bump the default budget to ~2.5 GiB and ship Qwen3-4B-Instruct.
- Keep the 2 GiB budget and ship Qwen3-1.7B or SmolLM3-3B (at
q5_k_m, ~2.0 GiB), acknowledging weaker planner quality. - Ship Qwen3-1.7B as default and allow
ModelAdmin.loadWeightsto install Qwen3-4B or a 30B-A3B model post-install.
4. Recommendation for the Proposal
-
Default built-in (ISO): Qwen3-4B-Instruct at
q4_k_m, Apache-2.0. Raise the weight-budget line in the proposal from 2.0 GiB to ~2.5 GiB. Fallback to SmolLM3-3B if a fully-open training-data provenance is required for the trusted-build-inputs chain. -
Optional installed upgrade: Qwen3-30B-A3B-Instruct-2507 for hosts with >=24 GiB RAM. Same ~3 B active compute as a 3 B dense, materially better planning quality.
-
Reject for default ship:
- Ministral 3B (Mistral Research License – cannot redistribute on ISO).
- Llama 3.2 3B (failed ordering discipline in the probe; Llama 3.2 Community License also restricts downstream use).
- IBM Granite 4.0 H Micro until the JSON-output issue is confirmed or refuted on a local run.
-
Update Open Question 3 of the proposal (“smallest credible local model”) with the threshold: 3 B dense is the floor for a planner that can be trusted with ordering constraints; 1.7 B is restricted to NPC / canned-reply territory.
5. Training Compute Cost for a Custom 2-B-Active Model
Rough order-of-magnitude estimate, on the chance that the project considers a purpose-trained capOS planner model rather than a fine-tune.
5.1 FLOPs Budget
Forward+backward training compute approximates 6 x N_active x D_tokens.
Modern open models have drifted far past Chinchilla’s 20-tokens-per-param
ratio; 5k-15k tokens per param is typical.
| Target | Active | Tokens | FLOPs |
|---|---|---|---|
| Chinchilla-minimum 2 B dense | 2 B | 40 B | 4.8e20 |
| Llama-3-ish 2 B dense | 2 B | 15 T | 1.8e23 |
| Qwen3-4B-ish 2 B dense | 2 B | 36 T | 4.3e23 |
| 30B-A3B MoE (3 B active, 15 T tok) | 3 B | 15 T | ~4e23 (+ ~1.5x router/aux overhead) |
5.2 Hardware -> Dollars
Reference: H100 SXM at ~40% MFU ~= 1.4e18 FLOPs / hour; cloud price $2-3 / hr (spot) to $3-4 (on-demand).
| Scale | H100-hours | USD (raw compute) | Wall-clock on 1024 H100 |
|---|---|---|---|
| Chinchilla 2 B (toy) | ~350 | ~$1 k | <1 hr |
| 2 B @ 15 T tok | ~130 k | ~$400 k | ~5 days |
| 2 B @ 36 T tok (SotA match) | ~310 k | ~$900 k | ~12 days |
| 30B-A3B @ 15 T tok | ~290 k | ~$870 k | ~12 days |
5.3 Public Calibration
- Llama 3 8 B: Meta reports ~1.3 M H100-hours ~= $4 M raw.
- Llama 3 70 B: ~6.4 M H100-hours ~= $19 M raw.
- Gemma 2 2 B (~2 T tok, older recipe): <$500 k compute.
- MosaicML MPT-7B (2023, ~1 T tok, A100-class): ~$200 k.
The 6ND estimate agrees with these published runs within a factor of ~2, which is appropriate for an order-of-magnitude planning number.
5.4 Full-Project Multiplier
Final training run is typically 20-30% of total project compute. Realistic end-to-end budget:
- Ablations, restarts, hyperparameter sweeps: 3-5x raw training compute.
- Post-training (SFT + DPO / RLHF / RLVR): +5-15% of pretrain.
- Data pipeline (crawl, clean, dedupe, licensing): can equal or exceed compute cost; tokenizer corpus curation is non-trivial.
- Engineering headcount: 3-8 ML engineers for 6-12 months dominates TCO.
Realistic end-to-end to ship a capOS-class 2 B model from scratch: $3-10 M plus a team. A 30B-A3B MoE adds ~50%.
6. Practical Alternative
Training from scratch is almost certainly not worth it for the agent-shell use case. Two much cheaper paths that achieve the same capOS-specific behaviour:
-
SFT / LoRA on Qwen3-4B or SmolLM3-3B for the capOS
ActionPlanJSON schema, tool descriptors, and refusal patterns. ~10 k-100 k curated examples, 8 x H100 for 1-10 days ~= $500-$10 k. Reproducible on commodity cloud. -
Continued pretraining on a capOS-specific corpus (manifests, schemas, logs, proposals) if the base lacks domain coverage. Single digits of B tokens, $10 k-$100 k.
The only strong reason to train from scratch would be a fully verifiable
weight provenance chain tied to docs/trusted-build-inputs.md. Even then,
a reproducible fine-tune of a known base with a signed recipe captures
most of the benefit at ~1% of the cost.
6a. nanoGPT / nanochat Scale Reference
Karpathy’s nanoGPT repo reproduces GPT-2 small (124 M params: 12 layers,
768 hidden, 12 heads) as its headline config. Karpathy’s follow-up
nanochat (github.com/karpathy/nanochat) ships a full pretrain + SFT
pipeline and uses model depth (d) as the size dial rather than
parameter count. The README is the only authoritative source; the numbers
below are quoted from it, not extrapolated.
- d12 – “GPT-1 sized”, ~5 min pretraining for quick experiments.
- d20 – documented speedrun tier: “$48 (~2 hours of 8xH100 GPU node)”, ~$15 on spot instance, “well below $100”. This is the headline reproducibility tier.
- d24 – appears on the leaderboard as a “slightly overtrained baseline.”
- d26 – “GPT-2 capability happens to be approximately depth 26”; latest leaderboard entry hits GPT-2 CORE metric (0.256525) in ~1.65 hr on 8xH100. Original 2019 GPT-2 training cost is cited as ~$43 k for comparison.
The README does not publish explicit parameter counts per depth; the mapping from depth to params requires inspecting the config code.
Capability mapping to the capOS planner task (empirical, based on same-size published models rather than nanochat runs themselves):
| nanochat scale | Rough param band | Planner capability |
|---|---|---|
| d12 | GPT-1-class, ~50-100 M | Toy completion only, no planner |
| d20 | likely ~100-200 M band | Templated NPC lines; not a planner |
| d26 | GPT-2-class, ~100-400 M band | Simple JSON under strict priming; schema drift common |
| Hypothetical d30+ | unclear (not in README) | Plausibly approaches 1 B territory (SmolLM3-1B / Qwen3-1.7B / Llama 3.2 1B); still below the 3 B dense floor from the probe in section 2 |
Training a nanochat-class model from scratch fits a research-OS budget in a way the numbers in section 5 do not: d20 is ~$48 on-demand and d26 is single-digit hours on 8xH100. That is the only scale at which “capOS ships a weight-provenance-complete default planner” is financially plausible without multi-million-dollar compute.
7. Open Follow-Ups
- Verify Granite 4.0 H Micro JSON behaviour on a local
llama.cpprun rather than the OpenRouter endpoint; the probe may have hit a streaming / formatting quirk specific to the provider. - Measure
q4_k_mtokens-per-second for Qwen3-4B and Qwen3-1.7B on the CPU targets capOS cares about (x86_64 desktop, cloud VM, aarch64 SBC). No numbers are captured here; required before committing to a default. - Evaluate an embedding model separately (
bge-m3,nomic-embed,gte-modernbert) for theTextEmbeddercapability. Out of scope for this survey. - Revisit in 6 months: the 2-4 B frontier is moving monthly as of early 2026, and “best open weight” today may be superseded before the proposal’s Phase 2 begins.
- nanochat d30+ quality and pricing. The README documents tiers up
to d26 (GPT-2 capability, ~1.65-3 hr, <$100 on 8xH100). No published
numbers exist for d30 or beyond. Open questions, before committing to
an in-tree from-scratch provenance model:
- What is the parameter count for d30 (and d28, d32)? Derive from the nanochat config code, not inferred.
- What training time and cost does d30 require to reach a non-trivial SFT-able checkpoint on the same 8xH100 setup? Expected band is roughly 2-4x the d26 run (so ~6-12 hr, ~$150-300 on-demand), but this needs measurement – depth scaling of wall-clock is not linear once the model stops fitting comfortably in per-GPU memory.
- Does a d30-scale nanochat + capOS-specific SFT approach the Qwen3-1.7B planner floor on the section-2 probe? If yes, a provenance-complete default planner becomes realistic for ~$500-$5 k per full run (pretrain + SFT + ablations). If no, provenance has to be bought by fine-tuning a larger external base (Qwen3-1.7B or SmolLM3-3B) and accepting the weaker provenance story.
- Tokenizer choice for any capOS from-scratch or continue-pretrain
path. Independent of model scale or architecture. A capOS-specific
tokenizer with reserved tokens for
ActionPlanJSON structure, Cap’n Proto type IDs, capability interface names, and common schema keywords is plausible at the nanochat-class budget and may materially reduce tokens-per-plan and schema-drift error rate vs. reusing GPT-2 BPE or a generic SentencePiece. For a fine-tune of Qwen3 / SmolLM3 the tokenizer is fixed by the base and this question collapses to “what special tokens can be added without retraining embeddings.”
Research: Hosted Agent Harnesses
Survey of current agent harness, swarm, memory, and interoperability patterns for capOS-Hosted Agent Swarms. The design question is how capOS should host OpenClaw-like personal agents without copying the ambient-host authority model common in desktop tools.
Source Snapshot
Checked on 2026-04-28:
- OpenAI, Harness engineering: https://openai.com/index/harness-engineering/
- OpenAI, Agents SDK harness/sandbox update: https://openai.com/index/the-next-evolution-of-the-agents-sdk/
- OpenClaw docs: https://openclawlab.com/en/, https://openclawlab.com/en/docs/concepts/agent/, https://openclawlab.com/en/docs/concepts/agent-workspace/, https://openclawlab.com/en/docs/concepts/memory/, https://openclawlab.com/en/docs/tools/exec/, https://openclawlab.com/en/docs/tools/browser/, https://openclawlab.com/en/docs/concepts/multi-agent/
- DeepWiki secondary summaries: https://deepwiki.com/openclaw/openclaw, https://deepwiki.com/openclaw/skills/2.2-agent-memory-persistence-pattern, https://deepwiki.com/openclaw/docs/6.3-web-search-and-browser-tools, https://deepwiki.com/FoundationAgents/OpenManus, https://deepwiki.com/microsoft/agent-framework, https://deepwiki.com/microsoft/ai-agents-for-beginners/3.1-autogen-framework
- Karpathy, LLM Wiki: https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f
- Abdullin, Schema-Guided Reasoning: https://abdullin.com/schema-guided-reasoning/
- MetaGPT: https://arxiv.org/abs/2308.00352
- Generative Agents: https://arxiv.org/abs/2304.03442
- Gas Town docs: https://docs.gastownhall.ai/, https://docs.gastownhall.ai/usage/
- Model Context Protocol: https://modelcontextprotocol.io/docs/getting-started/intro, https://modelcontextprotocol.io/docs/learn/architecture
- Agent2Agent Protocol: https://github.com/a2aproject/A2A, https://a2a-protocol.org/latest/specification/
- Microsoft AutoGen and Microsoft Agent Framework: https://www.microsoft.com/en-us/research/project/autogen/overview/, https://learn.microsoft.com/en-us/agent-framework/overview/
- LangGraph durable execution: https://docs.langchain.com/oss/python/langgraph/durable-execution
- CrewAI: https://docs.crewai.com/
- CAMEL-AI: https://docs.camel-ai.org/get_started/introduction
DeepWiki was accessible for the related projects above. It is useful as a code-linked summary layer, but this note treats it as secondary to primary project docs and papers.
Design Consequences For capOS
- Treat the harness as the product surface: workspace, memory, tool descriptors, approval, cancellation, audit, and task state matter as much as model choice.
- Do not treat an agent workspace as a sandbox. In capOS, workspace boundaries should be enforced by capabilities, not by cwd conventions.
- Keep the model out of the authority path. The model proposes structured tool calls; a trusted runner validates and invokes caps.
- Use a persistent artifact model for agent knowledge. Raw sources, wiki pages, indexes, logs, and search indexes should be explicit, versioned, label-aware data, not hidden prompt history.
- Borrow swarm patterns cautiously. Roles, review gates, and durable tasks are useful; anthropomorphic role names and unconstrained peer delegation are not.
- Treat MCP and A2A as adapter protocols. They can carry descriptors, messages, and artifacts, not raw capOS authority.
- Prefer deterministic harness proofs first: fake model, fake browser, fake mutating tool, explicit approval, and auditable transcript.
OpenAI Harness Engineering
OpenAI’s Harness engineering article frames the key operational lesson: agents can only reason over context they can inspect. Repo-local files, schemas, tests, executable plans, and mechanically enforced architecture are therefore stronger harness material than knowledge left in chat, external docs, or tacit human convention.
The 2026 Agents SDK update moves in the same direction: a model-native harness, controlled workspaces, sandbox execution, filesystem tools, MCP, skills, custom instructions, shell execution, and patch tools. The important point for capOS is not the Python API. It is the shape: agents need a runtime that makes inspection, action, state, and safety explicit.
capOS implication: proposals, research notes, schemas, CUE manifests, QEMU proofs, and workplan files are not just documentation. They are harness inputs. They should be versioned, concise, indexed, and mechanically checked where possible.
Applying Harness Engineering To This Repository
The capOS repository is already partially harness-engineered: it has
AGENTS.md, CLAUDE.md, docs/tasks/README.md, review-finding task records,
proposal and research indexes, CUE manifests, named Make targets, QEMU
harnesses, and generated-code checks. The missing work is making those artifacts more
agent-legible, mechanically navigable, and resistant to stale planning state.
The normative repository plan is
capOS Repository Harness Engineering.
The checklist below is retained as research-derived input, not as a separate
planning baseline.
Concrete work needed:
-
Create a repo harness map. Add a concise
docs/agent-harness.mdthat tells future agents where to find current state, design authority, task selection rules, QEMU proofs, generated-code rules, security review rules, and known stale/superseded documents. It should link, not duplicate,CLAUDE.md,docs/tasks/README.md,REVIEW.md, roadmap, backlog, proposals, research, and review-finding task records. -
Make task selection queryable.
docs/tasks/README.mdis human-readable but not easy to query mechanically. Add stable anchors or a small structured sidecar for selected milestone, immediate gates, active branches/worktrees, paused branches, and blocked findings. The sidecar can be generated fromdocs/tasks/README.mdlater; the first step is stable headings and consistent checkbox syntax. -
Add a design-status linter. Check that proposal status, proposal index, topics,
docs/SUMMARY.md,docs/tasks/README.mdpointers, and superseded markers agree. The repo already has mdBook metadata tooling; extend it so stale status drift becomes a failed check. -
Add a harness inventory for run targets. Generate or maintain a table of
make run-*andmake qemu-*targets with purpose, manifest, expected proof output, and owning proposal/backlog. Agents should not infer which QEMU proof applies by grepping Makefile fragments. -
Standardize research notes. Require every new external-design proposal to cite a
docs/research/*.mdnote with source snapshot date, primary sources, secondary sources, design consequences, and open research gaps. This prevents proposals from becoming opaque summaries with no reusable research artifact. -
Add decision records for major pivots. The project currently records pivots in
docs/tasks/README.md, proposals, and changelog. Add short ADR-style records for high-impact direction changes such as endpoint badges to service-object capabilities to session-bound invocation context. Agents need a stable “why this changed” artifact that is not buried in a long proposal. -
Expose schema and interface intent. For each important Cap’n Proto interface, add or generate a short doc page with authority semantics, granted-by paths, threat model, QEMU proofs, and known gaps. This maps the core capOS rule “interface is permission” into agent-readable harness context.
-
Make stale document detection mechanical. Add front matter fields for
status,supersedes,superseded_by,implemented_by, andlast_reviewedwhere missing. Then check links both ways. An agent should be warned when it opens a superseded proposal without the replacement. -
Record proof transcripts as artifacts. QEMU harnesses validate behavior, but future agents often need the exact expected proof shape. Store bounded transcript excerpts or generated proof summaries under
docs/proofs/or a similar directory, with links from proposals and run-target inventory. -
Add eval tasks for agents. Create deterministic “agent can safely edit capOS” evals: find selected milestone, choose the right backlog, identify affected docs, avoid main-worktree edits, run the right check, and update status. These evals should be runnable without model calls by using scripted fixtures where possible.
-
Create a local knowledge compilation path. Use the LLM Wiki pattern for capOS itself: raw sources are proposals/research/changelog/review notes; compiled pages summarize current capability model, shell path, session model, networking status, and QEMU proofs; lint finds contradictions and stale status. This should be generated into a clearly marked
docs/agent-wiki/tree or kept out of published docs until reviewed. -
Keep checks close to docs. Every process rule that matters to agents should have either a check, a generated index, or a fixture. Free-form instructions are useful but insufficient; the harness should fail when architecture or workflow invariants drift.
Near-term implementation order:
- Add
docs/agent-harness.md. - Add run-target inventory.
- Extend mdBook metadata checks for proposal status/index drift.
- Add front matter fields for superseded/replacement relationships.
- Add the first reviewed
docs/agent-wiki/compilation for the selected milestone only.
OpenClaw Harness Controls
OpenClaw is the closest current personal-agent analogue:
- channel ingress through chat apps, webhooks, cron, and a gateway;
- a local-first gateway security boundary;
- an agent runtime with a workspace as the default tool cwd;
- bootstrap instruction/memory/persona files injected into context;
- built-in tools for read/write/edit, exec/process, browser, web, memory, and skills;
- per-agent workspace, sandbox, and tool policy;
- managed browser profiles and optional real-browser/remote-CDP routing;
- markdown memory plus search/index plugins.
Important controls:
- Exec exposes host selection (
sandbox, gateway, node), security mode (deny, allowlist, full), approval prompts, timeouts, background sessions, PTY support, and restrictions on PATH/loader environment overrides. - Browser automation uses a managed Chromium profile, snapshots, screenshots, action refs, profile routing, and CDP. Arbitrary JavaScript evaluation is explicitly risky.
- Memory stores markdown as source of truth. Search returns bounded snippets, file paths, and line ranges rather than entire memory files by default.
- Multi-agent routing can assign different workspaces, sandboxes, and tool allow/deny lists to different agents.
DeepWiki adds code-linked observations: OpenClaw treats tools as functional
capabilities and skills as SKILL.md extensions; includes a security audit
surface; supports Docker/seccomp sandboxing; and uses a personal-assistant trust
model. The OpenClaw skills summary also records the failure mode of a single
growing MEMORY.md: context overflow, compaction loss, and poor retrieval.
capOS implication: copy the harness knobs, not the host authority model. Workspace, exec, browser, memory, and skills should be separate caps with auditable grants.
Memory and Wiki Systems
Karpathy’s LLM Wiki pattern shifts memory from query-time retrieval over raw
chunks to a maintained artifact: immutable raw sources, an LLM-maintained
markdown wiki, and a schema/instruction file that defines conventions. The key
operations are ingest, query, and lint. The useful artifacts are index.md,
log.md, page cross-links, source citations, and health checks for stale or
contradictory pages.
OpenClaw memory and DeepWiki’s OpenClaw skills summary point to similar requirements:
- daily append-only logs versus curated long-term memory;
- markdown as human-inspectable source of truth;
- local indexes using SQLite, FTS/vector search, or hybrid search;
- snippets and line ranges for bounded recall;
- background distillation, pruning, and health checks;
- optional encryption or OS keychain integration for secret-adjacent memory.
capOS implication: AgentMemory should expose source, wiki, index, log, lint,
and search subcaps. Wiki pages should carry provenance and labels. Remote
embedding should be denied for high-label data.
Schema-Guided Reasoning
Abdullin’s Schema-Guided Reasoning describes using structured output schemas to force reasoning through predefined steps, produce auditable intermediate state, and validate outputs. It is especially relevant for local or weaker models.
capOS should use SGR for:
- task intake and risk classification;
- plan decomposition;
- tool-call approval summaries;
- source ingest and citation extraction;
- code/design/security review;
- final handoff and memory updates.
This is harness structure, not authority. A schema can make reasoning more testable, but the runner still enforces capabilities.
Swarm and Multi-Agent Frameworks
MetaGPT encodes standard operating procedures into multi-agent workflows. Its useful lesson is artifact gating: requirements, design, implementation, test, and review phases should produce intermediate outputs that later phases can inspect.
Generative Agents / Smallville contributes the memory-stream, reflection, and planning pattern for long-lived simulated agents. It is useful for NPCs, companion agents, and social simulations, but it is not an authority model. Believable behavior is not safe behavior.
Gas Town focuses on durable multi-agent engineering work: roles, workers, worktrees, convoys, merge queues, attribution, and handoff. Its strongest lesson is that work must survive chat-window loss and worker recycling.
AutoGen emphasizes actor-style asynchronous agent communication, distributed runtimes, tools, memory, observability, and group/team patterns. Microsoft Agent Framework adds a production framing: graph workflows, checkpointing, human-in-the-loop, durable execution, telemetry, and MCP/A2A integrations. LangGraph’s durable execution docs add a specific replay rule: side effects and non-determinism need task wrappers or idempotence so resumed workflows do not repeat external writes.
CrewAI and CAMEL-AI show the common high-level framework shape: agents, crews or societies, flows/workflows, memory/knowledge, toolkits, RAG, structured outputs, observability, and human-in-the-loop triggers.
OpenManus, summarized by DeepWiki, is a useful “general agent” reference: a think-act loop, multi-provider LLM support, MCP integration, sandboxed code and browser automation, and multiple entry points for general tasks, MCP, and data analysis.
capOS implication: implement durable AgentTask and SwarmScheduler first.
Do not start with free-form inter-agent chat as the substrate.
Interoperability Protocols
MCP provides standardized tool/resource/prompt discovery and execution over JSON-RPC, with stdio and HTTP transports. It is useful for external tool ecosystems, but it is not a capability security model by itself.
capOS should translate MCP descriptors into capOS tool descriptors and execute
through the trusted runner. Local stdio MCP servers should run with no ambient
filesystem or network authority. Remote MCP should require explicit
HttpEndpoint and credential caps.
A2A is a primary reference for peer-agent interoperability. Its project describes agent cards for discovery, negotiation of text/forms/media modalities, collaboration on long-running tasks, and operation without exposing internal state, memory, or tools. Its documented feature set includes JSON-RPC 2.0 over HTTP(S), synchronous request/response, streaming, push notifications, and exchange of text, file, and structured JSON data.
capOS should translate that into a stricter local bridge. Remote agents are
untrusted peers. Agent cards map to reviewed descriptors, not authority.
Incoming A2A messages become AgentMessage events delivered through an
AgentInbox; task ids, causal parents, size limits, expiry, and sender identity
are mandatory. Artifact references require separate caps before content is
read. Requested actions become proposed tool calls. Requested authority becomes
an approval request. Raw capOS caps should not cross an A2A bridge.
For local swarms, the same rule applies without the network protocol: agents coordinate through task records, inbox messages, resource leases, resource watches, and merge/release queues, not through free-form chat that tries to remember who is editing a repo, todo item, wiki page, or browser profile.
Research Still Missing
- Primary security advisories for OpenClaw and comparable personal-agent runtimes, especially gateway exposure, node hosts, skills, browser profiles, exec approvals, memory, and provider credentials.
- MCP security beyond the happy-path spec: tool poisoning, stdio command spawning, remote auth, marketplace signing, prompt injection, and lookalike tools.
- A2A security and identity: authentication, authorization, task provenance, artifact integrity, and non-transfer of authority.
- Browser automation containment: CDP risk, extension relays, logged-in profiles, downloads/uploads, arbitrary JS evaluation, clipboard, screenshots, SSRF/private-network policy, and deterministic testing.
- Memory correctness: citation fidelity, contradiction detection, stale summaries, label propagation, hallucinated links, human review, and rollback.
- Retrieval tradeoffs: index-first wiki navigation versus vector RAG, hybrid BM25/vector search, reranking, local embeddings, snippet budgets, and remote embedding denial.
- Swarm evaluation: when parallel agents improve throughput, when they create coordination debt, how to assign work, and how to prevent review capture.
- Local model viability for schema-following, tool calls, memory summarization, and offline embeddings.
- Provider policy: data retention, regional routing, ephemeral credentials, revocation, spend controls, and audit of remote inference.
- Formal authority model: prove that model text, memory text, remote agent messages, and MCP descriptors cannot mint capOS authority.
Research: Scientific Agent-Lab Software Stack
This note surveys existing scientific software that capOS should treat as adaptable service backends for a future agent-facing research lab. The central lesson is that capOS should not invent a new computer algebra system, solver, proof assistant, notebook system, or package manager. It should give agents typed, audited, resource-bounded capabilities over mature tools and preserve the exact environment, inputs, outputs, and proof artifacts needed for review.
Design Consequences For capOS
- Provide a
scientific-standardpackage as a service graph, not as ambient binaries on a global filesystem. - Start by adapting existing command-line and library tools behind narrow typed facades. Native rewrites are unjustified until a backend needs a smaller trusted core or a direct capability ABI.
- Treat heavyweight systems such as SageMath, OSCAR, JupyterLab, Lean/mathlib, and Spack as environment subjects: they need package-store, workspace, process, network, cache, and quota policy, not just a binary launch API.
- Expose solver and proof tools as deterministic request/response services
whenever possible. A model should ask
SmtSolver.check,ProofSession.build, orOptimizationSolver.solve, not run arbitrary shell text. - Keep formal proof assistants separate from automatic solvers. SMT results are useful evidence, but durable mathematical claims need proof artifacts checked by Lean, Rocq, Isabelle, Agda, or another trusted kernel.
- Make provenance a first-class output. Every notebook cell, solver run, proof build, CAS session, package environment, model prompt, and data input should produce replayable metadata and an audit record.
- Prefer open-source backends in the default package. Proprietary engines such as Wolfram Engine can be optional connector services with explicit license, network, and production-use metadata.
Source Baseline
External sources used for this survey:
- PARI/GP: https://pari.math.u-bordeaux.fr/
- SageMath: https://www.sagemath.org/
- GAP: https://www.gap-system.org/
- Singular: https://www.singular.uni-kl.de/
- OSCAR: https://www.oscar-system.org/about/
- SymPy: https://www.sympy.org/
- GNU Octave: https://octave.org/about
- R Project: https://www.r-project.org/
- SciPy: https://scipy.org/
- JupyterLab: https://jupyterlab.readthedocs.io/
- Z3: https://github.com/Z3Prover/z3
- cvc5: https://github.com/cvc5/cvc5
- HiGHS: https://highs.dev/
- SCIP: https://scipopt.org/
- OR-Tools: https://developers.google.com/optimization
- JuMP: https://jump.dev/
- CVXPY: https://www.cvxpy.org/
- Lean 4: https://lean4.dev/
- Lean mathlib: https://github.com/leanprover-community/mathlib4
- Rocq Prover: https://rocq-prover.org/
- Isabelle: https://isabelle.cs.tum.edu/
- Agda: https://agda.github.io/agda/
- Spack: https://spack.io/ and https://computing.llnl.gov/projects/spack-hpc-package-manager
- Guix-HPC: https://hpc.guix.info/
- Nix: https://nixos.org/
- Apptainer: https://github.com/apptainer/apptainer
Local grounding:
- Language Models and Agent Runtime
- capOS-Hosted Agent Swarms
- Userspace Binaries
- Stateful Task and Job Graphs
- Storage and Naming
- HPC Parallel Processing Patterns
- GPU Capability
- System Performance Benchmarks
- Hosted Agent Harnesses
- HPC Parallel Patterns
- Small LLM Survey
- Linux Sandboxes And Virtualization For Workloads
- NO_HZ, SQPOLL, and Realtime Scheduling
Tool Families
Computer Algebra And Exact Mathematics
PARI/GP is the obvious default for number-theory work. The upstream project is
a cross-platform open-source computer algebra system designed for fast number
theory computations, and it exposes both the gp shell and the PARI C library.
For capOS, the C library is the better long-term service backend; the shell is
useful for early compatibility and transcript capture.
SageMath is the best broad open-source mathematical umbrella. It is GPL software built on NumPy, SciPy, matplotlib, SymPy, Maxima, GAP, FLINT, R, and other packages, with the explicit mission of being a free alternative to Magma, Maple, Mathematica, and Matlab. Sage is too large to make into a small TCB component, but it is ideal as an agent lab kernel when the package store and Python compatibility layer exist.
GAP is the standard open system for computational discrete algebra, especially computational group theory. Singular is the specialized polynomial, commutative-algebra, algebraic-geometry, and singularity-theory system. OSCAR is the newer Julia-based research system that unifies GAP, Singular, Polymake, ANTIC/Hecke/Nemo/AbstractAlgebra-style capabilities across algebra, geometry, number theory, and polyhedral geometry. For capOS this suggests two levels: small typed services for common requests, and full language kernels for research workflows that need the native ecosystem.
SymPy is lightweight and embeddable because it is Python-based and has few dependencies. It is a good first symbolic backend for agents that need exact manipulation, code generation, and checkable expressions without launching Sage. SymPy should not replace PARI, GAP, Singular, or OSCAR for their specialist domains.
Numerical, Statistical, And Notebook Workflows
SciPy provides fundamental algorithms for scientific computing in Python: optimization, integration, interpolation, eigenvalue problems, algebraic and differential equations, statistics, sparse matrices, k-dimensional trees, and more. It sits on NumPy and compiled Fortran/C/C++ kernels, so capOS support depends on Python, native extension loading, BLAS/LAPACK packaging, and controlled native-code execution.
R remains the standard open statistical environment. GNU Octave remains the open MATLAB-like numerical environment for linear and nonlinear numerical work. Julia is strategically important because OSCAR, JuMP, SciML, and many modern research packages depend on it. The first capOS lab should host these as isolated language kernels rather than trying to normalize them into one universal API.
JupyterLab is the standard interactive computing front end because notebooks
combine code, prose, equations, visualizations, outputs, and controls. capOS
should adapt the notebook model but not grant notebook kernels ambient shell
or filesystem authority. A future NotebookSession should start kernels with
explicit workspace, package environment, data, network, and compute caps, and
record every execution result as a reproducible artifact.
Satisfiability, Optimization, And Operations Research
Z3 and cvc5 are the primary open SMT backends to expose through a capOS
SmtSolver capability. Both support stand-alone and library use. SMT-LIB
should be a supported import/export format, but the service API should expose
typed assumptions, objectives, timeouts, model requests, unsat cores, and
proof/certificate availability explicitly.
For mathematical optimization, capOS should separate modeling layers from solver engines:
- JuMP and CVXPY are high-level modeling interfaces that let researchers state optimization problems in Julia or Python.
- HiGHS is a strong open backend for large sparse LP, MIP, and QP models.
- SCIP is a broad optimization suite around constraint integer programming, with current Apache-licensed releases.
- OR-Tools is a practical operations-research toolkit, especially for constraint programming, routing, scheduling, and combinatorial optimization.
The capOS API should accept common model formats and provide bounded solve jobs with time/memory limits, deterministic seeds when supported, solution certificates when available, and reproducibility metadata. It should not hide which backend solved the problem.
Formal Proof Systems
Lean 4 is both a general-purpose functional language and an interactive theorem prover, with mathlib as its main community mathematics library. It is the best default for agent-assisted formal mathematics because current LLM tooling and library momentum are strongest there.
Rocq, formerly Coq, remains an industrial-strength dependently typed prover with a long verification history and program extraction story. Isabelle is a generic proof assistant, with Isabelle/HOL and mature automation important for systems proofs. Agda is valuable for constructive type theory and dependently typed programming. A capOS lab should support all of them as separate proof kernel families instead of pretending they are interchangeable.
Agent integration should be conservative:
- Agents may propose proof edits, search lemmas, call tactics, and run builds.
- The proof checker decides whether a theorem is accepted.
- Accepted proof artifacts must include toolchain version, library revision, package closure, command line, and full build log.
- CAS or SMT evidence can guide a proof but is not the proof unless the proof assistant checks an imported certificate or independently reconstructs it.
Reproducible Environments And Package Stores
The scientific stack is too large and language-diverse for a hand-written capOS package format to be the first step. Existing systems offer useful pieces:
- Nix provides isolated, declarative package builds and large package coverage.
- Guix-HPC focuses on reproducible scientific deployment, per-user environments, and bit-for-bit repeatability from a specific Guix commit.
- Spack is the HPC-oriented answer for many compiler, MPI, CPU-target, and library variant combinations.
- Apptainer is common in HPC because it packages software into portable images while integrating with GPUs, high-speed networks, and shared filesystems.
capOS should not import any of these as the kernel package manager. Instead,
it should adapt their recipe and closure ideas into capability-native
PackageCatalog, PackageClosure, Environment, and BuildService
interfaces. Early implementations can execute Nix/Guix/Spack/Apptainer on a
Linux host or sidecar; later capOS can consume signed closures as Store objects.
What An LLM Research Lab Needs
A credible LLM agent research lab on capOS needs more than model inference:
- Workspace service. Branchable project workspaces with exact input, output, patch, and artifact history.
- Package environments. Content-addressed software closures for Python, Julia, R, C/C++/Fortran, Lean, Rocq, Isabelle, GAP, PARI, Sage, and solver stacks.
- Notebook service. Jupyter-compatible documents and kernels, but kernels receive explicit caps instead of ambient filesystem, process, or network access.
- Experiment registry. Runs have immutable parameters, model versions, prompts, tool descriptors, seeds, package closures, datasets, results, and reviewer decisions.
- Solver/proof services. CAS, SMT, optimization, and formal proof systems are high-level tool capabilities with structured inputs and bounded resources.
- Literature and retrieval. Paper, code, dataset, citation, and note stores are ordinary namespaces; retrieval does not imply authority to fetch or publish.
- Job graph orchestration. Long calculations, training/evaluation jobs, proof builds, benchmark sweeps, and multi-agent tasks need resumable job graphs with cancellation and status.
- Compute authority. CPU, memory, storage, network, GPU, realtime, and external-provider quotas must be explicit and visible in the audit log.
- Human review surfaces. Agents can generate results, but publication, credential use, external API calls, irreversible filesystem changes, and proof-of-result claims need review gates.
Adaptation Strategy
The near-term path is a staged compatibility bridge:
- Hosted Linux sidecars. Run existing stacks on Linux while capOS exposes them as remote capability services. Use namespace/cgroup/seccomp/Landlock sandboxes for trusted batch tools; use hardware-backed Linux guests (QEMU/KVM first, Firecracker/Kata-style microVMs later) for untrusted notebooks, model-generated code, package builds, and multi-tenant agent jobs. Treat User-Mode Linux as a developer/debug fallback, not the primary strong-isolation boundary. This proves interfaces and audit before native package support exists.
- Command-wrapper services. Wrap tools such as
gp,gap,Singular,lean,lake,rocq,isabelle,octave,Rscript, and solver CLIs with explicit input/output directories, timeout, memory, and network policy. - Library-backed services. Replace wrappers with direct C/C++/Rust/Julia FFI or process-local RPC for small stable APIs such as PARI, Z3, cvc5, and HiGHS.
- Notebook and language kernels. Add Python, Julia, R, Sage, and Lean kernels with capOS-authored kernel launchers and artifact capture.
- Package-closure ingestion. Import Nix/Guix/Spack closures as signed Store objects, then build a capOS-native catalog around content hashes, licenses, vulnerability metadata, CPU/GPU compatibility, and provenance.
- Native capOS services. Only after the interface stabilizes, port the most useful small engines or linkable libraries into native userspace.
Risks And Open Questions
- Supply-chain size. Sage, Julia, Python scientific stacks, and proof libraries bring huge dependency closures. capOS must record and constrain them rather than pretend they are small trusted components.
- Nondeterminism. Floating-point math, randomized solvers, parallel BLAS, GPU kernels, and package resolution can make replay differ. Results need deterministic seeds and variance metadata, not only final answers.
- License boundaries. GPL, LGPL, Apache, MIT, BSD, proprietary, academic, and optional commercial solvers need explicit metadata before packaging.
- Proof trust. A CAS result, SMT model, or solver objective value can be false because of bugs, numeric tolerance, or bad modeling. Formal proof claims must be checked by the named proof kernel or labeled as empirical.
- Agent overreach. The default scientific package must not grant arbitrary shell, network, credential, package-install, or publishing authority to a model. Agents receive tools through runner policy, not direct backend caps.
- Notebook security. Notebooks are executable documents. Opening one is not consent to run it with the reader’s caps.
- Linux sidecar boundary drift. Namespaces, seccomp, Landlock, gVisor,
User-Mode Linux, KVM guests, and microVMs are different security and
compatibility claims. capOS must record the backend, host kernel, policy,
image hashes, guest tickless/nohz state, and capOS outer
NoHzEligibility/NoHzActivationstate rather than labeling all of them “sandboxed Linux”.
Recommendation
Define a future scientific-standard package as a curated service graph with
three profiles:
- Base. PARI/GP, SymPy, Z3, cvc5, HiGHS, Lean, and artifact/provenance services through tight command or library wrappers.
- Research. SageMath, GAP, Singular, OSCAR/Julia, R, Octave, JuMP, CVXPY, SCIP, OR-Tools, Jupyter-compatible notebooks, and package-closure support.
- Lab. Hosted-agent workspaces, experiment registry, browser/web research tools, GPU-backed model/scientific kernels, distributed job graphs, and publication/review workflows.
The Base profile is the first useful target for agents: exact number theory, symbolic manipulation, SMT checking, linear/integer optimization, and Lean proof checking are enough to make an agent substantially more reliable without granting it a general-purpose scientific workstation.
Research: Multimedia Pipeline Latency
Survey of PipeWire and JACK design lessons for a capOS multimedia graph whose explicit goal is the minimal possible guaranteed-stable stack latency.
Goal
The capOS multimedia pipeline should optimize for the lowest end-to-end latency that capOS can guarantee stable under the selected workload, device, and routing graph. “Guaranteed stable” means the graph is admitted only when the kernel/services can reserve enough CPU, memory, device, and wakeup budget for every realtime cycle, and the graph fails closed when those guarantees can no longer be met. A graph that reports a smaller nominal buffer but produces xruns, underruns, clock drift, or large tail latency is worse than a graph with a slightly larger fixed quantum and a schedulability guarantee.
The target is not one universal latency number. The target is a measurable operating point with an explicit contract:
- fixed sample rate and quantum for the realtime island;
- bounded callback/process time per node;
- bounded graph traversal time per cycle;
- admitted worst-case execution budget for every node and bridge;
- reserved memory and pre-registered buffers for the whole graph;
- no allocation, blocking IPC, paging, logging, or credential checks on the realtime data path;
- visible latency contribution per node, link, bridge, device, and provider;
- admission rejection when the graph cannot fit the selected quantum;
- fail-closed handling through bypass, silence, stream stop, or quantum renegotiation rather than unbounded queue growth;
- policy that can choose “lowest stable” for pro audio and “efficient stable” for ordinary desktop/media playback.
This guarantee applies to local capOS-controlled realtime islands. It does not extend to browser scheduling, networks, or remote model/provider inference. Those parts can be measured, bounded by policy where possible, and isolated from the local graph, but not honestly guaranteed by capOS.
PipeWire Lessons
PipeWire separates graph configuration and IPC from realtime data processing. Its graph scheduling documentation describes a main thread for IPC and graph configuration and data processing threads that run with realtime priority. Node resources, buffers, I/O areas, and metadata are prepared in shared memory before realtime processing begins.
PipeWire also treats graph quantum and rate as first-class timing controls. Synchronous links can process in the same cycle, while asynchronous links add one cycle of latency. Its latency model propagates min/max latency through ports and adds latency when links or nodes introduce buffering.
Consequences for capOS:
- Media graph control and media graph processing should be separate execution domains.
- Buffers and metadata must be preallocated before the realtime cycle starts.
- A link that crosses an isolation, clock, process, network, or wakeup boundary must declare its additional latency instead of hiding it.
- Latency should be graph metadata, not an after-the-fact measurement only.
- Quantum and rate are policy inputs, not incidental driver details.
JACK Lessons
JACK was designed for professional low-latency audio. Its API centers on a process callback invoked by the JACK server at the correct time, graph-order callbacks, xrun notification, and port latency ranges. JACK’s latency API asks clients to report min/max latency so applications can detect routing that has become anomalous or needs compensation.
Consequences for capOS:
- A capOS native audio graph needs a cycle callback model for realtime nodes, even if the public API is capability-oriented rather than JACK-compatible C.
- The realtime callback contract must be restrictive: no blocking endpoint calls, no dynamic allocation, no filesystem/name lookups, and no waiting for policy decisions.
- Xruns and deadline misses are not debug trivia. They are first-class graph events that policy can use to increase quantum, disable expensive nodes, or move work to a different scheduling context.
- Per-port latency ranges are more useful than a single optimistic value.
Guarantee Model
capOS should use a guarantee ladder rather than a single vague “low latency” mode:
| Level | Meaning | Allowed uses |
|---|---|---|
| Best effort | No reserved budget; telemetry only | ordinary media, background capture |
| Bounded soft realtime | Deadlines and drops, but no formal admission proof | web shell voice, remote model paths |
| Guaranteed realtime island | Fixed quantum, admitted CPU/memory/device budgets, fail-closed overruns | native audio, local voice, pro-audio paths |
| Hard device deadline | Driver/device deadline is reserved and violation is treated as a system fault for that island | future dedicated hardware paths |
The first serious multimedia milestone should target guaranteed realtime islands for local audio. Web shell and remote model voice should remain bounded soft realtime because the browser/provider/network portions are outside local control.
Admission should require:
- every node declares worst-case execution time or a conservative budget;
- every bridge declares buffering and wakeup latency;
- every buffer pool is allocated and pinned/registered before start;
- every realtime thread has a scheduling context with period, budget, and priority;
- graph topology is frozen for the active cycle plan;
- overrun policy is configured before start.
If admission fails, the graph does not start at that quantum. If a running graph misses its guarantee, the system records a violation and applies the configured fail-closed policy instead of preserving continuity by accumulating hidden latency.
Stack Latency Model
For capOS, “stack latency” should be modeled as a composed budget:
flowchart LR
DeviceIn[ADC / capture device] --> DriverIn[driver period]
DriverIn --> CaptureRing[capture ring]
CaptureRing --> Graph[media graph quantum cycles]
Graph --> Bridge[process / isolation / network bridges]
Bridge --> Codec[codec / resampler / model adapter]
Codec --> PlaybackRing[playback ring]
PlaybackRing --> DriverOut[driver period]
DriverOut --> DeviceOut[DAC / playback device]
Each edge should carry:
- latency min/max in frames or nanoseconds;
- clock domain;
- quantum/rate;
- buffering depth;
- deadline;
- drift estimate;
- xrun/drop counters.
The useful metric is not just nominal round-trip latency. For guaranteed islands it is the admitted latency bound plus violation count. For softer paths it is nominal latency, p95/p99 process-cycle latency, worst observed cycle over a window, xrun rate, and drift between capture and playback clocks.
capOS Media Graph Shape
The multimedia graph should be a userspace service family:
flowchart LR
Control[control plane endpoint] --> GraphManager[MediaGraphManager]
GraphManager --> Policy[latency / route / permission policy]
GraphManager --> Nodes[node services]
Nodes --> Rings[MemoryObject media rings]
Rings --> Driver[audio/video driver services]
Rings --> Apps[application nodes]
Rings --> Provider[realtime model provider nodes]
The control plane may use normal capability endpoints. The data plane should
use shared MemoryObject rings plus futex/notification wakeups. Cap’n Proto
messages remain appropriate for graph setup, route changes, permission checks,
and telemetry, but not for per-frame audio payload copying.
Node classes:
- driver node: owns device-facing caps such as
DeviceMmio,DMAPool, andInterrupt; - graph driver node: provides the cycle clock for a realtime island;
- transform node: resampler, mixer, echo canceller, VAD, format converter;
- app node: user application capture/playback endpoint;
- bridge node: crosses process, clock, network, provider, or web boundary;
- realtime model node: provider/local model adapter that consumes and emits media plus control events.
Guaranteed Realtime Islands
capOS should not try to make the whole desktop one realtime graph. It should support small realtime islands with explicit rate/quantum policy:
- pro-audio island: low quantum, strict admission, few nodes, no remote model hop in the realtime loop;
- voice-agent island: low enough latency for conversation, with VAD/barge-in priority and bounded buffering;
- ordinary media island: efficient quantum and power policy;
- screen/video island: frame-deadline oriented rather than audio-period oriented.
Bridges between islands are allowed, but each bridge declares the latency it adds. A bridge from a guaranteed island to a non-guaranteed island must not backpressure the guaranteed island. It may drop, resample, replace with silence, or move to a larger negotiated quantum, but it must not create an unbounded queue. This is the PipeWire/JACK lesson in capOS terms: do not hide async links.
Scheduling Implications
Per-SQE deadlines are useful for stale work handling, but they are not enough for guaranteed multimedia latency. The graph needs future scheduling contexts:
- period: graph quantum duration;
- budget: maximum CPU time per period for a node or node group;
- priority: realtime island priority relative to other interactive work;
- affinity: optional CPU isolation for device and graph threads;
- overrun policy: drop, silence, bypass node, increase quantum, or stop graph.
Until scheduling contexts exist, capOS can only prototype bounded soft realtime. The design should still attach monotonic deadlines to media buffers and SQEs so late work is discarded deterministically instead of accumulating hidden latency, but documentation should not claim a local guarantee before admission and budget reservation exist.
Web Shell And Remote Models
Web shell voice and remote realtime models cannot provide guaranteed local stack latency across the full path. Browser scheduling, WebRTC/WebSocket transport, provider inference, and network jitter all sit outside capOS control.
The capOS goal still applies: guarantee the part of the stack capOS controls when it is inside an admitted realtime island, then expose the rest as measured latency and jitter:
- browser capture/playback buffer estimates;
- gateway queue depth;
- provider adapter send/receive jitter;
- model first-audio latency;
- tool-call pause duration;
- barge-in cancellation latency;
- playback underrun/drop counters.
This argues for a local media graph even when the model session is provider native. The local graph is where capOS can enforce bounded buffers, drops, deadlines, and audit.
Design Rules
- Prefer fixed quantum inside a realtime island.
- Reject graph activation or graph changes that cannot be admitted at the selected quantum unless policy explicitly relaxes the guarantee.
- Treat every async boundary as one or more declared latency cycles.
- Keep realtime callbacks pure data processing.
- Move permission checks, tool execution, logging, graph mutation, and model policy to non-realtime threads.
- Preallocate buffers and register memory before starting the graph.
- Use latency ranges and measured telemetry, not a single optimistic latency.
- Provide fail-closed policy that stops, bypasses, silences, or renegotiates quantum when a guarantee is violated, rather than letting queues grow.
- Preserve capability isolation even when it costs a cycle; make the cost explicit and measurable.
- Keep pro-audio/local paths independent from remote-provider voice paths.
Open Questions
- What is the first capOS-visible latency target: voice shell, local playback, or pro-audio loopback?
- Should graph-driver threads live in a privileged media service, or can an application own a realtime island under broker policy?
- How should admission control estimate whether a new node can fit a quantum before activating it?
- Should bridge latency be specified by policy, measured dynamically, or both?
- Which telemetry window should determine when a bounded-soft-realtime path should switch to a larger quantum?
- How should future CPU donation interact with graph scheduling contexts?
References
- PipeWire, Graph Scheduling
- PipeWire, Latency support
- PipeWire, jack.conf
- JACK Audio Connection Kit, API overview
- JACK Audio Connection Kit, Setting Client Callbacks
- JACK Audio Connection Kit, Managing and determining latency
Research: Realtime Multimodal Agent APIs
Survey of provider APIs for realtime native-audio, multimodal, tool-using agents, and the consequences for capOS voice agent-shell, web shell, media graph, scheduling, and capability boundaries.
Scope
This report focuses on APIs where a model can consume realtime audio and emit both audio output and structured tool calls in one session. That is distinct from a chained pipeline where the application separately runs ASR, a text model, and TTS.
The immediate capOS question is whether the earlier agent-shell design should remain text-first with optional ASR/TTS wrappers, or whether it needs a first-class realtime multimodal model session.
Source Snapshot
All source observations below were checked against official provider documentation on 2026-04-25.
- The companion multimedia pipeline latency note covers PipeWire and JACK lessons for low-latency graph scheduling, latency reporting, realtime callbacks, and stable quantum selection.
- OpenAI Realtime API docs describe speech-to-speech sessions, WebRTC and
WebSocket transports, realtime function calling, interruption/truncation, and
the
gpt-realtimemodel family. - OpenAI Voice Agents docs explicitly frame the architecture choice as direct live audio sessions versus chained speech-to-text, text-agent, and text-to-speech pipelines.
- Google AI Gemini Live API docs describe realtime audio/image/text input, audio output, WebSocket transport, VAD, barge-in, tool use, and ephemeral tokens for client-to-server browser use.
- Vertex AI Gemini Live API docs describe the enterprise/cloud variant with realtime voice/video, native audio, transcriptions, function calling, Google Search grounding, and provisioned-throughput-oriented deployment considerations.
Provider Findings
OpenAI Realtime API
OpenAI’s Realtime API is a stateful session API for low-latency interactions
with realtime models. The docs describe calling models such as
gpt-realtime for speech-to-speech conversations over WebRTC or WebSocket,
with the session carrying model, voice, conversation items, and generated
responses.
Important details for capOS:
- Browser clients are steered toward WebRTC for more consistent media performance; server-to-server integrations are steered toward WebSocket.
- WebRTC media and control are split: audio is handled by the peer connection, while other events travel over a data channel.
- WebSocket integrations carry JSON events and require the application to manage input and output audio buffers directly.
- Realtime function calling is session/response configured. The model emits a
function_callitem with a name, JSON arguments, and a generated call id. The application executes the function and sends back afunction_call_outputconversation item keyed by that call id. - Realtime interruption is a first-class path. With VAD, user speech can cancel an ongoing model response. WebRTC/SIP paths have server-side knowledge of played audio; WebSocket paths require the client to stop playback and send truncation metadata for unplayed audio.
gpt-realtime-1.5is documented as a realtime audio-in/audio-out model with text, audio, and image input; text and audio output; and function calling. The current model page marks video as unsupported.
OpenAI’s Voice Agents docs expose the architectural tradeoff directly: live speech-to-speech sessions are the natural low-latency path, while chained ASR plus text-agent plus TTS gives stronger intermediate control and is often more appropriate for approval-heavy workflows.
Google AI Gemini Live API
Google AI’s Gemini Live API is a realtime stateful WebSocket API. The developer docs describe audio, image, and text input; audio output; VAD; barge-in; transcriptions; proactive audio; affective dialog; and tool use.
Important details for capOS:
- The Google AI developer API lists input audio as raw 16-bit PCM at 16 kHz little-endian, image input as JPEG at up to 1 FPS, and output audio as raw 16-bit PCM at 24 kHz little-endian.
- The public developer API supports server-to-server and client-to-server approaches. Client-to-server avoids backend media proxy latency but requires ephemeral tokens rather than long-lived API keys in client code.
- Ephemeral tokens are Live-API-only, short-lived credentials. Google documents default timing behavior of roughly one minute to start a new session and thirty minutes for sending messages over a connection, with the ability to restrict tokens to Live API model/config constraints.
- Tool use supports function calling and Google Search. Function declarations are installed in session configuration, and the client must manually send tool responses. Google AI docs distinguish synchronous function calls from non-blocking function declarations on models that support them, with response scheduling options such as interrupting current model output, waiting until idle, or staying silent.
- Tool support differs by model family and revision. The Google AI docs list Gemini 3.1 Flash Live Preview and Gemini 2.5 Flash Live Preview with function calling, but not all asynchronous behavior is supported by every model.
Vertex AI Gemini Live API
Vertex AI’s Live API docs describe the Google Cloud deployment path. The docs
currently present gemini-live-2.5-flash-native-audio as generally available
and recommended for low-latency voice agents, with native audio,
transcriptions, VAD, affective dialog, proactive audio, and tool use. They also
document a preview native-audio model and state a deprecation date for the
older preview native-audio release.
The Vertex AI page is relevant to capOS for enterprise deployment:
- It documents raw PCM input/output rates and a stateful WSS protocol.
- It describes realtime voice/video agents, tool use through function calling and Google Search, audio transcriptions, barge-in, and proactive audio.
- It points at partner WebRTC integrations, while the core Vertex API remains WebSocket-oriented in the referenced docs.
- It exposes cloud operational concerns not present in the simple developer API view: access management, request logging, provisioned throughput, PayGo variants, quotas, and regional/cloud deployment policy.
Comparison
| Axis | OpenAI Realtime | Gemini Live API | Vertex AI Live API |
|---|---|---|---|
| Primary low-latency model shape | Realtime model session | Live model session | Cloud Live model session |
| Browser media path | WebRTC recommended | WebSocket with ephemeral token; partner WebRTC integrations exist | Partner WebRTC integrations; core docs emphasize WSS |
| Server path | WebSocket | WebSocket via Gen AI SDK/raw protocol | WebSocket via Gen AI SDK/raw protocol |
| Input | Text/audio/image on current realtime models | Audio/image/text | Audio/video/text |
| Output | Text/audio | Audio in Google AI overview | Audio/text in Vertex overview |
| Tool calls | Function calling, client executes and returns output | Function calling, client sends FunctionResponse | Function calling and Google Search grounding |
| Interruption | VAD, cancellation, output truncation | VAD/barge-in | VAD/barge-in |
| Client credential pattern | OpenAI ephemeral client secrets for browser realtime | Live-API ephemeral tokens | Cloud auth/service identity; client direct path depends on deployment |
The practical conclusion is that a capOS abstraction should not bake in a single provider transport. OpenAI’s best browser path is WebRTC; Gemini’s core developer path is WebSocket with ephemeral tokens; Vertex AI adds enterprise auth and throughput controls. The common semantic layer is not “WebRTC” or “WebSocket.” It is a realtime model session carrying media frames, transcripts, model audio output, structured tool calls, tool results, cancellation, and session policy.
Consequences For capOS
A First-Class RealtimeModelSession
The existing language-model proposal is text-centric:
LanguageModel.completeLanguageModel.stream- tool calls emitted in assistant messages
- runner executes tools
That remains useful. It should not be stretched to pretend realtime audio is just a token stream. Native realtime voice models need a sibling interface:
interface RealtimeModel {
info @0 () -> (info :RealtimeModelInfo);
open @1 (config :RealtimeSessionConfig) -> (session :RealtimeModelSession);
}
interface RealtimeModelSession {
sendInput @0 (event :RealtimeInputEvent) -> ();
next @1 () -> (event :RealtimeOutputEvent, done :Bool);
sendToolResult @2 (result :RealtimeToolResult) -> ();
cancel @3 (reason :CancelReason) -> ();
close @4 () -> ();
}
This interface lets a provider adapter hide whether it is OpenAI WebRTC, OpenAI WebSocket, Gemini WebSocket, Vertex AI, a local model, or a future GPU pipeline. It also keeps the existing capOS rule: the model never receives session authority. It emits structured tool calls, and the trusted runner executes or refuses them.
Direct Native Audio Versus Chained Pipeline
capOS should support both.
Use a direct native-audio session when:
- the user expects conversational voice with low latency;
- barge-in and expressive speech matter;
- the provider model can safely handle tool-call turns in the same session;
- provider telemetry, cost, and policy permit streaming user audio off-box.
Use a chained pipeline when:
- the workflow is approval-heavy or destructive;
- deterministic transcript capture is mandatory before reasoning;
- ASR and TTS need to be local for privacy;
- the agent runner needs to inspect, redact, or transform text before model inference;
- the session is anonymous or guest and broker policy forbids remote live audio.
For web-shell voice, direct native audio is a better interactive experience, but the chained path is the safer fallback and the better first local proof.
Tool Calls Remain Proposals
Realtime providers can emit tool calls while producing or pausing audio. capOS must still treat those calls exactly like text-agent tool calls:
- The model emits a structured call name and arguments.
- The agent runner validates the call against advertised tool descriptors.
- Broker policy decides
auto,consent,stepUp, orforbidden. - The runner invokes the underlying typed capability if allowed.
- The runner sends a tool result back into the realtime session.
- Audit records bind model id, session id, tool descriptor revision, typed arguments, permission decision, outcome, and any spoken/user confirmation.
The model must not hold the tool caps. The provider session must not receive
raw TerminalSession, Launcher, ProcessSpawner, tokens, credentials, or
session bundle authority.
Audio Is Not Terminal Text
Voice input should not be encoded as TerminalSession.readLine, and output
audio should not be TerminalSession.writeLine. The terminal stream remains a
presentation channel. Voice is a sibling media channel bound to the same
authenticated session id.
This separation matters because realtime audio has properties terminal text does not:
- frame timestamps;
- playback positions;
- output truncation;
- VAD and barge-in events;
- partial transcripts;
- deadline and stale-frame handling;
- binary frame formats;
- provider-specific session ids and event ids.
Media Graph Substrate
Provider-native realtime sessions do not eliminate the need for a local media graph. The graph becomes the local routing and policy layer, with the explicit goal of minimizing and guaranteeing the portion of stack latency capOS controls inside admitted realtime islands:
flowchart LR
Mic[BrowserMic / DeviceMic] --> Capture[capture buffer]
Capture --> Gate[VAD or push-to-talk gate]
Gate --> Adapter[provider adapter or local ASR]
Adapter --> Session[RealtimeModelSession]
Session --> Runner[tool-call gate in agent runner]
Runner --> Output[model audio output / local TTS]
Output --> Playback[playback buffer]
Playback --> Speaker[BrowserSpeaker / DeviceSpeaker]
On native capOS, device-facing audio eventually needs DeviceMmio, DMAPool,
and Interrupt authority. On WebShellGateway, browser WebAudio/WebRTC handles
physical microphone/speaker I/O, while capOS still owns the session authority
and tool execution boundary. The graph should follow the multimedia latency
research rule: use admitted realtime islands, preallocated media rings,
declared async-link latency, fail-closed overrun policy, and xrun/deadline
telemetry rather than hidden buffering.
Scheduling And Deadlines
Realtime voice is soft realtime for web-shell use:
- capture frames should be forwarded before they become stale;
- model output audio should be played or discarded, not accumulated without bound;
- barge-in must beat model momentum;
- tool execution must not block media handling forever.
Per-SQE or per-media-frame deadlines are useful metadata, but not authority. CPU guarantees still belong to future scheduling contexts. The media graph and realtime provider adapter should attach absolute monotonic deadlines to frames, tool calls, and playback events so stale work can be dropped deterministically.
Browser/WebShellGateway Implications
Provider docs support two deployment shapes:
- Browser connects directly to provider using provider-issued ephemeral credentials. This minimizes media latency but exposes provider session traffic directly to browser JavaScript.
- Browser streams media to
WebShellGateway, which connects to the provider server-side. This keeps provider credentials off the browser and lets capOS inspect/redact/rate-limit audio, but adds gateway latency.
For capOS, direct browser-to-provider media should be treated as an optimized
media path, not the baseline authority model. The baseline should keep
WebShellGateway and the agent runner in control of session lifecycle,
tool-call gating, audit, and teardown. If direct provider media is later used,
it should initially be media-only unless the provider offers a trusted
server-side control channel that lets the capOS adapter receive tool calls,
send tool results, and revoke the provider session without relying on browser
JavaScript.
The later browser-agent UI model is a separate policy choice: browser
JavaScript may receive provider tool-call events and orchestrate the provider
loop, but it still receives no capOS session caps or tool authority. Every
provider tool call must be forwarded as a structured ToolRequest to
WebShellGateway, and the gateway must validate descriptor freshness, session
state, consent/step-up, quotas, replay protection, and audit before invoking
real capOS capabilities. If those gateway controls are unavailable, provider
tool declarations must be disabled in the direct browser session and all
tool-capable turns must use gateway-mediated provider sessions. The browser
receives only short-lived, provider-scoped, model/config-locked tokens minted
by a broker-controlled service.
Recommended capOS Direction
- Keep
LanguageModelfor text and chained workflows. - Add
RealtimeModel/RealtimeModelSessionfor native realtime multimodal sessions. - Model provider adapters should be ordinary services:
OpenAIRealtimeProviderGeminiLiveProviderVertexLiveProviderLocalRealtimeProvider
- A capOS-side agent runner or
WebShellGateway’s server-side tool proxy remains the only holder of session caps and the only executor of real capOS tools. - WebShellGateway owns browser transport, media channels, and browser-agent tool proxy enforcement, but browser JavaScript owns no tool authority.
- Media graph primitives should use
MemoryObject, notifications, futexes, and scheduling contexts as they land. - Direct browser-to-provider connections require broker-minted ephemeral credentials and explicit audit of what bypasses gateway media inspection.
Open Design Questions
- Should
RealtimeModelSessionexpose provider event ids verbatim, or should it normalize them to capOS-generated ids and retain provider ids only in audit metadata? - Should direct provider WebRTC be allowed for operator sessions, or should all production web-shell voice flow through WebShellGateway?
- How much partial transcript text is trusted enough to render before the provider marks it final?
- Can a provider-generated audio response be spoken before pending
consentorstepUpdecisions are resolved, or must speech pause at tool-call gates? - How should local wake-word/VAD models be sandboxed so they can improve UX without becoming an authorization factor?
- Should media-frame deadlines be added to the existing SQE reserved field, or kept in media-ring metadata until the scheduler has scheduling contexts?
References
- OpenAI, Realtime conversations
- OpenAI, Realtime API with WebRTC
- OpenAI, Realtime API with WebSocket
- OpenAI, Voice agents
- OpenAI, gpt-realtime-1.5 model page
- Google AI for Developers, Gemini Live API overview
- Google AI for Developers, Tool use with Live API
- Google AI for Developers, Ephemeral tokens
- Google Cloud Vertex AI, Gemini Live API overview
Research: Robotics Realtime Control
Survey of robotics realtime-control practice and the consequences for using capOS as a robot brain for industrial robots, vacuum/mobile robots, RC cars, drones, and autonomous vehicles.
Scope
This note is about the operating-system and middleware boundary, not robot kinematics or control theory. The capOS question is whether a capability OS can be a credible robot brain without pretending that every perception, planning, networking, and actuator path has the same timing or safety requirements.
The answer is conditional:
- capOS is a plausible high-level robot brain and isolation substrate.
- capOS should eventually host bounded realtime control islands.
- capOS should not claim certified hard-realtime safety-controller status until scheduling contexts, driver isolation, timing analysis, fault containment, and certification evidence exist.
- For early physical robots, capOS should supervise and coordinate while microcontrollers, PLCs, motor controllers, or flight controllers close the tightest safety loops.
Source Snapshot
External source observations below were checked on 2026-04-25.
Related local grounding:
- Out-of-kernel scheduling explains why capOS should split scheduler policy from kernel dispatch and budget enforcement.
- Multimedia pipeline latency defines the guaranteed-realtime-island model that also applies to robot control loops.
- Completion rings and threaded runtimes grounds per-thread completion ownership and full-SMP ring direction.
- seL4 and Genode ground capability isolation, resource donation, and component composition for safety-critical systems.
External Findings
ROS 2 Realtime Direction
ROS 2 documentation frames realtime computing as central to autonomous vehicles, spacecraft, and industrial manufacturing. Its realtime programming guide emphasizes periodic loops, bounded jitter, and avoiding page faults, dynamic allocation, and indefinitely blocking synchronization on the realtime path.
The ROS 2 design background makes a sharper point: an OS can provide deterministic services, but application code must still avoid nondeterministic behavior. It recommends separating startup/preallocation, realtime-safe loop, and teardown phases. This maps directly to capOS admission: graph setup may use ordinary capability calls, but the admitted realtime cycle must run over preallocated buffers and pre-authorized work.
ros2_control
The ros2_control controller manager is a useful concrete precedent. It owns a
periodic hardware-control loop whose shape is read state, update controllers,
and write commands. Its documentation attempts to run the main controller
thread under SCHED_FIFO, reports controller/hardware periodicity and
execution-time diagnostics, and warns that normal Linux is throughput-oriented
rather than ideal for hardware control.
Consequences for capOS:
- The robot-control API should make the cyclic read/update/write loop explicit.
- Controller activation, hardware claiming, fallback, and limits are safety policy, not incidental plugin mechanics.
- Periodicity, execution time, overruns, and command-limit enforcement need to be first-class telemetry.
- A controller state query or lifecycle transition that is not realtime-safe must be prohibited inside the admitted control loop.
micro-ROS Executor
micro-ROS documents why the default ROS 2 executor is problematic for deterministic robotic control: timer precedence, non-preemptive round-robin callback execution, no explicit callback priority, and only one input per handle can all create priority inversion and weak latency bounds. Its rclc Executor adds static sequential execution, trigger conditions, optional multi-thread scheduling configuration, and Logical Execution Time semantics. It also allocates callbacks during configuration, not during runtime.
Consequences for capOS:
- A robot graph should have an explicit execution plan, not generic event-loop fairness.
- Sense-plan-act phases should be expressible as a timed DAG with trigger conditions.
- LET-style input/output boundaries are useful for sensor fusion and multi-rate control where lower jitter is worth one controlled period of latency.
- Runtime graph mutation belongs outside the realtime cycle.
Current Research Trend
A 2026 ROS 2 realtime survey reports that recent work focuses on executor analysis, DDS communication delays, response time, reaction time, data age, message filters, profiling tools, and micro-ROS. That confirms that the hard part is not merely “use ROS 2”; it is making callback scheduling, data age, and communication delays analyzable.
ReDAG-RT, submitted in March 2026, is a recent example of the same pressure. It adds a user-space global scheduler for ROS 2 callback DAGs using rate-priority ordering and per-DAG concurrency bounds. The result is relevant even if capOS does not run ROS 2 unchanged: robot workloads want graph-level scheduling policy with bounded interference, not only thread priorities.
A UAV PREEMPT_RT paper submitted in April 2026 studies a 250 Hz flight-control loop on Raspberry Pi 5 and isolates timing effects from deferred Linux activation paths versus direct realtime activation. The useful warning for capOS is that multicore SoC shared-resource contention can dominate nominal loop frequency. Capability isolation is not sufficient without temporal and cache/bus interference accounting.
seL4 MCS And Timing Work
seL4 MCS exposes scheduling contexts as kernel-managed objects, including periodic threads and passive servers. The Trustworthy Systems timing work emphasizes deadline guarantees, temporal isolation, and WCET analysis for kernel paths.
Consequences for capOS:
- Processor time should become explicit authority. A process that can command a motor still needs budget authority to do so at a period.
- Passive-server and scheduling-context donation semantics fit robot services: a controller can run on the caller’s admitted budget when that is the intended timing contract.
- Hard realtime claims require bounded kernel paths and timing evidence, not only a priority scheduler.
Linux PREEMPT_RT And Xenomai
The Linux kernel now documents PREEMPT_RT internals, including priority inheritance, threaded interrupts, and differences from non-RT kernels. Xenomai remains a strong precedent for systems that split stringent realtime work into a co-kernel or companion core while keeping Linux services available for ordinary work.
Consequences for capOS:
- There is a practical ladder: normal scheduling, soft realtime with telemetry, admitted realtime islands, and hard device deadlines.
- If capOS cannot yet provide hard bounds, it should make that status visible instead of hiding it behind a “realtime” label.
- A future capOS robotics platform may still delegate the smallest motor or flight-control loop to an MCU/RTOS while capOS owns capability isolation, planning, perception, logging, updates, and operator control.
Orocos
Orocos is a long-running robotics control precedent: portable C++ libraries for advanced machine and robot control, with the Real-Time Toolkit as a component framework for realtime components.
Consequence for capOS: robotics developers need component lifecycle, deployment, ports, and runtime introspection. capOS should not expose only raw actuator writes; it needs a component/graph model where a realtime component can be admitted, activated, monitored, and deactivated without granting broad device authority.
Mobile Robots, Drones, And Cars
Nav2 presents a production-grade ROS 2 navigation framework for mobile and surface robots, with perception, planning, control, localization, behaviors, collision monitoring, docking, and teleoperation. It is the right class of software for vacuum cleaners, warehouse robots, rovers, and small RC-car autonomy, but it is not itself a hard-safety controller.
PX4 recommends ROS 2 for companion-computer integration when low latency and Linux libraries matter, while the autopilot remains the flight controller. ArduPilot documents the same split: companion computers consume MAVLink telemetry and make higher-level decisions while the autopilot owns the hard vehicle-control loop.
Autoware is the comparable open-source autonomous-driving stack. It is built on ROS and presents perception, localization, planning, control, and vehicle interface modules for autonomous driving. That is the right architectural shape for a capOS self-driving-car prototype: capOS can isolate and supervise modules, but a safety-certified vehicle interface and independent safety controller remain mandatory.
Manufacturing Interoperability
OPC UA Companion Specifications exist to define industry/device-specific information models and environment profiles. OPC UA is designed to scale from field-level devices to enterprise management. For manufacturing robots, this matters because the robot brain rarely talks only to motors; it must also exchange state, jobs, alarms, and audit data with PLCs, MES/SCADA systems, and vendor controllers.
Consequence for capOS: industrial integration should use typed gateway
services. A capOS robot brain should expose and consume narrow manufacturing
capabilities such as RobotCellStatus, JobQueue, SafetyState,
ProgramSelector, and AlarmLog, not ambient network sockets or filesystem
paths.
Timing Classes
Robots mix several timing classes:
| Class | Typical loop | Examples | capOS stance |
|---|---|---|---|
| Hard safety | microseconds to milliseconds | e-stop chain, torque disable, flight stabilization | external certified controller first; future capOS only with evidence |
| Cyclic motion control | 250 Hz to 4 kHz or higher | joint servo, wheel velocity, PWM/ESC updates, EtherCAT cycle | future admitted realtime island; early offload to MCU/PLC |
| Local autonomy | 10 Hz to 100 Hz | obstacle avoidance, local planner, odometry fusion | plausible early capOS target with deadline/drop telemetry |
| Perception and mapping | 1 Hz to 60 Hz | camera/lidar processing, SLAM, object detection | capOS service graph, GPU/NPU caps later |
| Mission behavior | event-driven to 10 Hz | route plan, behavior tree, job dispatch, teleop mode | strong capOS fit |
| Fleet/cloud integration | seconds and slower | logs, updates, digital twin, MES/SCADA | strong capOS fit |
The mistake would be to put all of these on one generic executor and call it a robot brain. The capOS advantage is that each row can have different authority, budget, telemetry, and failure policy.
Domain Consequences
Manufacturing Robots
capOS can plausibly supervise a robot cell:
- isolate vendor robot gateways, PLC gateways, camera/lidar services, planning services, operator UI, audit, and update agents;
- hold explicit capabilities for cell state, job selection, robot program invocation, fixtures, safety-state observation, and logs;
- run non-safety planning and perception near the robot;
- bridge OPC UA, fieldbus, and vendor APIs through narrow service caps.
capOS should not initially replace:
- certified safety PLCs;
- e-stop and guarding;
- servo drives’ inner control loops;
- vendor-certified robot-controller safety functions.
Vacuum Cleaners And Indoor Mobile Robots
capOS is a better early fit here:
- high-level mapping, route planning, room segmentation, cleaning policy, docking, telemetry, and operator control are natural services;
- wheel PID, bumper debounce, cliff sensors, battery protection, and motor current cutoffs can stay on a small MCU;
- Nav2-like navigation concepts can map to capOS graph services and typed actuator/sensor caps.
The first useful physical demo could be a small differential-drive base with
capOS running on an SBC and an MCU exposing a typed BaseDrive cap.
RC Cars And Rovers
An RC-car class platform is a good capOS autonomy test because it is simple enough to instrument and unsafe enough to require strict boundaries:
- capOS can run teleop, camera perception, local planning, logging, and a geofenced mission controller;
- PWM/ESC steering and throttle should be mediated by a microcontroller or device service with a watchdog;
- command caps should carry speed, steering, freshness deadline, and mode;
- stale or revoked command authority should force neutral throttle and safe steering.
Drones
capOS should be a companion computer first:
- consume MAVLink/uORB-like telemetry through a typed autopilot bridge;
- run perception, mapping, object tracking, mission planning, and logging;
- send high-level setpoints only through a
FlightSetpointcap with mode, envelope, rate, and geofence limits; - never bypass the flight controller’s arming, failsafe, and stabilization logic in early stages.
Self-Driving Cars
capOS is a research host for autonomous-driving software, not a near-term safety-certified vehicle OS:
- isolate perception, localization, prediction, planning, map, and vehicle interface modules;
- make every actuator-affecting path explicit and auditable;
- use a safety gateway that clamps commands to an envelope and can degrade to minimal-risk behavior;
- keep independent safety monitors and hardware controls outside the model or planner process.
The useful capOS contribution is not “the LLM drives the car.” It is a capability and timing architecture that prevents perception, model, network, UI, or update components from accidentally gaining actuator or safety authority.
capOS Design Consequences
Robot Brain Means Authority Router, Not Monolith
The robot brain should be a composed service graph:
flowchart LR
Sensors[Sensor services] --> Perception[Perception]
Perception --> World[World model]
World --> Planner[Planner / behavior]
Planner --> Control[Controller island]
Control --> Actuators[Actuator gateway]
Safety[Safety monitor] --> Control
Safety --> Actuators
Operator[Operator UI / teleop] --> Planner
Audit[Audit / telemetry] --- Sensors
Audit --- Control
The security boundary is the capability graph. The timing boundary is the admitted realtime island. Both must be visible in documentation and telemetry.
Control-Loop Admission
A future ControlLoopManager should admit a loop only after it has:
- fixed period and deadline;
- declared worst-case execution budget;
- preallocated command/state buffers;
- reserved scheduling context;
- pinned or registered memory for device I/O;
- bounded input data age policy;
- actuator command clamp policy;
- overrun policy;
- watchdog/freshness behavior;
- audit/telemetry route outside the realtime path.
No Cap’n Proto allocation, service discovery, logging, credential lookup, model inference, network fetch, filesystem access, or policy prompt belongs in the admitted loop.
Capability Shapes
Likely future interfaces:
interface SensorStream {
describe @0 () -> (info :SensorInfo);
openRing @1 (config :StreamConfig) -> (ring :MemoryObject);
readStatus @2 () -> (status :StreamStatus);
}
interface ActuatorCommand {
describe @0 () -> (info :ActuatorInfo);
submit @1 (command :CommandFrame) -> (accepted :Bool);
neutral @2 (reason :Text) -> ();
}
interface ControlLoop {
describe @0 () -> (info :LoopInfo);
start @1 () -> ();
stop @2 (reason :Text) -> ();
readTelemetry @3 () -> (telemetry :LoopTelemetry);
}
interface SafetyState {
read @0 () -> (state :SafetySnapshot);
subscribe @1 () -> (events :SensorStream);
}
CommandFrame should include sequence, monotonic timestamp, deadline,
coordinate frame, mode, limit profile, and typed payload. A stale command is a
failed command.
Robot Description And Frames
capOS needs a typed robot description model rather than an ambient URDF file path. A robot description service should expose:
- kinematic tree;
- named frames and transforms;
- joint limits and command interfaces;
- sensors, actuators, and calibration;
- safety envelopes and operating modes;
- firmware/controller identity;
- simulation twins.
The description is read-only to most services. Mutating calibration or limits requires a separate authority and should produce audit records.
ROS 2 Compatibility
capOS should not try to replace the robotics ecosystem in the first pass. It should host compatibility bridges:
- ROS 2 graph bridge for topics/actions/services;
- micro-ROS/MCU bridge for embedded controllers;
- MAVLink bridge for autopilots;
- OPC UA bridge for manufacturing cells;
- simulation bridge for Gazebo/Isaac/Webots-like tools.
Each bridge receives only the caps it needs. A ROS bridge should not become an ambient authority tunnel from the ROS graph to every actuator.
Models And Agents
Language or vision-language models can help with:
- operator command interpretation;
- diagnostics and log summarization;
- task planning under human approval;
- visual inspection;
- code/config generation in simulation.
They must not hold actuator caps. Model output is untrusted. A planner or agent may propose a mission step, but a trusted runner must validate it against tool descriptors, safety state, geofence, mode, and command limits before any actuator-affecting capability is invoked.
Safety And Certification Gap
capOS currently has no certification story for:
- IEC 61508 / ISO 13849 / ISO 10218 / ISO 26262 style evidence;
- bounded interrupt latency on target hardware;
- WCET for kernel paths;
- IOMMU-backed driver isolation for physical devices;
- independent safety monitor authority;
- safe boot/update rollback for robots;
- fault-injection and hardware-in-loop test evidence.
Therefore the honest position is:
- research/simulation: capOS can be the main robot OS;
- hobby mobile robot: capOS can be the SBC brain with MCU safety;
- industrial cell: capOS can supervise and integrate, not replace safety PLCs;
- self-driving car: capOS can host research autonomy modules behind a safety gateway, not claim road-safety control.
Implementation Path
- Simulation-only robot graph: fake sensors, fake actuators, behavior service, and audit, all over typed capabilities.
- Differential-drive demo:
BaseDriveMCU bridge, encoder/IMU sensor stream, watchdog, stale-command neutral behavior, and QEMU/host simulation proof. - ROS 2/Nav2 bridge: import/export selected topics/actions with explicit caps and no broad graph authority.
- Control-loop telemetry: deadline, data age, overrun, stale command, clamp, watchdog reset, and safety-state event counters.
- Realtime island prototype: fixed-period local controller over preallocated rings once scheduling contexts and notification objects exist.
- Device authority integration: fieldbus/CAN/EtherCAT/serial through
DeviceMmio,DMAPool,Interrupt, or userspace driver caps after the DMA isolation gate. - Manufacturing gateway: OPC UA/PLC bridge exposing cell status, job dispatch, alarms, and robot-program selection as typed caps.
- Autonomy stack: perception/planning/control services with explicit timing and safety envelopes.
Open Questions
- Should capOS define a native robot-description schema or import URDF/SDF into a normalized capability service?
- Should the first physical demo target a differential-drive base, RC car, or manipulator simulator?
- What is the smallest useful scheduling-context API for a 50-100 Hz mobile robot controller?
- How should transform-tree state be represented: service, shared snapshot ring, or both?
- Where should command-limit enforcement live: actuator gateway, controller, safety monitor, or all three with different authority?
- Can the same media graph ring shape support camera/lidar frames and audio, or does robot perception need a distinct sensor-stream ABI?
References
- ROS 2, Understanding real-time programming
- ROS 2 Design, Introduction to Real-time Systems
- ROS 2 Real-Time Working Group, documentation
- ROS 2 Control, Controller Manager
- micro-ROS, Execution Management
- Casini et al., A Survey of Real-Time Support, Analysis, and Advancements in ROS 2
- Hasan et al., ReDAG-RT
- Giacomossi et al., Scheduling Analysis of UAV Flight Control Workloads using Raspberry Pi 5 Using PREEMPT_RT Linux
- seL4, MCS Extensions
- Trustworthy Systems, Timing guarantees for mixed-criticality systems on seL4
- Linux kernel docs, Real-time preemption
- Xenomai, Overview
- Orocos, Project documentation
- Nav2, Navigation System
- PX4, ROS 2
- ArduPilot, Companion Computers
- Autoware Foundation, Autoware Overview
- OPC Foundation, UA Companion Specifications